Accurate Estimation of Marker-Associated Genetic Variance and Heritability in Complex Trait Analyses

The emergence of high-throughput, genome-scale approaches for identifying and genotyping DNA variants has been a catalyst for the development of increasingly sophisticated whole-genome association and genomic prediction approaches, which together have revolutionized the study of complex traits in human, animal, and plant populations. These approaches have uncovered a broad spectrum of genetic complexity across traits and organisms, from a small number of detectable loci to an unknown number of undetectable loci. The heritable variation observed in a population is often partly caused by the segregation of one or more large-effect (statistically detectable) loci. Our study focused on the accurate estimation of the proportion of the genetic variance explained by such loci (p), a parameter estimated to quantify and predict the importance of causative loci or markers in linkage disequilibrium with causative loci. Here, we show that marker-associated genetic variances are systematically overestimated by standard statistical methods. The upward bias is purely mathematical in nature, unrelated to selection bias, and caused by the inequality between the genetic variance among progeny and sums of partitioned marker-associated genetic variances. We discovered a straightforward mathematical correction factor (kM) that depends only on degrees of freedom and the number of entries, is constant for a given experiment design, expands to higher-order genetic models in a predictable pattern, and yields bias-corrected estimates of marker-associated genetic variance and heritability. 1


Introduction 1
The introduction of chromosome-scale approaches for identify-22 markers in linkage disequilibrium (LD) with statistically signifi-23 cant QTL could be combined with phenotypes in an optimum 24 index for artificial selection in animals and plants. They showed 25 that the index weights and predicted genetic gains (∆G) from 26 marker-assisted index selection were functions of narrow-sense 27 heritability (h 2 = σ 2 A /σ 2 P ) and the proportion of the additive 28 genetic variance explained by marker loci in LD with the under-29 lying QTL (p A = σ 2 M A /σ 2 A ), where σ 2 A is the additive genetic vari-30 ance, σ 2 M A is the proportion of σ 2 A associated with statistically sig-31 nificant marker loci incorporated into the marker-assisted selec-32 tion (MAS) index, and σ 2 P is the phenotypic variance (Neimann-33 Sorensen and Robertson 1961;Lande and Thompson 1990;Viss-34 cher et al. 2006). More generally, p = σ 2 M /σ 2 G , where σ 2 G is the The linear mixed models needed for ANOVA estimation of σ 2 G 65 and σ 2 M are developed here. Consider an experiment where n G 66 entries (e.g., individuals or families) are phenotyped for a nor-67 mally distributed quantitative trait, genotyped for one or more 68 markers, and tested in a balanced completely randomized exper-69 iment design with r G replications/entry, n M marker genotypes/ 70 locus, and r M replications/marker genotype. Here, we study 71 balanced designs with a single explanatory marker locus (M1) 72 for simplicity. Solutions for more complex genetic models and 73 unbalanced data are shown in Appendices A, B, and C.

74
The LMM for estimating σ 2 G is: where y jk is the jk th phenotypic observation, µ is the population 76 mean, G j is the random effect of the j th entry, jk is the random 77 effect of the jk th residual, G j ∼ N(0, σ 2 G ), jk ∼ N(0, σ 2 ), j = 1, 2, 78 ..., n G , and k = 1, 2, ..., r G . The is the classic LMM for estimating 79 σ 2 G using ANOVA or REML methods (Table 1). 80 Suppose entries were genotyped for a single marker locus 81 (M1) in LD with a QTL affecting the trait. The LMM for estimat-82 ing the genetic variance associated with M1 (σ 2 M1 ) is: where y ijk is the ijk th phenotypic observation, M1 i is the random 84 effect of the i th marker genotype at locus M1, G : M1 ij is the 85 random effect of the j th entry nested in the i th marker genotype, 86 ijk is the random effect of the ijk th residual, i = 1, 2, 3, M1 i ∼ 87 N(0, σ 2 M1 ), G : M1 ij ∼ N(0, σ 2 G:M1 ), and ijk ∼ N(0, σ 2 ). This 88 is the classic LMM for estimating σ 2 M1 using ANOVA or REML 89 methods (Table 1).

91
Statistical analyses of LMMs (1) and (2) are needed to obtain 92 estimates of σ 2 M , σ 2 G , p, and H 2 M . ANOVA estimators of the 93 between-entry (σ 2 G ) and residual (σ 2 ) variance components for 94 LMM (1) with balanced data are: and where the mean sum of squares (MSs) are defined in Table ( . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 9, 2020. . https://doi.org/10.1101/2020.04.08.032672 doi: bioRxiv preprint σ 2 = MS where n G:M1 is the number of entries nested in M1 and the MSs 1 are defined in Table 1, E(σ 2 M1 ) = σ 2 M1 , and E(σ 2 G:M1 ) = σ 2 G:M1 .

2
The residuals in LMMs (1) and (2) are identical when the data are 3 balanced. Hence, for a single marker locus, the classic estimator 4 of marker-associated genetic variance is: and the estimator of broad-sense marker heritability on an entry-6 mean basis is: whereσ 2 P is the phenotypic variance on an entry-mean basis.
whereȳ ij• is the phenotypic mean for the ij th entry, µ is the 17 population mean, M1 i is the random effect of the i th marker 18 genotype, var(M1 i ) = σ 2 M1 , G : M1 ij is the random effect of 19 entries nested in M1, var(G : M1 ij ) = σ 2 G:M1 ,¯ ij• is the residual 20 error, and var(¯ ij• ) = r −1 G σ 2 .

21
With the entry-mean vector (ȳ) as input, the LMM equivalent 22 to (9) is: where 1 n G is an n G -vector of ones, µ is the population mean, g is 24 an n G -vector of entry effects, g ∼ N(0, G),¯ is n G -vector of resid- where the Z c are design matrices and u c are independent random 33 effect vectors for c genetic factors in the model, e.g., marker loci, whereȳ is a vector of adjusted entry-level means. AMV and 42 ASV estimators of the variance components (g c ) in LMMs (10 The true variance for the residual effect in LMM (1) 56 is: where¯ i is the i th element of¯ , i = 1, 2, 3, ..., n, and¯ • is the 58 mean of¯ . We note that E(s 2 ) = E(σ 2 ) = σ 2 . The true variance 59 for the effect of entries nested in M1 in LMM (2) is: ,q i is the effect of the i th entry nested in 61 marker locus 1, i = 1, 2, 3, ..., n), andq • is the mean ofq across 62 entries. We note that E(s 2 G:M1 ) = E(σ 2 G:M1 ) = σ 2 G:M1 . Finally,

63
The true variance for the effect of a single marker locus (M1) in 64 LMM (2) is: where m = Z M1 u M1 , m i is the effect of the i th genotype for 66 marker locus 1, i = 1, 2, or 3, andm is the mean of m across 67 entries. We demonstrate in the results that E(s 2 This inequality is at the heart of the problem we are study-69 ing and drives the overestimation of p and H 2 M .

70
The bias of the AMV estimator of σ 2 M1 is: Similarly, the bias of the ASV estimator 73 is: whereθ ASV M1 is the ASV estimator of σ 2 M and E[θ ASV M1 ] is the ex-75 pected value ofθ ASV M1 .

76
As shown in the results, the bias-correction factor (k M ) is a 77 constant determined by the independent variable structure. The 78 relative bias, which is expressed as a proportion of the true value 79 (s 2 M1 ), is also constant for a given true value. The relative bias of and the relative bias of the ASV estimator is:

M1
The biases and relative biases are determined through computer 2 simulations.
3 Simulation Experiments 4 We used simulation to estimate the accuracy of AMV and ASV 5 estimators of p and H 2 M . Plot-level phenotypic observations 6 (y ijk ) for LMMs (1) and (2) where B h is the h th effect of the B locus, P i is the i th effect of the P 54 locus, HYP j is the j th effect of the HYP locus, G k : (B × P × HYP) 55 is the k th effect of entries nested in B × P × HYP marker loci, 56 and hijkl is the hijkl th residual effect. The data for RILs were bal-57 anced, whereas the data for marker genotypes were slightly

95
Through genetic analyses of quantitative phenotypes in plant 96 populations, we discovered that the sum of REML estimates 97 of σ 2 M and σ 2 G:M from LMM (2) were greater than the REML 98 estimate of σ 2 G from LMM (1) (Table 4). Intuitively, we expected 99 σ 2 M andσ 2 G:M to sum toσ 2 G ; however, our analyses show that 100 estimates of variance components from the individual LMMs 101 are not corrected for differences in the number of observations 102 for different factors in the models. While the estimation of p = 103 σ 2 M /σ 2 G seems straightforward, we found that this parameter is 104 4 Feldmann et al.
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 9, 2020. . https://doi.org/10.1101/2020.04.08.032672 doi: bioRxiv preprint systematically overestimated using standard ANOVA methods. We discovered and show below that ANOVA estimators of 7 σ 2 G from LMM (1) are equal to: (1) plex genetic models and unbalanced data are shown in Table 1 22 and Appendices A, B, and C. From LMM (1) and equation (3), 23 the ANOVA estimator of the between entry variance component 24 is: where SS G and SS are sums of squares for the between entry 26 and residual effect in LMM (1) and SS G = SS M1 + SS G:M1 (Ta-27 ble 1). After substituting SS M1 + SS G:M1 for SS G in (14), we 28 obtained: From (15), the sum of ANOVA estimators of σ 2 M1 and σ 2 G:M1 from 30 LMM (1) is greater than the ANOVA estimator of σ 2 G from LMM

31
(2): The bias-correction coefficients (k M ) range from zero to one 34 and are mathematical constants for a particular experiment de-35 sign and data structure (Tables 1-4). For a single marker locus 36 (M1), r M1 = n G /n M1 ; hence, r M1 < n G and 0 < k M1 < 1 (Table   37 1). Similar inequalities exist for the bias correction coefficients 38 (k M ) in more complicated genetic models and differ for different 39 levels of effects in the genetic model (Table 3;  and 2) and the original phenotypic observations (y ijk ) as input. 61 The AMV defintion of the total variance among observations for 62 LMM (9) is: From LMM (9), the AMV estimator of of the variance ex-72 plained by M1 is: where Z u M1 = I n M1 ⊗ 1 n G:M1 and u M1 is a vector of random ef-74 fects for M1. We note that E(σ 2 estimator of of the variance explained by G : M1 is: Genetic Variance Estimation Bias 5 . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 9, 2020. . https://doi.org/10.1101/2020.04.08.032672 doi: bioRxiv preprint where u G:M1 is a vector of random entry nested in M1 effects. Hence, from (14) and (15), the sum of AMV estimators of σ 2 M1 2 and σ 2 G:M1 is greater than the ANOVA estimator of σ 2 G : The average semi-variance or average variance of differences 5 among observations leads to a definition of the total variance 6 that provides a natural way to account for heterogeneity of vari- structure in a generalized LMM and allows for missing and un-10 balanced data. The ASV measure of total variance is half the 11 average total pairwise variance of a difference between entries: and J n G is an n G × n G unit matrix (Piepho 2019 The ASV measure of total variance 20 can be decomposed into independent sources of variance (e.g.,

21
genetic and residual) according to LMM (12): plained by the c th genetic factor (u c ), genetic factors are marker 25 loci, interactions between marker loci, and entries nested in 26 marker loci, and θ ASV = (n G − 1) −1 tr(RP n G ) is the residual 27 variance.

28
Equation (17) can be decomposed into the expected values of 29 the sample variances of the random effects: Hence, the variance explained by the c th 33 genetic factor is: where c indexes marker loci, interactions between marker loci, 35 and entries nested in marker loci.

36
For a single marker locus M1, θ ASV M1 = E(s 2 M1 ). The ASV 37 estimator of the variance explained by M1 is: in Tables (1-3), k M varies across genetic models and experiment 43 designs.

44
The ASV estimator of the variance explained by G : M1 is: From (18), the coefficient for bias-correcting AMV estimates of 46 σ 2 M1 is: Note that this definition of k M1 is equivalent to the definitions 48 in Table 3 with r G factored out. Hence, from (14) and (15), the 49 sum of ASV estimators is equal to the ANOVA estimator ofσ 2 G : Thus, in contrast to (7) and (8), the bias-corrected estimator 51 of p for a single marker locus (M1) is: Similarly, the bias-corrected estimator of H 2 M for a single marker 53 locus is: whereσ 2 G +σ 2 /r G =σ 2 P is the phenotypic variance on an entry-55 mean basis (Lynch and Walsh 1998).
Feldmann et al.
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 9, 2020. . https://doi.org/10.1101/2020.04.08.032672 doi: bioRxiv preprint  Table 2). As predicted by (15) and (18)  The bias was greater for unbalanced than for balanced data 17 (Fig. 1). As shown for an F 2 population segregating 1 AA : 2    (Table 4).  (Table 4). The 119 two-fold difference between ASV estimates of H 2 M for the two 120 Genetic Variance Estimation Bias 7 . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 9, 2020. . https://doi.org/10.1101/2020.04.08.032672 doi: bioRxiv preprint SNP markers in the strawberry example illustrate the challenge 1 with accurately estimating the proportion of the heritable vari-

41
We suspect that biases similar to those described for marker her-42 itability could be hidden in the estimators of these parameters, 43 a problem that warrants further study. The approach we de- The authors declare no competing interests.
where y hijk is the hijk-th phenotypic observation, µ is the pop- is the residual effect with var( hijk ) = σ 2 .

26
The entry-mean or ASV model for the analysis of two loci is: whereȳ hij• is the entry-mean,¯ hij• is the residual with 28 var(¯ hij• ) = r −1 G σ 2 , and the other variables are as defined for 29 (20).

37
Finally, the variance explained by the random effect of entries 38 nested in marker loci (u G:M1×M2 ) is: From (21), the coefficient for bias-correcting AMV estimates 40 of σ 2 M1 (the genetic variance explained by M1) is: Note that this definition of k M1 is equivalent to the definitions 42 in Table 3 with r G factored out. Similarly, from (22), the coeffi-43 cient for bias-correcting AMV estimates of σ 2 M1×M2 (the genetic 44 variance explained by the interaction between M1 and M2) is: This definition of k M1×M2 is equivalent to the definitions in Table   46 3 with r G factored out.

47
The ASV definition of the total genetic variance is: The ANOVA estimators of the variance components for LMM

49
(1) with balanced data are previously defined in (3) and (4) The copyright holder for this preprint (which this version posted April 9, 2020. . https://doi.org/10.1101/2020.04.08.032672 doi: bioRxiv preprint When the entries are adjusted means (BLUEs), the ASV defi-1 nition of marker heritability is: Substituting the appropriate values for M2 into (28)  anced data. As before, the phenotypic observations are entry-8 means (ȳ hi• ) and the LMM for the entry-mean analysis is (9). In 9 vector notation, (9) can be written as: where P = I n G − n −1 G J n G , Z u M1 = ⊕ h 1 n G: Hence, the general form for k M1 is:

22
Variance for Two Loci with Unbalanced Data 23 ASV estimators for an experiment with two loci are developed 24 here for unbalanced data. As before, the phenotypic observa-25 tions are entry-means (ȳ hij• ) and the LMM for the entry-mean 26 analysis is (20). In vector notation, (20) can be written as: of the genetic variance explained by the marker locus is: where P = I n G − n −1 G J n G , Z u M1 = (I n M1 ⊗ 1 n M1 ) ⊗ 1 n G:M1×M2 . Sub-36 stituting the appropriate values for M2 into (29) yields θ ASV M2 . 37 Similarly, the variance explained by the interaction between 38 marker loci 1 and 2 (u M1×M2 ) using the ASV approach is: where Z u M1×M2 = I n M1×M2 ⊗ 1 n G:M1×M2 . . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 9, 2020. . https://doi.org/10.1101/2020.04.08.032672 doi: bioRxiv preprint Similarly, from (30), the coefficient for bias-correcting AMV 1 estimates of σ 2 M1×M2 (the genetic variance explained by the in-2 teraction between M1 and M2) is: here for unbalanced data. The phenotypic observations are entry-7 means (ȳ hij• ). In vector notation, the LMM for the entry-mean 8 analysis is: where n G:M1 h ×M2 i ×M3 j is the number of entries nested in the 10 hij th marker genotype, n G = ∑ hij n G:M1 h ×M2 i ×M3 j , and observa- where P = I n G − n −1 G J n G , Z u M1 = (I n M1 ⊗ 1 n M1 ) ⊗ 1 n G:M1×M2 . Sub-17 stituting the appropriate values for M2 into (31) yields θ ASV M2 .

24
Finally, the variance explained by the random effect of entries 25 nested in marker loci (u G:M1×M2×M3 ) is: From (31), the coefficient for bias-correcting AMV estimates 27 of σ 2 M1 (the genetic variance explained by M1) is: Similarly, from (32), the coefficient for bias-correcting AMV 29 estimates of σ 2 M1×M2 (the genetic variance explained by the in-30 teraction between M1 and M2) is: (33), the coefficient for bias-correcting AMV estimates 32 of σ 2 M1×M2×M3 (the genetic variance explained by the three-way 33 marker loci interaction) is: Table 2 Simulation Study Designs and Variables. One thousand samples were simulated for each of 21 study designs with different marker heritabilities (H 2 M ), n M = 3 genotypes/marker locus, one to three marker loci (m), r M replicates/marker genotype, n G entries, r G replicates/entry, and n total observations. The number of replications/marker genotype for study design 4 was equivalent to the expected number for the segregation of a co-dominant DNA marker in an F 2 population (1 AA : 2 Aa : 1 aa). The r M , n G , and r G for study designs 5 and 6 varied because 10% and 33% of the data were randomly deleted (treated as missing).  1 Accuracy of AMV and ASV Estimates of Marker Heritability. AMV and ASV estimates of H 2 M are shown for 1,000 segregating populations simulated for different numbers of entries (n G individuals or families), five replications/entry (r G = 5), true marker heritabilities ranging from 0 to 1, and one to three marker loci with three genotypes/marker locus (n M1 = 3). True marker heritabilities are shown on the x-axis. Uncorrected and bias-corrected estimates of H 2 M are shown on the y-axis, whereĤ 2 M are the uncorrected AMV estimates (red highlighted observations), andĤ 2 M * are the bias-corrected ASV estimates (blue highlighted observations). AMV and ASV estimates of H 2 M from simulations are shown for: (A) one locus with balanced data for n G = 540 entries (study design 1); (B) two marker loci (M1, M2, and M1 × M2) with balanced data for n G = 540 (study design 2); (C) three marker loci (M1, M2, M3, M1 × M2, M1 × M3, M2 × M3, and M1 × M2 × M3) with balanced data for n G = 540 (study design 3); (D) an F 2 population segregating 1:2:1 for one marker locus with r M = 675 for both homozygotes, r M = 1, 350 for the heterozygote, and n G = 540 (study design 4); (E) one locus for n G ≤ 540 entries where 10% of the phenotypic observations are randomly missing (the data are unbalanced) (study design 5); and (F) one locus for n G ≤ 540 entries where 33% of the phenotypic observations are randomly missing (the data are unbalanced) (study design 6).

Genetic Variance Estimation Bias 15
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 9, 2020. . Table 4 Uncorrected and bias-corrected estimates of marker-associated genetic variances (σ 2 M ) and heritabilities (H 2 M ) for three marker loci (B, P, and HYP) segregating in a sunflower recombinant inbred line (RIL) population (n G = 146) with balanced data (Tang et al. 2006) and two marker loci (AX396 & AX493) segregating in a strawberry genome-wide association study (GWAS) population (n G = 565) with unbalanced data (Pincot et al. 2018). The marker loci in the sunflower study were associated with QTL affecting seed oil content. The genetic variances associated with those loci were estimating using LMM (13). The marker loci in the strawberry study were associated with a locus (Fw1) affecting resistance to Fusarium wilt. The genetic variances associated with those loci were estimated using LMM (2). The coefficients (k M ) for bias-correcting AMV estimates of σ 2 M were found as described in Table 3 for balanced data and Appendix B for unbalanced data.

Uncorrected
Bias-Corrected . CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 9, 2020. . https://doi.org/10.1101/2020.04.08.032672 doi: bioRxiv preprint Figure 3 Effect of n G on the Bias of AMV and ASV Estimates of σ 2 M1 . Phenotypic observations were simulated for 1,000 populations segregating for a single marker locus with three genotypes (n M1 = 3), five replications/entry (n M1 = 5), and n G = 450, 900, 1,800, 3,600, or 7,200 entries/population (study designs 12-16). The marker locus was assumed to be in complete linkage disequilibrium with a single QTL with marker heritabilities (H 2 M1 ) equat to 0.50. (A) Distribution of the relative biases of AMV estimates of σ 2 M1 for different n G . The relative bias (RB[θ AMV M ] = 0.499) was identical across the variables tested. (B) Distribution of the relative biases of ASV estimates of σ 2 M1 for different n G . The relative bias (RB[θ ASV M ] = 0.00) was identical across the variables tested. 18 Feldmann et al.
. CC-BY 4.0 International license was not certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which this version posted April 9, 2020. . https://doi.org/10.1101/2020.04.08.032672 doi: bioRxiv preprint