Validation of the linear regression method to evaluate 1 population accuracy and bias of predictions for 2 non-linear models

Background: The linear regression method (LR) was proposed to estimate population bias and accuracy 17 of predictions, while addressing the limitations of commonly used cross-validation methods. The validity 18 and behavior of the LR method have been provided and studied for linear model predictions but not for 19 non-linear models. The objectives of this study were to 1) provide a mathematical proof for the validity of 20 the LR method when predictions are based on conditional mean, 2) explore the behavior of the LR method 21 in estimating bias and accuracy of predictions when the model ﬁtted is di ↵ erent from the true model, and 22 3) provide guidelines on how to appropriately partition the data into training and validation such that the 23 LR method can identify presence of bias and accuracy in predictions. 24 Results: We present a mathematical proof for the validity of the LR method to estimate bias and accuracy 25 of predictions based on the conditional mean, including for non-linear models. Using simulated data, we 26 show that the LR method can accurately detect bias and estimate accuracy of predictions when an incorrect 27 model is ﬁtted when the data is partitioned such that the values of relevant predictor variables di ↵ er in the 28 training and validation sets. But the LR method fails when the data are not partitioned in that manner. 29 Conclusions: The LR method was proven to be a valid method to evaluate the population bias and accuracy 30 of predictions based on the conditional mean, regardless of whether it is a linear or non-linear function of 31 the data. The ability of the LR method to detect bias and estimate accuracy of predictions when the model 32 ﬁtted is incorrect depends on how the data are partitioned. To appropriately test the predictive ability of a 33 model using the LR method, the values of the relevant predictor variables need to be di ↵ erent between the 34 training and validation sets. 35

is commonly evaluated with the statistic of predictivity, which is the correlation coe cient between the EBV 43 and phenotypes adjusted for fixed e↵ects of individuals in the validation set. Scaling predictivity by the 44 square root of heritability (h 2 ) provides an estimator for prediction accuracy of the EBV [10], defined as the 45 correlation between true and estimated breeding values. While accuracy estimated with CV has been widely 46 used to quantify the performance of genomic prediction models, pre-correcting phenotypes in the validation 47 set using estimates of fixed e↵ects obtained using the whole data set will overestimate the accuracy when 48 multiple levels of fixed e↵ects are present [11]. Additional limitations include that it can not be applied to 49 complex models (e.g., random regression models), indirect traits (e.g., unobserved latent traits), and traits 50 with low heritability (h 2 ) [11]. 51 To address these limitations of the CV methodology, Legarra and Reverter [11] proposed a linear regression 52 (LR) method to estimate the accuracy of genomic prediction. The LR method quantifies the population 53 accuracy and bias of predictions based on the comparison of EBV of individuals in the validation set estimated 54 using the training data set with the EBV of those same individuals estimated using the combined training 55 and validation sets. In the LR method literature, the training set is referred to as the partial data set (p) 56 and the combined training and validation data set is referred to as the whole data set (w). The LR method 57 was mathematically proven to provide unbiased estimates of the accuracy and bias of predictions for best 58 linear unbiased prediction (BLUP) by Legarra and Reverter [11] based on results from Reverter et al. [12]. 59 Macedo et al. [13] investigated the behavior and properties of the LR method by analyzing simulated data 60 with pedigree-based genetic models. They studied the LR estimators of population bias and accuracy of 61 predictions by using wrong values of h 2 in the analysis and by fitting wrong models, and claimed that "the 62 model, and that it works well for estimation of accuracy even when the model is not good". The validity 64 and performance of the LR method for a non-linear model was explored by Bermann et al. [14]. In their 65 study, they evaluated the performance of the LR method by fitting a threshold model to both simulation 66 and real data sets and concluded the LR method can be useful to estimate the directions of bias, dispersion, 67 and accuracy, though with di↵erent magnitudes. The original proof of the LR method [11] was based on the 68 setting where the whole data set had additional phenotype records relative to the partial data set. Belay et 69 al. [15] have recently shown that the LR method can also be applied to the setting where the whole data 70 set has additional genotypes (rather than phenotypes) relative to the partial data set. They used the LR 71 method to evaluate the bias and accuracy in single-step genomic predictions.

72
While the validity and performance of LR method has been explored using linear and non-linear models 73 in previous studies [13,14], a mathematical proof of its validity for non-linear methods of prediction has 74 not yet been presented. In addition, studies about the performance of the LR method when a model other 75 than the true model is fitted are still relatively scarce in the literature. The objectives of this study are to 76 1) present a mathematical proof of the validity of the LR method when predictions are based on conditional 77 mean, regardless of whether it is a linear or non-linear function of the data 2) investigate the ability of the 78 LR method to estimate the bias and accuracy of predictions when the fitted model di↵ers from that used to 79 generate the data, and 3) provide some guidelines on how to partition the data set such that the LR method 80 can detect bias and estimate accuracy of predictions when the incorrect model is fitted. Proof that Cov(û w ,û p ) = Var(û p ) 83 In the LR method, Legarra and Reverter [11] used var(x) to denote the variance of a random element, x, sampled from a single realization of the random vector (x). Here we will denote this variance by Var(x) = var(x). Let u denote the breeding value of a validation animal, andû p andû w denote the estimated breeding value of u obtained from partial data and whole data, respectively. Legarra and Reverter [11] proposed the LR method for BLUP by showing Cov(û w ,û p ) = Var(û p ) using the results from Reverter et al.
[12] and assumptions of Cov(u,û) = Var(û) and E(û p ) = E(û w ) = E(u). In the following, we prove the validity of the LR method for non-linear models by generalizing the proof of Cov(û w ,û p ) = Var(û p ) for prediction using the conditional mean, which may be non-linear. Let where y w , y p , and y r indicate a vector of phenotype records in the whole, partial, and validation data set, respectively. It is convenient to first show that E yr|yp (û w |y p ) =û p : Now, we write the Cov(û w ,û p ) as: where ✓ is the expected value ofû p . But the first term of this expectation can be shown to be null: because, as shown previously, E yr|yp (û w |y p ) =û p . Thus, the Cov(û w ,û p ) becomes: With the proof of Cov(û w ,û p ) = V ar(û p ), we showed the LR method holds for non-linear models. This proof 84 is similar in principle to that provided by Belay et al. [15], but we recognize that it is not limited to BLUP, 85 as invoked in that study, but is applicable to any method of prediction based on the conditional mean [16], 86 including for non-linear models.
where ✓ i = [Age115 i Shape i BW 65 i ] refers to three underlying latent variables for pig i of age at 115 kg, a shape parameter, and body weight at 65 days, and ✏ it is the residual. We simulated a heterogeneous residuals to mimic the real growth data for pigs using three di↵erent residuals across days 70 to 500 ( i.e., The three underlying latent variables ✓ i for individual i were considered correlated and modeled with a multivariate QTL e↵ects model. Using the QTL as markers, the simulated data were analyzed with two Bayesian Hierarchical models: 1) 96 the Gompertz model that was used for simulation, i.e. the true model, and 2) a quadratic growth model, 97 i.e. a wrong model. Variance components that were used to simulate the data were fitted into the true and 98 wrong models for analysis. The prediction performances of these two models were evaluated using the LR 99 method across 20 replicates. All analyses were performed in Julia [19].

100
Data analysis models 101 We analyzed the simulated data using the Bayesian Hierarchical Gompertz growth model (BHGGM) developed by Yu et al. [18], which integrates a Gompertz growth model, i.e. the true model, with a multi-trait marker e↵ects models. Following equation (1), the three underlying latent variables in the Gompertz growth model were assigned the following prior: where ⌃ e is the environmental variance-covariance matrix, which was assumed to have an inverse Wishart 102 prior, W 1 (S e , ⌫ e ). The prior for ✏ it had a null mean and age specific variances (as described above) to 103 allow fitting heterogeneous residuals. Flat priors were assigned to µ and CG i and the prior for ↵ j followed To fit the Bayesian Hierarchical quadratic growth model (BHQGM), i.e. the wrong model, we introduced a quadratic growth model for the non-linear function g(.) in equation (1): Following Legarra and Reverter [11], estimators of bias, inflation, and accuracy were calculated for EBV of 119 body weights of the individuals in the validation set using the LR method, as described below. 120 1 The LR estimator of population bias isˆ p =û p û w , which is the di↵erence between the mean EBV has been referred to as estimator of dispersion by Legarra and Reverter [11] and inflation by Belay  individual j and n trn is the total number of individuals in the training set.

135
In addition to these LR estimators of bias, inflation, and accuracy for body weights predictions, we also 136 calculated the "true" estimators of these parameters using the simulated values of u in place ofû w for each 137 day. Note that these "true" estimators can only be computed in a simulation study, and they are used to 138 study the performance of the LR estimators, which can be computed in real data analyses.

139
For the true and estimated bias statistics, we calculated their means for each day of age across all animals 140 in the validation set. These bias statistics were averaged across days within each replicate to test whether 141 their mean was significantly di↵erent from 0 using a t test. Similarly, true and estimated regression coe cient 142 statistics were averaged across days within each replicate to test whether their mean was significantly di↵erent 143 from 1 using a t test. 144

145
To better visualize the prediction performances across the fitted models and partitioning scenarios, we 146 randomly picked one individual from the validation set and displayed its simulated data against its predictions and cov(û w ,û p ) when fitting the true and wrong model for the three data partitioning scenarios (Table 1).

178
There was a non-significant di↵erence (P 0.74) between cov(u,û p ) and cov(û w ,û p ) when the true model 179 was fitted, but a significant di↵erence (P  0.004) was observed for each scenario when the wrong model 180 was fitted.

182
Based on the initial idea from Reverter et al.
[12], Legarra and Reverter [11] proposed the LR method to 183 quantify the prediction bias and accuracy of EBV at the population level. They proved the validity of LR 184 method for EBV from a linear model using standard BLUP theory and applied the LR method to a real 185 cattle data set [11]. While the LR method has also been applied to EBV from a threshold model [14], a 186 mathematical proof of its validity for a non-linear method of prediction has not been provided. In this study, 187 we presented a mathematical proof for the validity of the LR method for predictions based on the conditional 188 mean [16]. In our proof, we assume the partial data contains a subset of the phenotypes in the whole data. 189 Belay et al. [15] showed the LR method is also applicable to BLUP when the partial data contains a subset 190 of the genotypes in the whole data. The proof presented in the current paper is similar in principle to that 191 provided by Belay et al. [15]. Taken together, these two proofs show that the LR method is applicable When the wrong model (BHQGM) was fitted and the data were partitioned by animal, the true estimate of 203 bias was significant, but the LR estimate was not able to identify this bias ( Figure 3). Macedo et al. [13] also 204 observed that for a certain misspecification of the model, the LR method was not able to correctly detect 205 and estimate a bias. Figure 5 shows that when the wrong model was used, the true estimate of regression 206 ofû w onû p had a significant deviation from 1, and in this case the estimate of the regression coe cient 207 based on the LR method was also significantly di↵erent from 1, although di↵ering in magnitude from the 208 true estimate of regression coe cient. This is also consistent with the results observed by Macedo et al. [13].

209
The pattern of EBV against age were presented in Figure 2 (left column) for a randomly selected individual.

210
When the wrong model was fitted, the EBV from the partial and whole data sets deviated more from the 211 true BV than EBV from the true model did. However, even when the wrong model was used, the EBV 212 from partial and whole data sets were very similar. This explains why the estimate of bias based on the LR 213 method was not significant when the wrong model was used, although there was a true bias. Figure 7 shows 214 that, with the wrong model, the accuracy estimated by the LR method was slightly higher than the true 215 estimate of accuracy, which is consistent with Macedo et al. [13].

216
When the data were partitioned by age, the LR method was able to correctly detect a bias and inflation 217 when the wrong model was used (Figures 4 and 6). Figure 2 (middle column) shows the EBV of a randomly 218 selected individual when the data were partitioned by animal. When the wrong model was fitted, the EBV 219 estimated from partial and whole data sets both deviated from the true BV but the EBV based on the partial 220 set was quite di↵erent from that estimated from the whole set. This illustrates the significant bias that was 221 detected by the LR method for this scenario. Results for the partitioning by animal & age (right column in 222 Figure 2) were similar to those when partitioning by age. As in Macedo et al. [13], even with the use of a 223 wrong model, the accuracy estimated by the LR method was quite close to the true accuracy ( Figure 8).

224
The inconsistent bias estimates obtained with the LR method for di↵erent data partitioning strategies 225 suggests that the LR method captures di↵erent aspects of the model for di↵erent data partitions. When 226 the data were partitioned by animal, both the partial and whole data sets included phenotypes over the 227 range from 70 to 500 days. Thus the fit of the growth curve from the partial and whole data sets were 228 similar, even for the wrong model, although the fit might deviate from that using the true model. Fitting 229 the wrong growth model is only incorrect in the relationship between age and body weight within individual 230 but correctly models relationships between relatives. Thus to appropriately test the predictive ability of the 231 growth model using the LR method, we needed to predict the body weights of animals that are outside the 232 observed age range for animals in the training set. When the data are partitioned by age, the partial data set 233 has only body weights measured at ages up to 300 days, whereas the validation data set has body weights 234 measured at ages up to 500 days. Thus when we predict the body weights of the individuals in the validation 235 set based on the fit of the growth model from the partial data set, we are testing the predictive ability of the 236 growth model. In this case, the LR method was able to correctly detect a bias when using the wrong model 237 (bottom right plots in Figure 4 and Figure 6). In real data analyses, repeated k-fold LR can be used to test 238 the significance of bias. Or if Bayesian method is employed for the LR analysis, the posterior probability of 239 bias can be computed from a single partitioning of the data.

240
In general, to properly test the predictive ability of a model with the LR method, we need to use the were partitioned by animal, the training (partial) and validation sets included phenotypes for animals with 257 age ranging from days 70 to 500, the same age range as training data was used for the validation data and, 258 therefore, the LR method failed. However, when the data were partitioned by age, the model was trained 259 using phenotypes with ages ranging from days 70 to 300 and it was tested by predicting body weights for 260 animals with age ranging from days 301 to 500. In this case, the LR method was able to detect a bias when 261 using the wrong model. This was even true when the same genotypes were used in both the training and 262 validation sets, because to check if the model used for predicting longitudinal body weights is correct, the 263 relevant predictor variable is age.

265
In the present study, we provide a mathematical proof for the validity of applying the LR method to 266 predictions based on the conditional mean, regardless of whether it is a linear or non-linear function of data.

267
Using simulated data, we observed that the LR method was able to detect bias in predictions when an The authors declare that they have no competing interests.

288
Tables 289 Table 1: Significance (p-values) of tests for the di↵erence between Cov(u,û p ) and Cov(û w ,û p ) for the three data partitioning scenarios and the two models.