Factors Affecting Response to Recurrent Genomic Selection in Soybeans

Herein we report the impacts of applying five selection methods across 40 cycles of recurrent selection and identify interactions among factors that affect genetic responses in sets of simulated families of recombinant inbred lines derived from 21 homozygous soybean lines. Our use of recurrence equation to model response from recurrent selection allowed us to estimate the half-lives, asymptotic limits to recurrent selection for purposes of assessing the rates of response and future genetic potential of populations under selection. The simulated factors include selection methods, training sets, and selection intensity that are under the control of the plant breeder as well as genetic architecture and heritability. A factorial design to examine and analyze the main and interaction effects of these factors showed that both the rates of genetic improvement in the early cycles and limits to genetic improvement in the later cycles are significantly affected by interactions among all factors. Some consistent trends are that genomic selection methods provide greater initial rates of genetic improvement (per cycle) than phenotypic selection, but phenotypic selection provides the greatest long term responses in these closed genotypic systems. Model updating with training sets consisting of data from prior cycles of selection significantly improved prediction accuracy and genetic response with three parametric genomic prediction models. Ridge Regression, if updated with training sets consisting of data from prior cycles, achieved better rates of response than BayesB and Bayes LASSO models. A Support Vector Machine method, with a radial basis kernel, had the worst estimated prediction accuracies and the least long term genetic response. Application of genomic selection in a closed breeding population of a self-pollinated crop such as soybean will need to consider the impact of these factors on trade-offs between short term gains and conserving useful genetic diversity in the context of the goals for the breeding program.


Introduction
founders represent a realized population structure adapted to maturity zones II, III and IV that is  Table 2. 206 For purposes of this manuscript we use the phrase 'model updating' to refer to retraining GP 207 models with up to 14 previous cycles of training data ( Figure S3). A preliminary analysis of TS 208 on genotypic values and prediction accuracies was conducted using RR-REML models trained 209 with data from the current cycle as well as 3, 5, 8, 10, 12, and 14 prior cycles. The results were 210 compared with responses from the RR-REML model updated with TS's comprised of data from 211 all prior cycles and a RR-REML model with no updating. Training sets for each cycle were 212 obtained by randomly sampling 1600 RILs from the set of 2000 simulated RILs in each cycle. 213 The most accurate prediction and maximum genetic response was obtained with training data 214 that is cumulatively added every cycle ( Figure S4 and S5). The results indicate that 3-5 prior 215 cycles of training data did not significantly improve prediction accuracies and responses relative 216 to models that were not updated. Also, the standardized genotypic values and prediction 217 accuracies, obtained using 10 to 14 prior cycles of data in the TS's, were not significantly 218 different than results based on TS's consisting of all prior cycles. Based on the results of this 219 preliminary study, we investigated responses to recurrent selection using TS's consisting of up to 220 14 prior cycles of selection as well as data from the current cycle. After the 14 th cycle, training 221 data consisted of 14 prior cycles of recurrent selection.

222
Modeled response to recurrent selection. The averaged genotypic value for each cycle, c, of 223 recurrent selection was modeled with a linear first order recurrence equation: Where c is a sequence of integers from 0 to 39 representing each cycle of recurrent selection 226 from cycle 1 to 40 and f0, f1 and g are constant functions of c. By rearranging the equation we 227 note that the response in cycle c+1 can be represented as An alternative representation of (4) for the situation of α ≠ 1 is

237
, where α is less than 1 for genotypic response to recurrent selection and y' represents the 238 asymptotic limit to selection (Goldberg 1958). To illustrate, values of the sequence of c=0 to 39, 239 with y0 = 0, for the range of α (0.6-0.9) and β (1.4-38) values are plotted in Figure 2. The curves 240 can be interpreted as response to selection as a function of the frequencies of alleles with additive 241 selective advantage, selection intensity, time and effective population size (Robertson 1960).

242
The parameters, yo, α, and β, were estimated with a non-linear mixed effects method 243 implemented in 'nlme' functions in the 'nlme' and 'nlshelper' packages (Pinheiro and Bates Since the limits of responses are asymptotic, the number of cycles before there is no longer 246 response to selection is referred to as the half-life (Robertson 1960;Dempfle 1974;Kang 1979;247 Cockerham & Burrows 1980;Kang and Namkoong 1980;Kang 1987;Kang and Nienstaedt 248 1987). From the first order recurrence equation, the half-life is estimated as ln (0.5)/ln (α), when 249 y0 is '0' and the asymptotic limit is estimated as y'.

250
Analyses of variance (ANOVA) of modeled response to recurrent selection.

314
More information on the analyses can be found in the R package 'SoyNAMPredictionMethods'.

315
Also, simulated data and code are available as part of the package (File S1 found at 316 http://gfspopgen.agron.iastate.edu/SoyNAMPredictionMethods_v2_2020.html). Supplemental 317 material including the R package can be found at https://figshare.com/s/f7d4e515e563beef550b 318 and http://gfspopgen.agron.iastate.edu/SoyNAMPredictionMethods_v2_2020.html describes 319 how to use the package. The complete process for fitting NLME models and ANOVA can be 320 found at http://gfspopgen.agron.iastate.edu/Vishnu_MS1/NLME_Models_Part_I.html .SoyNAM     (Table   346   S1). Also analyses of variances on subsets of 10 and 20 cycles of selection demonstrate that the 347 interactions were important from the earliest cycles (File S2: Table S1, S2). Further, the relative importance of factors on interaction effects were consistent (nQTL>SM>SI>H>TS) and significant in analyses of variance conducted on subsets of 10 and 20 selection cycles (Table S1).  (Table S1).

355
Relative to PS, the three parametric GP models provided greater initial rates of response, reduced  (Table S1). Selection 362 intensity is the third most important factor to affect the response metrics (Table S1). Consistent 363 with theory (Robertson 1960;Hill and Robertson 1968;Bulmer 1971Bulmer ,1976, greater SI's 364 (associated with retention of smaller numbers of RILs to initiate subsequent cycles, were 365 associated with more rapid response rates, shorter half-lives, faster loss of genetic variance and 366 significantly lower Rs values as the population approached it's limit to respond to selection.

367
The modeled responses were highly dependent on each of the factors, with nQTL showing the 368 greatest deviations from the average response (Table S1). From the modeled responses the half-369 life for a responsive population varied from 1.6 to 4.9 cycles for 40 QTL, whereas for 400 and 4289 QTL it ranged from 2.2 to 8.8 and 3 to 9.5 cycles respectively. The modeled responses 371 were highly correlated with the simulated responses (Pearson correlation coefficient: 0.94-0.97).

372
To illustrate the impact of nQTL, consider Rs values plotted across forty cycles of recurrent

384
As expected, responses to selection reflect declining genetic variances (Figure 9 and S17). The 385 loss of Sgv's across cycles is much faster with fewer simulated QTL than larger numbers of 386 QTL. Likewise the estimated prediction accuracies (Figure 10 and S18) approach zero as the 387 genotypic variance approaches zero. Average MSE for the GP models increase across cycles of 388 selection ( Figure 11 and S19) LD among markers approach zero as the genotypic variance To interpret the role of nQTL as a factor, it is important to recall that: 1) positive and negative 392 allelic effects were simulated to alternate sequentially at QTL (marker loci) that were distributed      If GP models are updated, the standardized genotypic variance (Sgv) decreases at a rate similar 467 to GP models that are not updated (Figure 9 and S17). There is no difference among GS methods 468 in terms of rate at which Sgv decreases. Also, model updating significantly improved estimated 469 prediction accuracies, rps, for all GP models except SVMRBF. Among RR-REML and Bayesian 470 GP models, model updating has a slightly larger impact on estimated accuracies and MSE using 471 RR-REML than with Bayesian GP models ( Figure 10, 11, S18 and S19). MSE were orders of 472 magnitude lesser for RR-REML than bayesian GP models with updates after the first 10-15 473 cycles of selection (Figure 11 and S19).

474
If models are updated using data from up to 14 prior cycles, the changes to genetic variance  reported herein is the first designed to reveal interactions among five factors that previously had 526 been shown to affect responses to selection. Our ability to detect and characterize interaction effects is enabled by use of a first order recurrence equation (eqn 4). Hopefully, our explanation 528 of how to implement recurrence equation models in available R packages will encourage others 529 to investigate recurrent genetic improvement designs. Even though our motivation for the use of 530 non-linear modeling in this study was restricted to the systematic investigation of significance of 531 variation in response due to the factors and the relative order of magnitude of interaction effects 532 of factors, non-linear mixed effects models have predictive power that is still relatively 533 unexploited in the study of limits due to recurrent selection. We plan to investigate the use of . We refer to this as a hub network design. Soybean breeders 556 subsequently take advantage of natural self-pollination to create RILs with sufficient seed for 557 replicated evaluations across many environments.

558
Our simulations attempted to emulate cycles of selection, crossing, and self-pollination currently 559 conducted by commercial and academic soybean breeders. Relative to outcrossing species 560 selecting homozygotes to participate in a hub network within each cycle will have smaller 561 effective populations and retain LD for more cycles of recurrent selection. Future studies are 562 needed to determine if the significant interaction effects that we found are due to population 563 structure and genome organization.

565
There has been considerable concern expressed about the limited genetic variability among  can be used to assess the impact of selection on genetic potential of populations in the long term.

582
In our simulations, the estimates of half-lives using 10 or 20 simulated cycles, are significantly 583 less accurate than estimates obtained with 40 simulated cycles of recurrent selection (Table S2).

584
Among the five factors we investigated, a soybean breeder can choose SM's, SI's and TS's. usually extremely expensive (Beavis 1994;Goring et al. 2001;Xu 2003). For a fixed budget, the 590 breeder will be faced with a trade-off between numbers of replicates and numbers of RIL's that 591 can be evaluated. In other words, H on an entry mean basis can be estimated, but not adjusted 592 without adding field plot resources, and nQTL and their contributions to genetic variance can be 593 estimated, but the estimates will be biased. projects evaluate a few to many dozen RIL's per family and future simulation studies, especially of two part systems, should consider whether relationships between evaluation units, RIL's and 688 selection units (Holland et al. 2003), possibly individuals, need to be used in design of the TS's. 689 We allowed the size of TS's to increase every cycle by adding data from prior cycles. Increasing 690 the number of RIL's per TS requires more computational resources. An alternative strategy is to 691 randomly sample subsets of data from each of the prior cycles to maintain a constant cumulative 692 training population size. It is also possible to assign weights to the samples from prior cycles to 693 place more weight on data from more recent cycles. These possibilities suggest determination of 694 an optimal combination of numbers of RIL's and weights that will provide maximum prediction 695 accuracy with minimal computational requirements. Some aspects of this optimization problem

731
In the future, it is possible that development of male sterile and insect pollination systems for soybean (Ortiz-Perez 2008; Davis 2020) will enable a cycle of RGS to be conducted using two or 36 three mating generations per year. This will enable an alternative genetic improvement system 734 based on decoupling genetic improvement from variety development (Gaynor et al. 2017). By 735 separating the two types of breeding projects, GS can be applied continuously. In the two part 736 system TS's will need to be composed of genotypic and phenotypic data obtained from annual 737 field trials of RIL's, although the TS's will be several selection cycles removed from the cycle 738 used to create the RILs.

739
Results reported herein suggest that RGS in a two part system will rapidly produce genetic gains    with (right panels) model updating using prior cycles as training sets for the four genotypic prediction models. Plots demonstrate decrease in maximum possible genotypic value due to loss of favorable alleles and increase in average genotypic value for 10 cycles (upper panel) and 40 cycles (lower panel) of selection. Phenotypic selection (PS) is not updated and hence is the same in the left and right panels. The top and bottom panels represent genotypic values for genetic architectures consisting of 400 simulated QTL responsible for 70% of phenotypic variability in the initial population. PS -Phenotypic Selection, RR-REML-Ridge Regression with Restricted Maximum Likelihood, BL -Bayes LASSO, and SVMRBF-Support Vector Machine with Radial Basis Kernel. Figure 6 Heat Map for Percent Gain in Rs Relative to PS for 0.7 H for GP Models without Updating: Heat map indicating standardized response relative to PS as percentage gain after 40 cycles of recurrent selection using genomic prediction models without updated training sets for 40, 400, 4289 simulated QTL responsible for 70% of phenotypic variability in the initial population. Blue to red shaded cells represent increasing gain in response relative to PS. RR-REML-Ridge Regression with Restricted Maximum Likelihood, BL -Bayes LASSO, and SVMRBF-Support Vector Machine with Radial Basis Kernel. Figure 7 Heat Map for Percent Gain in Rs Relative to PS for 0.7 H for GP Models with Updating: Heat map indicating standardized response relative to PS as percentage gain after 40 cycles of recurrent selection using genomic prediction models with updated training sets from up to 14 prior cycles of selection for 40, 400, and 4289 simulated QTL responsible for 70% of phenotypic variability in the initial population. Blue to red shaded cells represent increasing gain in response relative to PS. RR-REML-Ridge Regression with Restricted Maximum Likelihood, BL -Bayes LASSO, and SVMRBF-Support Vector Machine with Radial Basis Kernel.  Standardized Genotypic Variance (Sgv) for Comparison of GS methods with and without Updating for 0.7 H and Top 10% Selected Fraction Standardized genotypic variance without training set updating (left panels) and with training set updating using prior cycle training data (right panels) for the four GP models. PS has no updating and hence is the same in both left and right panels. A) Training data from up to 14 prior cycles for 40 simulated QTL (top), 400 simulated QTL (middle) and 4289 simulated QTL (bottom) responsible for 70% of phenotypic variability in the initial population and top 10% of RILs with the greatest predicted values. PS -Phenotypic Selection, RR-REML-Ridge Regression with Restricted Maximum Likelihood, BL -Bayes LASSO, and SVMRBF-Support Vector Machine with Radial Basis Kernel.