Using large soybean historical data to study genotype 1 by environment variation and identify 2 mega-environments with the integration of genetic and 3 non-genetic factors

Soybean ( Glycine max (L.) Merr.) provides plant based protein for global food production and 22 is extensively bred to create cultivars with greater productivity in distinct environments. Plant 23 breeders evaluate new soybean genotypes using multi-environment trials (METs). Application 24 of METs assume that trial sites provide representative environmental conditions that cultivars 25 are likely to encounter when sold to farmers. Thus, it is important to understand the patterns 26 of genotype by environment interactions (GEI) that occur in METs. In order to evaluate GEI for 27 soybean seed yield and identify mega-environments, historical data were investigated with a ret- 28 rospective analysis of 39,006 unique experimental soybean genotypes evaluated in preliminary 29 and uniform trials conducted by public plant breeders from 1989-2019. Mega-environments (MEs) 30 were identified using yield records of lines from the annual trials and geographic, soil, and mete- 31 orological records at the trial locations. Results indicate that yield variation was mostly explained 32 by location and location by year interactions. The static portion of the GEI represented 26.30% 33 of the total yield variance. Estimates of variance due to genotype by location were greater than 34 estimates of variance due to genotype by year interaction effects. A trend analysis further indi- 35 cated a two-fold increase in the genotypic variance. Furthermore, the heterogeneous estimates 36 of genotypic, genotype by location, genotype by year, and genotype by location by year vari- 37 ances, were encapsulated by distinct probability distributions. The observed target population of 38 environments (TPE) can be divided into at least two and at most three MEs, thereby suggesting 39 improvements in the response to selection can be achieved when selecting directly for clustered 40 (i.e. regions, ME) versus selecting across regions. Clusters obtained using phenotypic data, lat- 41 itude, and soil variables plus elevation, were the most effective. In addition, we created the R 42 package “SoyURT” that contains the data sets used in this work. 43


First-stage analyses 218
The rst-stage analyses were previously performed by the collaborators, i.e., public soybean 219 breeders, before the data were submitted to the USDA for aggregating and reporting. Individual 220 trials within locations were analyzed using a model in which genotypes and blocks were consid-221 ered xed e ects, yielding eBLUE values, i.e., entry means for genotypes (y 1 ). The eBLUE values 222 were then analyzed with the following model to obtain an estimate of the genotypic variance 223 (σ 2 G ) that was subsequently used to estimate reliability (i 2 ) on an entry-mean basis: where y 1 is the vector of entry means reported for each trial in the PDF les, µ is the intercept,

225
Z g (m × m) is the incidence matrix of genotype e ects, g (m × 1) is a vector of genotype random 226 e ects with g ∼ N(0, σ 2 G I), and is a vector of residuals with ∼ N(0, Σ 1 ). The residual variance , where σ 2 G is the genotypic variance and σ 2 is the residual variance 233 (B 2020, p. 173). Note σ 2 G could also be estimated from the variance of entry means (σ 2 F ), 234 where σ 2 F = σ 2 r + σ 2 G . 235

Second-stage analyses 236
Second-stage analyses utilized data from multiple trials with common entries among trials at the 237 same location within a year (e.g. PT-A, PT-B, and UT), which were analyzed using: 238 y 2 = µ + X t t + X g g + where y 2 (mj×1) is a vector of eBLUE values for m genotypes evaluated across j trials at location 239 l, µ is the intercept, X t (mj × j) is the incidence matrix of xed e ects of trials, t (j × 1) is a 240 vector of xed e ects of trials, X g (mj × m) is the incidence matrix of xed e ects of genotypes, 241 g (m × 1) is a vector of xed e ects of genotypes and is a vector of residuals with ∼ N(0, Σ 1 ).

242
The elements of the estimated residual variance matrix Σ 1 (mj × mj) were obtained from the

Multi-location and multi-year analyses
For the third-stage of analyses the following "baseline" model was used to obtain estimates of 249 variance components across multiple locations and years: 250 y 3 = µ + X l l + Z g g + Z g.l g.l + Z y y + Z g.y g.y + Z l.y l.y + Z g.l.y g.l.y + where y 3 (mjt × 1) is a vector of eBLUE values for m genotypes evaluated across j locations 251 and t years, µ is the intercept, l (j × 1) is a vector of xed e ects of locations, g (m × 1) is a 252 vector of random e ects of genotypes with g ∼ N(0, σ 2 G I), g.l (mj × 1) is a vector of random 253 e ects of genotype by location interactions with g.l ∼ N(0, σ 2 GL ), y (t × 1) is a vector of random 254 e ects of years with y ∼ N(0, σ 2 Y ), g.y (mt × 1) is a vector of random e ects of genotype by year 255 interaction with g.y ∼ N(0, σ 2 GY ), l.y (jt × 1) is a vector of random e ects of location by year 256 interaction with l.y ∼ N(0, σ 2 LY ), g.l.y (mjt × 1) is a vector of random e ects of genotype by 257 location by year interaction with g.l.y ∼ N(0, σ 2 GLY ), and is a vector of residuals with ∼ N(0 258 Σ 2 ). X l (mjt × j), Z g (mjt × m), Z g.l (mjt × mj), Z y (mjt × t), Z g.y (mjt × mt), Z l.y (mjt × jt), 259 and Z g.l.y (mjt × mjt) are incidence matrices for their respective e ects. The elements of the 260 residual variance matrix Σ 2 (mjt × mjt) were obtained from model 2. Anderson-Darling (AD) goodness-of-t statistics (S 1986) were considered to select the Herein, "ME" and "cluster" are used interchangeably. We clustered 63 locations using six criteria: With the exception of Lat2 and Lat3, the optimal number of clusters was then de ned based 289 on the Silhouette and Elbow methods using the package "factoextra" (K and M 290 2020), followed by a K-means clustering with the R base function kmeans() allowing for a maxi-291 mum of 1,000 iterations and 100 multiple initial con gurations of the K groups.
where y 3 (mj × 1) is a vector eBLUE values of m genotypes in j locations, µ is the intercept, l 331 (j × 1) is a vector of xed e ects of locations, g (m × 1) is a vector of xed e ects of genotypes, 332 and is a vector of residuals with ∼ N(0, Σ 2 ). X l (mj × j) and X g (m × m) are incidence 333 matrices for their respective e ects, and Σ 2 (mj × mj) was previously de ned. to assess the relative e ectiveness of clustering environments into MEs. As previously demon-352 strated, CR/DR can be determined using variance components obtained from linear models: 353 y 3 =X r r + Z l(r) l (r) + Z g g + Z g.lr g.l (r) + Z y y + Z y.r y.r + Z g.r g.r + Z lr.y l (r) .y + Z g.y g.y + Z g.y.r g.y.r + Z g.y.lr g.y.l (r) + where r is a vector of xed e ects of clusters, and l (r) , g.l (r) , y.r, g.r, l r .y, g.y.r, and g.y.l r , are 354 random vectors with speci c variances of locations within clusters, genotype by location within 355 clusters interaction, year by clusters interaction, genotype by cluster interaction, locations nested 356 in clusters by year interaction, and genotype by year by location within clusters, respectively.

357
X r , and Z lr up to Z g.y.lr , are incidence matrices for their respective e ects and dimensions. The 358 remaining model terms were previously de ned.

359
Estimates of variance components from model 5 were used to obtain CR/DR: where ρ g is the correlation between estimated genotypic e ects in the non-clustered and clustered sets of environments and i 2 L and i 2 SR are the estimated reliabilities of genotype means in the non-362 clustered and clustered sets of environments, respectively. If CR/DR < 1, response to selection 363 will be more e ective if selections are made within clusters (A . 2000a; B K 364 2017). Note that it is possible for CR/DR > 1, indicating that selection will be more e ective 365 if selection is based on eBLUE values obtained from non-clustered environments. As per (A
2000a) The terms in equation 6 are de ned as follows: terms were omitted due to the lack of replicated data within environments.

373
The Jaccard similarity coe cient was used to compare the coincidence of locations within 374 cluster across clustering-types. For the sake of simplicity, only the σ 2 GR was estimated for each 375 clustering-type with the jackknife approach presented in section 5.3.4. Lastly, to evaluate if lo-376 cations were simply allocated to clusters by chance, we randomly assigned locations into two, 377 three, and four clusters, and assessed its CR/DR. This process was repeated 100 times.  Most of the estimated phenotypic variance for seed yield among annual soybean eld trials has 390 been due to location (σ 2 L ) and location by year interaction (σ 2 LY ) e ects (Table 2). Among the estimated GEI variance components (σ 2 GL + σ 2 GY + σ 2 GLY ), the static contribution (σ 2 GL ) represented 392 26.30% (Table 2, model M3-1).

393
Estimated variance components reveal distinct multi-modal distributions over time ( Figure   394 2). For example, the estimated genotypic variance (σ 2 G ) more than doubled from ∼ 3.54 ( of location, year, and location by year interaction variances ( Figure A1).

401
Several multi-modal models were evaluated for goodness of t for the empirical distributions 402 of variance components (Table A4). The best-t models for variance components across time con- are reported for the selected models (Table 3).

410
The criteria for determining best-t models included low AIC and BIC values, goodness of t The six clustering criteria revealed that the observed 63 environments can be divided into at 416 least two and at most three MEs. Clusters based on PHE criteria had the best (lowest) relative 417 e ectiveness value, CR/DR = 0.62 (Table 4). This value was computed using results from analysis 418 from model 5, and PHE determined by estimates of genotype by location interaction (Σ gl ) from 419 model M3-18 (Table 1). The simplest, but unrealistic model (M3-1) assumed homogeneous and 420 independent variances and had the greatest AIC value.

421
For PHE, the optimal number of clusters was three, while for SoilE, WW, and WA it was two.

422
For all clustering types, both Silhouette and Elbow criteria indicated similar results ( Figure A4).

423
The number of environments within clusters ranged from 7 to 54 (Table 4 and Figure 5). Calcu-424 lated Jaccard distance metrics among the clusters created by di erent criteria showed that many values. The greatest similarity was between clusters 1 and 2 from PHE and Lat2, respectively (Ta-427 ble A5).

428
The best-t model (M3-18) accounted for heterogeneous variances and pairwise genetic cor-429 relations with FA matrices of order k = 5 and k = 2 for Σ gl and Σ gly , accounting for 90.2% and 430 11 87.2% of the genetic variances, respectively (Table 1). Pairwise genetic correlations between the 431 63 observed locations show a higher average correlation within phenotypic clusters compared to 432 across phenotypic clusters ( Figure A5).

433
Prior to k-means clustering, a non-linear PCA was performed with centered and scaled en- for the selected window. The selected windows (i.e., with the highest Pearson correlation) were 441 smaller than 20 days, and most of them started at the beginning of the cropping season (Table A1 442 and Figure A9). For WW and for the sake of simplicity, only the highest window for each MV are 443 presented (Table A2), although locations were clustered as described in section 5.3.5.3.

444
The e ectiveness of clustering was assessed using CR/DR, which compares the response to units. When locations were randomly assigned to two, three, or four clusters, the mean CR/DR 460 values were always bigger than one, with large decreases in the reliability (Table 4).

461
REML estimates of variances decreased for σ 2 G when clusters were included in the complete 462 dataset, with the exception of WW. The observed ratios σ 2 GR /σ 2 G and σ 2 GR /σ 2 GL were greather 463 than 0.50 for PHE, SoilE, and Lat2. In terms of the partitioning of σ 2 GL , the σ 2 GL(R) portion was 464 substantially reduced for PHE, Lat2, and Lat3. The σ 2 GR was better captured by PHE, SoilE, and 465 Lat2, being ine ective for WW (estimate bounded at zero due to REML properties). On the other 466 hand, for σ 2 GLY , just a small portion of the variation was captured by σ 2 GRY (Table 2). Both PHE and Lat2 clustering types were able to greatly capture σ 2 GR according to analysis for each group 468 of years ( Figure 6). Lastly, large reductions in the variation of years (σ 2 Y ) were observed for SoilE 469 and Lat2. interactions involving genotypes and years in MZs II and III, we included eBLUE values from the 508 PYTs, thus providing at least two years of data for a broader base of genotypes.

509
Even with inclusion of data from PYTs there is potential for bias in the estimates of σ 2 GLY , σ 2 GL , 510 σ 2 GY , and σ 2 G due to the phenomenon of missing data between years. Many of the experimental 511 genotypes evaluated in the PYTs will be culled before becoming entries in a subsequent year of 512 URTs. As a consequence, the missing data are missing at random (MAR program, in order to be con dent with the REML estimates of variances obtained from METs, 560 they recommended that at least seven to eight years of trials should be included. This result and 561 the fact that soybean breeding cycles in public programs require seven to eight years, we created 562 four subsets of data consisting of seven to eight years. The reader should keep in mind that the 563 goal of minimizing bias in estimators is distinct from the magnitude and proportion of estimates 564 of variance components. The former is a concern for algorithmic estimators while the latter is of 565 concern for breeding decisions about whether to use more locations or more years in their crop 566 species and for traits under selection.

567
Estimates of variance components were also used to t parametric probability distributions 568 and to quantify the e ectiveness of dividing the sampled 63 environments/locations into MEs.

569
For the distributions, a jackknife resampling approach was implemented and consisted of leaving-570 one-environment out and estimating the variance components in each group of years (1989-1995, 571 1996-2003, 2004-2011, 2012-2019 (Table 2). This is a valid estimate, however, it does not allow a trend quanti cation. The GEI trend 584 are crucial as they can be incorporated in simulation pipelines to depict genetic and non-genetic 585 trends, which is current a topic under investigation. Herein, we identi ed MEs through reliable estimates of variance components. However, other 634 environmental subsets would have been formed using di erent clustering strategies (B 635 . 2008). Given this type of approach is an unsupervised learning (i.e., we do not know the 636 truth about MEs), the objective is always to discover an interpretable grouping of members. We 637 addressed interpretation using e ectiveness of clustering (A in the historical data should we go in order to take maximum advantage of the data in current models. It is also worth investigating if modelling maturity groups (specially when more data is 651 considered) would enhance the ability of nding meaningful MEs using phenotypic models.

653
We dissected the sources of soybean seed yield variation using reports from Soybean Cooperative

654
Tests for maturity groups II and III. We determined that sampled sets of environments can be split 655 into mega-environments according to phenotypic, geographic, and meteorological information.

656
Reasonable estimates of variance components are essential for analyses of data from historical 657 eld trials. Furthermore, it was possible to monitor trends in variance components involving 658 genotypes in terms of parametric probability distributions. Historical eld trials also evaluate 659 traits like seed quality and size, iron de ciency chlorosis, green stem, seed oil, and protein content.

660
The approach presented herein can be applied to variation of multiple economically important 661 quantitative traits. Finally, in addition to the practical and theoretical results applied to soybean 662 genetic improvement, the analysis performed in this study may be applied to quantitative traits 663 evaluated in any crop using multi-environment trials.  1989-1995, 1996-2003, 2004-2011, and 2012-2019. Empirical estimates were obtained using a jackknife leave-one-location out method. Vertical bars on the x-axis represent point estimates across all years. . Empirical estimates were obtained using a jackknife leave-one-location out method. The best-t models for PDF's are presented with di erent colors, and include the name of the distribution and its mixture number.
The evaluated variance-covariance structures were identity (I), diagonal (D), and factor analytic (FA k ) from order 691 k = 1, ..., 6. Σ 2 is the residual variance matrix assumed to be known from single trial and location analysis.        1989-1995, 1996-2003, 2004-2011, and 2012-2019, soil + elevation (SoilE), latitude split into two groups (Lat2), latitude split into three groups (Lat3), weather across years (WA), and weather within years (WW) clustering types.   1989-1995, 1996-2003, 2004-2011, and 2012-2019. Figure A2: Cumulative distribution function (CDF) for the best-t models according to the genotypic (A), genotype by location (B), genotype by year (C), and genotype by location by year (D) variances from jackknife. In each plot, the legends states for name of the distribution followed by its mixture number.      Figure A6: Scatterplots of the scaled and centered soil variables and elevation. Pearson correlation is displayed on the right. Variable distribution is available on the diagonal. The labels SV1, SV2, to SV8 are the soil variables described in section 5.3.5.  Figure A7: Scatterplots of the scaled and centered weather variables computed across years. Pearson correlation is displayed on the right. Variable distribution is available on the diagonal.The labels MV1, MV2, ..., to MV19, are the weather variables described in Table A1.  Figure A8: Scatterplots of the scaled and centered weather variables computed within years. Pearson correlation is displayed on the right. Variable distribution is available on the diagonal. The labels MV1, MV2, ..., to MV19, are the weather variables described in Table A2.  Figure A9: Exhaustive search from the critical environmental window computed from the Pearson correlation between the weather variables across years and the genotype by location deviations of each environment (location-year combination). The dots depict the highest window correlation. The labels MV1, MV2, to MV19 are the weather variables described in Table A1. Table A5: Jaccard similarity matrix between clustering-types (phenotypic -PHE, soil and elevation -SoilE, latitude split into 2 clusters -Lat2, latitude split into 3 clusters -Lat3, weather across years -WA, and weather within years -WW).