INCORPORATING A DYNAMIC GENE-BASED PROCESS MODULE INTO A CROP SIMULATION MODEL

Dynamic crop simulation models are tools that predict plant phenotype grown in specific environments for genotypes using genotype-specific parameters (GSPs), often referred to as “genetic coefficients.” These GSPs are estimated using phenotypic observations and may not represent “true” genetic information. Instead, estimating GSPs requires experiments to measure phenotypic responses when new cultivars are released. The goal of this study was to evaluate a new approach that incorporates a dynamic gene-based module for simulating time-to-flowering for common bean (Phaseolus vulgaris L.) into an existing dynamic crop model. A multi-environment study conducted in 2011 and 2012 included 187 recombinant inbred lines (RILs) from a bi-parental bean family to measure the effects of quantitative trait loci (QTL), environment (E), and QTL×E interactions across five sites. The dynamic mixed linear model from Vallejos et al. (2020) was modified in this study to create a dynamic module that was then integrated into the CSM-CROPGRO-Drybean model. This new hybrid crop model, with the gene-based flowering module replacing the original flowering component, requires allelic makeup of each genotype being simulated and daily E data. The hybrid model was compared to the original CSM model using the same E data and previously estimated GSPs to simulate time-to-flower. The integrated gene-based module simulated days of first flower agreed closely with observed values (root mean square error of 2.73 days and model efficiency of 0.90) across the five locations and 187 genotypes. The hybrid model with its gene-based module also described most of the G, E and G×E effects on time-to-flower and was able to predict final yield and other outputs simulated by the original CSM. These results provide the first evidence that dynamic crop simulation models can be transformed into gene-based models by replacing an existing process module with a gene-based module for simulating the same process.

The hybrid model with its gene-based module also described most of the G, E and G×E effects on 33 time-to-flower and was able to predict final yield and other outputs simulated by the original CSM. 34 These results provide the first evidence that dynamic crop simulation models can be transformed 35 into gene-based models by replacing an existing process module with a gene-based module for 36 simulating the same process. 37

Introduction 38
Scientific advances in understanding plant genes combined with advances in technologies for 39 rapidly and inexpensively identifying genetic makeup of plants (Thomson 2014; Rasheed et al. 40 2017) have fueled considerable interest in using genetic information to predict plant phenotypes. 41 Analytical tools are now available to identify the genes that are associated with the variation in 42 different plant traits. These bioinformatics tools also can identify important gene-by-environment 43 (G × E) interactions that contribute to observed variations in specific traits (Yin et al. 2018). Rapid 44 progress in genome-wide association studies (GWAS) has enabled researchers to identify genes 45 associated with variations in human diseases (Bush and Moore 2012). 46 Genome-wide prediction models that use GWAS also have become powerful tools for 47 improving crops such as tropical rice (e.g., Spindel et al. 2016). The GWAS approach has been 48 implemented in recent work in other crops (Brown et  Differences among cultivars are represented by empirical genotype-specific parameters (GSPs). 58 However, these models do not use information on variations in genes among the cultivars. Instead, 59 the GSPs for each genotype must be estimated using data from laboratory or field studies (Hunt et 60 al. 1993;Anothai et al. 2008; Buddhaboon et al. 2018). 61 predicted dynamic rates. As a result, this use of existing relationships makes it difficult to identify 85 G and G × E effects from field data, which can be seen in the expanded original model form 86 published by Wallach et al. (2018). Note that this expanded functional form inherently includes 87 many G × E interaction terms that may or may not exist. Incorporating genetic information into an 88 original model's functional form entangles G, E, and G × E effects and, thus, does not enable one 89 to study interactive G × E effects on the rate of progress toward flowering. Furthermore, we have Another issue is that the gene-based approach that thus far has evolved in the crop modeling 96 community has not been widely embraced by the genetics community, nor have the analytical 97 approaches used by geneticists to predict genetic effects on crop traits been adopted by the crop 98 modeling community. There have been limited interactions between these science communities 99 that might lead to more rapid advances in gene-based modeling. Hwang et al. (2017)   RILs and environment combinations were transformed into a development rate toward first flower 170 appearance, calculated as rate = 1/(days to first flower). This approach requires the implementation 171 of a function that predicts the daily development rate towards the time to first flower. 172 We designed a new module (DMLM; see Table 2  based on the developmental rate that is controlled by genotype and daily environmental conditions. 176 The time-to-flowering is determined when the cumulative addition of the daily progress time steps 177 reaches unity. Eq. (1) shows the DMLM module that contains four environmental variables, one 178 QTL × QTL interaction and seven QTL × E interactions. 179 , ( ) = (1) 180 Where FRs,g(t) is the rate of progress to flowering (1/d) for the g th genotype for the s th site at 181 time t (in days). represents the overall mean value of the daily development rates across all RILs 182 and sites in the MET dataset. In the linear function (Eq. 1), RILs were treated as random effects 183 and all remaining factors were considered as fixed effects. The variance-covariance structure that 184 was used was unstructured, which is the default in the lme4 R-package. Note that Eq. represents the coefficient for the q th QTL allele effect for RILg 198 (QTLq,g). Each QTL has a marker value numerically assigned according to its allelic identity; 199 Jamapa alleles were assigned as -1 and Calima alleles as +1 values (see Supporting Information -200 Table S1). 201 The daily rates (FRs,g(t)) in Eq. (1) are then accumulated or integrated over time to predict day 202 of first flower appearance using Eq. (2) and a daily time step (dt = 1). 203 where SUMFRs,g(t) integrates the flowering rate at time t (in days) starting on the day of 205 planting. SUMFRs,g(t) is set to 0.0 at the start of the simulation, and when it reaches 1.0, first 206 flowering is simulated to occur on that day t for the g th RIL at site s. approximately linear over a specific range of temperatures, and response plateaus as an optimum 227 temperature is reached. In fact, the effect of temperature on development rate can be more 228 accurately represented by a beta function (Ritchie and Nesmith 1991). Because the temperature 229 varies considerably within a single season, with location, and over time, plants are frequently 230 exposed to temperatures outside their linear response range. To help account for the non-linearity 231 response of common bean under high temperatures, a third module was created that uses a dynamic 232 piecewise linear function (referred to as the DPLM module) to ensure that the daily simulated 233 development rate is bounded to be within the range of temperatures that were observed in the MET 234 study and used to estimate the coefficients in Eq. (1) (Supporting Information - Fig. S2). 235 FRMAXg is defined as the rate at which the progress toward flowering proceeds when the 236 environmental conditions, i.e., the daily maximum and minimum temperature, day length, and 237 solar radiation, are at "optimum" values that result in a maximum development rate for the g th 238 genotype. The FRMAXg values were estimated by selecting the maximum rate (1/DURs,g) for each 239 RIL occurring across s environments, using the observed duration between planting and first 240 flower across all s sites using Eq. (3): 241 This resulted in 189 data points (one for each RIL plus the two parental lines). We determined 243 whether the values for FRMAXg were affected by the same QTLs that significantly affected the 244 time to first flower by estimating a linear relationship shown in Eq. (4). 245 where the variable is the fixed intercept estimated for the g th genotype and is the 248 coefficient that quantifies the allelic effect of the q th QTL on the maximum rate of progress. A 249 linear regression analysis was used to estimate coefficients of Eq. (4). 250 The final daily rate of first flowering was determined using the DPLM module that was 251

Incorporation of a gene-based module into CSM 267
The CSMG model (Hoogenboom et al. 1992(Hoogenboom et al. , 1994  To incorporate the first flower development stage using SUMFRDs,g(t) in the CSMG model, we 278 first developed a new gene-based module (GBM) to create a link between the DPLM module and 279 the crop model (Fig. 4). This module connects daily input data, the DPLM module, and the CSMG 280 phenology module. The inputs for this DPLM module consist of weather data from the crop model 281 and the 12 QTL allelic make up for each RIL (or genotype). The input QTL data for our study 282 were those for the 187 RILs plus the two parent cultivars, which are processed in a new QTL data 283 subroutine inside the GBM module. A new input file was created for the CSMG model, named 284 BNGRO047.GEN that contains QTL data for each of the RILs and their two parent cultivars (Table  285   S1). 286 The daily weather and QTL data for a particular site and RIL are inputs for the DPLM module, 287 enabling it to simulate the daily flowering rate as affected by G and E conditions. The integrated 288 development progress to first flower, SUMFRDs,g(t) and the day when first flowering occurs are 289 passed back to the CSMG phenology routine. The outputs from the GBM module are inputs to the 290 reproductive stage component, where the variables associated with first flowering are calculated. 291 The day when first flowering occurs is set and afterward, progress for subsequent development 292 phases are computed using the original CSMG model. 293

Sensitivity analysis of simulated variations for G × E 294
A simulation analysis was performed using the DPLM module to explore all possible 295 combinations of the 12 QTL variations among RILs using the daily weather data across all five 296 sites of the MET study, similar to previous ideotype studies (White et al. 2005). This resulted in a 297 total of 4,096 (2 12 ) RILs. The coefficients estimated using 187 RILs plus the two parents were 298 used to simulate the number of days to flowering for the 4,096 RILs. The input file 299 BNGRO047.GEN file containing QTL information was revised by adding inputs for each of the 300 4,096 RILs. Crop management including the planting dates and the daily weather data were 301 assumed to be same as for the original five environments of the MET study. The management 302 input file assumes that only the variations in genetics and environments affect the simulated 303 responses, representing potential production for each line. The DPLM module was then used to 304 conduct the 20,480 unique simulations across sites and synthetic RILs. 305 The time to first flower responses of the DPLM module were compared with those of the 306 DMLM-DL module to determine how the addition of a maximum rate, FRMAXg affects the 307 simulated results under different high temperature scenarios in a sensitivity analysis using the 187 308 RILs and two parents across all five sites of the MET study. The original daily temperature data 309 were used as the base line inputs. Then, both the minimum and maximum temperatures for each 310 day and each site were incremented at a 1 ºC increment to create five different temperature 311 scenarios (base, base+1, base+2, base+3, and base+4 C), assuming that crop management, daily 312 solar radiation, and day length were the same as for the original MET study. The simulated number 313 of days to flowering were analyzed and compared for the DPLM and DMLM-DL modules using 314 statistics and a visualization of the distributions of number of days to first. Although we did not 315 have sufficiently high temperatures in the MET study to account for the decrease in development 316 rate as the temperature increases above an optimal threshold, the simple addition of the maximum 317 rate in the piecewise linear module is expected to provide reliable simulations for small increases 318 in temperature. 319

Yield prediction 320
An ultimate goal of dynamic crop simulation models is to be able to predict yield. Therefore, 321 we compared the performance of the original CSMG model using GSPs with the DPLM module 322 integrated into CSM-CROPGRO-Drybean model (DPLM-CSM; Table 2 (1 − ) 363 The first term of Eq. (10) is the bias squared, the second term SDSD is related to the difference 364 between the simulation standard deviation (SDs) and the standard deviation of the measurements 365 (SDm). The third term, LCS, indicates the remaining MSE error that is not accounted for by bias 366 or standard deviation. 367  35148 10 −2 )). The first coefficient (α1 = -1.56357 10 −3 ) is the sensitivity to day length, 377 indicating that a one-hour increase in day length would result in a rate of development that is 378 1.56357 10 −3 below the average rate of 2.35148 10 −2 . This one-hour increase in day length 379 simulates that the time to first flower would occur 45.6 days after planting, an increase of 3.1 days 380 compared to the average days to first flower that was observed across the 5 sites and 187 RILs plus 381 the two parents. This rate of development also varies as a function of QTL alleles, which can 382 increase or decrease the rate resulting in a decrease or increase in the number of days to first flower, 383 respectively. Note that some the QTL coefficients in Eq. (11) have a negative sign while others 384 have a positive sign. This is because each parental genotype has both types of alleles; the allele 385 operator, i.e., Calima = +1 and Jamapa = -1, will alter the sign of the coefficient accordingly (Table  386   S1). The estimated parameters terms with the 2.5% and 97.5% confidence intervals, p-value, and 387 the variance components are shown in the supporting information (Table S2). The fixed effects 388 variance was 1.80182 x 10 -5 , the random effects variance was 6.34775 x 10 -7 , and the residual 389 variance was 1.3056 x 10 -6 . 390 Next, we compared the agreement between simulated and observed results for all sites, RILs, 391 and parents using the DMLM and DMLM-DL modules ( Fig. 2A, 2B and Table 3). Comparisons 392 of RMSE between simulated and observed values showed that the errors were only slightly 393 different between the two modules ( for PA. We attributed these differences to the fact that the MET did not include a site with long 400 days and low temperatures to contrast the long days and high temperatures of ND, which did not 401 adequately capture the temperature-day length interactions previously documented by Wallace and 402 Enriquez (1980) and Wallace et al. (1991). 403 The comparison of ME between the two module versions (DMLM and DMLM-DL) showed 404 that both have the same high model efficiency value of 0.90. Although ME values for the ND site 405 were lower (0.30 for both modules), these positive numbers indicate that the modules are more 406 effective than using the mean value of the observations. Table 3 also shows that the bias in 407 simulating time to flowering was low across all sites (less than 0.5 days), and MSE values were 408 low except for ND. The remaining error after accounting for bias and standard deviation 409 differences were much larger for both module versions at ND than for any of the other sites.

Dynamic piecewise linear module 415
The highest maximum rate for any genotype in the MET dataset was 0.0345 and the lowest 416 observed maximum rate for any genotype was 0.0222. This means that the duration from planting 417 to first flower varied from 29 to 45 days among genotypes under optimal environmental conditions. 418 These maximum rates of development occurred at the tropical PA and PR locations where 419 temperatures were warm and day lengths were relatively short. Although environmental conditions 420 may not have been optimal at these locations, most of the maximum rates across locations for any 421 RIL occurred in PA; only a few occurred in PR where the maximum rates for some RILs were The fitting of FRMAXg using Eq. (12) resulted in predicted caps on the rate of progress for the 444 187 RILs plus the two parents of our dataset with a RMSE of 1.66 days, ME of 0.77 days and MSE 445 of 2.78 (Fig. 3). These values indicate that the maximum developmental rates were affected by the 446 genetic factors (12 QTLs). 447

Structural changes of the CSM-CROPGRO-Drybean model 448
The new gene-based module (GBM) operates on a daily time step in the DPLM-CSM 449 phenology module to simulate the rate of development towards first flowering for a particular RIL 450 or cultivar and for a specific site, as shown in Fig. 4. This GBM module incorporates the DPLM 451 module and processes the QTL data obtained from the revised BNGRO047.GEN file, while 452 weather data are passed to the DPLM module from the CSMG routines. When the value of the observed data did not exhibit bimodal characteristics, except for the PO site, which had cooler 473 temperatures than the other sites. QTL2, which is associated with the growth habit gene Fin, shows 474 interaction with Tmin and is likely responsible for this bimodality (Bhakta et al. 2017). Also, on 475 average, the indeterminate growth habit RILs generally flowered later than the determinate growth 476 habit RILs. These graphs showed that Calima flowered earlier than Jamapa except for the ND site.  Table 3. 493 The original CSMG mostly produced simulated days to first flower that were in closer 494 agreement with observed results across all sites than the other modules (Table 3 and Fig. 2D). 495 However, these results are misleading in that the GSPs that produced these results were estimated 496 for each individual RIL, which means that only 5 data points were used to estimate 3 GSPs for 497 each RIL, and thus the agreements were forced in the GSP estimation process. The adjusted R 2 498 was calculated using five parameters for each RIL; the 113 RILs that had observations for all five 499 sites resulted in estimating 339 GSP parameters. The adjusted R 2 averaged for the RILs was 0.866, 500 ranging from 0.321 to 0.997 with a standard deviation of 0.129. However, note that the adjusted 501 R 2 values were lower than all of those for the gene-based modules for each site except at ND. As 502 Acharya et al. (2017) point out, the estimated GSPs were highly uncertain and that different 503 combinations of the GSPs could provide the same fit to observed data (showing equifinality in the 504 estimation process) such that the GSP estimates are not reliable even though they can nearly 505 reproduce the data. Estimating three parameters with only five data points, then repeating this 506 process for each of the RILs, results in estimates that reliably reproduce the data used to estimate 507 them but should not be interpreted as values that can be used for other environments or genotypes. 508 Estimation of these coefficients also requires considerable effort and resources, which has to be 509 repeated every time a new cultivar is released. Instead statistical gene-based modules can estimate 510 independently phenotypic traits using as input G, E, and G × E interactions data. By contrast, 511 estimating the 25 coefficients in the dynamic linear module (Eq. (11)) used all data across the five 512 sites and 189 (RILs plus parents), thus 945 observations were used to estimate 25 coefficients. 513 Therefore, using the dynamic mixed linear module estimation process has potential for a more 514 robust use of the module across environments and genotypes, especially for a new genotype that 515 has QTL information but does not have field phenotype data. 516 The frequency distributions associated with genetic variations in the RIL population for 517 simulated time to first flower at each site and for the five temperature scenarios are shown in Fig.  518 6. The comparisons of the means and standard deviations of the populations for each 4-519 temperature/site combination are summarized in Table 4. These results demonstrate the effects of 520 including the maximum rate of development (FRMAXg) for each genotype in the DPLM module 521 for comparison with the module without this upper limit. The largest differences in simulated days 522 to first flower occurred when the temperature was increased by 4 °C (Table 4)  We are not suggesting that use of the upper limit on development rate is robust for broad use, 533 but instead that the MET should include more sites that have a wider range of temperatures and 534 day lengths to enable nonlinear responses to be estimated. For example, improvements could be 535 attained using a beta function for temperature response, in addition to controlled environment 536 experiments with a wider range of genetic material to develop nonlinear functions to represent the 537 full range of environmental responses in this crop species. 538 539

Simulating response distributions for all potential genotype combinations 540
The performance of DPLM module across the five experimental sites was simulated for RILs 541 with all possible allelic combination (4096) of the twelve QTLs used by the dynamic time-to 542 flower module. Frequency distributions for the number of days from planting to first flower were 543 produced for each location (Fig. 7). The dots in the figure highlight the number of days required 544 for the Jamapa and Calima parents to reach the stage of first flowering. The simulated first 545 flowering dates at the ND site were later (mean of 59.1 days) and had a larger standard deviation 546 (9.10 days) compared to the other sites. The spread of simulated days to first flower ranged 547 between 41 and 94 days at ND due to its longer day lengths and some days with cooler 548 temperatures than other sites. The smallest average number of days to first flower was for sites 549 with high temperature conditions and short day lengths (PR and PA), where simulated means were 550 about 36 days for both locations, and standard deviations of 2.6 and 2.6 days, respectively, with 551 response ranges varying from 34 to 39 days for each location. The FL and PO sites with their warm 552 conditions showed simulated means of 42 days and 45 days, respectively, and standard deviations 553 of 3.5 days and 3.9 days, respectively. The shapes of the distributions were bell-shaped across all 554 sites except for ND which showed the flattest shape due to the G, E, G × E, and G × G interactions 555 in the mixed piecewise dynamic module. The altered behavior of the parental lines was also found 556 in these simulations, where Jamapa flowered earlier than Calima for the FL, PA, PO, and PR sites 557 and Calima flowered earlier than Jamapa for the ND site. 558

Yield prediction 559
For each of the five sites, we simulated yield using the original CSMG model and the original 560  (Fig 1), simulated mean yield by 562 CSMG was lowest for PA (190.0 ± 89 kg/ha) and highest for ND (637.3 ± 242 kg/ha) while for 563 the other three sites mean yield ranged from 304.7 ± 135 kg/ha for FL, 505.0 ± 270.0 kg/ha for 564 PO, and 540.0 ± 244.8 kg/ha for PR (Fig. 8). For the DPLM-CSM hybrid model, simulated yield 565 ranking among the five sites was similar. The highest mean yield was obtained for ND (720.0 ± 566 291.0 kg/ha), while the lowest mean yield was obtained for PA (234.4 ± 104.2 kg/ha). Mean yield 567 for FL was 354.8 ± 156.4 kg/ha, for PO was 625.0 ± 311.2 kg/ha, and for PR was 673.3 ± 264.7 568 kg/ha (Fig. 8). The differences in yield were due to the slight differences in simulated flowering 569 dates between the original CSMG model and the DPLM-CSM hybrid model ( Fig.2 and 5), while 570 all other growth and development processes were simulated exactly the same as the CSMG model 571 using the same inputs (Fig. 4). 572

Further advancement in gene-based modeling 573
This work presents an approach for incorporating gene-based modules into an existing crop 574 growth model for simulating days to first flower (by the DPLM module) and simulating all other 575 processes and final yield using original components of the CSMG. It builds on the approach 576 discussed by Vallejos et al. (2020). Only minor modifications were needed to enable their dynamic 577 model to be integrated as a module into the existing CSMG model (Hoogenboom et al. 1994;578 Jones et al. 2003). There were only small differences between results from the dynamic piecewise 579 linear module integrated into the CSMG model from our work and model results published by 580 Vallejos et al. (2020). We recognize the need for use of independent data to evaluate the predictive 581 capabilities in other environments and are working on that. In addition, there is a need to use a 582 more diverse population to evaluate the ability of the model to predict first flower occurrence 583 across genetic variations that may not be in the population used in the MET dataset. This work also shows that integrating the genetic information is a promising approach to predict 595 plant development stages of new genotypes and new environments. Instead of estimating GSPs for 596 a specific trait, it requires less effort when a new cultivar is released in that only QTL information 597 is required, saving time and resources that would be otherwise needed for phenotyping. The long-598 term expectation associated with most QTL studies is the replacement of each QTL linked marker 599 with the gene responsible for that particular QTL effect. This work further shows that genetic 600 modules for other processes can be based on statistical methods that are routinely used by 601 geneticists, if they are developed to replace equivalent modules in existing dynamic crop models. 602 However, it is clear that considerably more progress is needed to identify other issues that 603 might occur by combining these two types of models. There is a need to extend gene-based 604 modules to cover the full genetic variability of a crop and to introduce other process modules into 605 existing models. Further work is required to improve the gene-based module and to add other 606 processes that are linked dynamically with the crop model. 607

Conclusions 609
This study showed the potential for integrating a process-oriented gene-based module that 610 only requires genetic input information into an existing comprehensive crop model with its 611 empirical cultivar inputs without changing other modules or inputs. The CSM-CROPGRO-612 Drybean model with the integrated gene-based module was able to not only predict flowering 613 date using only QTL and weather information, but also final yield using the original GSPs for all 614 processes except rate of progression to first flower. This approach can be extended to other 615 processes for which QTL information is readily available. 616 617

Supporting Information 618
The following additional information is available in the online version of this article -619 Table S1: Recombinant inbred lines for common bean (Phaseolus vulgaris L.). Each Quantitative 620 trait loci (QTL) has a marker value according to its allelic identity, assigned as "+1" for the Calima 621 alleles and "-1" for the Jamapa alleles. This information was used as input for the Dynamic 622 Piecewise Linear Module (DPLM) coupled with the CSM-CROPGRO-Drybean model. 623       North Dakota (ND); Citra, Florida (FL); Puerto Rico (PR); Palmira, Colombia (PA) and Popayan,807 Colombia (PO). 808