Explainable machine learning models of major crop traits from satellite-monitored continent-wide field trial data

Four species of grass generate half of all human-consumed calories. However, abundant biological data on the species that produce our food remain largely inaccessible, posing a direct barrier to understanding crop yield and fitness traits. Here, we assemble and analyse a continent-wide database of field experiments spanning 10 years and hundreds of thousands of machine-phenotyped populations of ten major crop species. Training an ensemble of machine learning models, using thousands of variables capturing weather, ground sensor, soil, chemical and fertilizer dosage, management and satellite data, produces robust cross-continent yield models exceeding R² = 0.8 prediction accuracy. In contrast to 'black box' analytics, detailed interrogation of these models reveals drivers of crop behaviour and complex interactions predicting yield and agronomic traits. These results demonstrate the capacity of machine learning models to interrogate large datasets, generate new and testable outputs and predict crop behaviour, highlighting the powerful role of data in the future of food. Despite, and perhaps because of, the abundance of data on agricultural variables and plant traits, connections to crop yield can be difficult to establish. The machine learning models detailed here provide accurate predictions that help tease out these behaviours.

Over 2 billion people are projected to enter the world population by 2050 (ref. 1). Feeding these people sustainably requires an improved understanding of the complex evolutionary interactions driving major crop traits 2,3 and the application of this knowledge through plant breeding to produce new crop varieties. However, large-scale data on the growth and yield traits of major crops remain generally unavailable or inaccessible to academic scientists 4 and, where available, 'big data' often result in incomprehensible black-box models of plant behaviour. This presents two unsolved questions in plant science: how best to develop new models of complex and poorly understood plant behaviours using accessible big data, and how to leverage these models to generate comprehensible and testable outputs.
Here, we demonstrate the promise of machine learning (ML) algorithms to provide robust prediction of important agronomic traits, including yield, and to develop and test new and readable models in crop biology. By linking satellite data to a publicly available big dataset, the Australian National Variety Trials (NVTs), we develop a framework to train and test accurate ML models that allow quantitative comparison of the drivers of yield and key agronomic traits. These findings highlight the potential of comprehensive cross-species models to predict and understand agronomic and environmental effects on crop performance and reveal pathways to generate comprehensible ML models of plant behaviour.
The NVT database constitutes one of the largest public experiments on Earth (Fig. 1 and Supplementary Table 1). The NVTs capture over a quarter of a million unique variety-year observations and over a million unique population-averaged phenotypes, aggregated from experimentally replicated plant populations in 6,547 geolocated randomized controlled experiments (Supplementary Table 1; database descriptor, ref. 5). Each population contains hundreds of individual plants, sown at controlled densities and replicated across a randomized controlled design trial, each conducted and phenotyped according to highly standardized protocols 6.
As such, the NVTs capture the aggregated phenotypic variation of hundreds of millions of individual organisms, across thousands of trial sites containing millions of plant populations 5,7 .
We linked these data to extensive satellite data characterizing canopy-level agronomic traits, vegetation, temperature and spectral patterns 8–11 (Supplementary Table 2 and Supplementary Fig. 1), weather-station data from the Australian Bureau of Meteorology (BOM), >10,000 standardized soil sample tests, extensive observational and site-management data, >50,000 field-years of stubble burn patterns 12, the dose and timing of >350,000 chemical and fertilizer applications and crop rotation histories for >10,000 field-years 5. Collectively, these data capture patterns of management inputs, vegetation and environment and include satellite-derived observations (Fig. 1b) that improve the categorization of growing environments and reveal the complex environmental diversity of trial sites (Fig. 1c).
Using this open-source environmental and agronomic database, now available as a version-controlled open data resource 5 , we train a suite of robust ML algorithms for the prediction of key agronomic traits, including yield, flowering and grain protein (Supplementary Table 1). In addition to providing a catalogue of phenotype prediction models, these models are used to demonstrate the potential for ML algorithms to generate comprehensible outputs and testable interaction models beyond variable importance rankings and the 'black box' paradigm. Using targeted ML model interrogation and analysis, we generate candidates for the causal drivers of complex traits including yield and grain protein content. We reduce random forests to predictively valuable and readable prediction rules, using an approach pioneered by Deng 13 , with direct and testable outcomes for agronomic research. This model reduction approach reveals cross-domain interactions between variables that robustly predict trait variation and, in contrast with variable importance scores, directly represent the internal model dynamics and decision-making processes of ML models. As a result, rather than produce a black-box prediction model, these analyses reveal methods to generate both accurate forecast models and pathways for scientists to turn the previously obscure logic of ML models into hypotheses.

Results
Machine learning provided clear and accurate predictive models across a broad array of challenges. However, providing a robust estimate of model accuracy required extending model evaluation beyond the standard analytical framework. The standard approach to assessing ML accuracy uses random-sample holdout values: random observations are excluded from the dataset before model training and subsequently used as predictive targets to evaluate model accuracy. Across all models, however, this testing approach generated a misleading picture of model accuracy (Supplementary Table 3).
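The evaluation problem described above can be illustrated with a small sketch. This is a toy Python/scikit-learn illustration (the study's pipeline used R, and all variables and effect sizes here are invented): plots from the same trial share unobserved site and season effects, so a random split lets a model memorize trial identity, while holding out whole trials removes that shortcut.

```python
# Toy contrast between random-sample holdout and whole-trial ('grouped') holdout.
# All names and effect sizes are illustrative, not from the NVT pipeline.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupShuffleSplit, train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_trials, plots = 60, 20
trial_id = np.repeat(np.arange(n_trials), plots)
site_coord = rng.uniform(size=n_trials)[trial_id]    # feature that identifies the trial
trial_effect = rng.normal(0, 2, n_trials)[trial_id]  # unobserved shared site/season effect
plot_x = rng.normal(size=len(trial_id))              # genuine plot-level predictor
X = np.column_stack([site_coord, plot_x])
y = trial_effect + plot_x + rng.normal(0, 0.5, len(trial_id))

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Random-sample holdout: plots from the same trial land in both train and test.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
r2_random = r2_score(yte, model.fit(Xtr, ytr).predict(Xte))

# Holdout-trial split: entire trials are excluded from training.
tr, te = next(GroupShuffleSplit(test_size=0.25, random_state=0).split(X, groups=trial_id))
r2_trial = r2_score(y[te], model.fit(X[tr], y[tr]).predict(X[te]))

print(round(r2_random, 2), round(r2_trial, 2))  # the random split looks far better
```

The random split scores highly because the forest memorizes the trial-identifying feature; the grouped split reports only what generalizes to unseen trials.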
While baseline accuracy was high, ML models exhibited different performance across prediction challenges. These included agronomically relevant problems, such as using the data available at the time of sowing (TOS) for prediction of end-of-season phenotypic variation in new locations (Fig. 3 and Table 1), the projection of end-of-season yield as a season progresses (Supplementary Fig. 2) and annual forecast predictions (Fig. 4 and Table 1).
Model accuracy extended to other agronomically important complex traits. Several traits could be predicted under holdout trial prediction and annual forecasts with accuracy of R² = 0.5 or greater. Predictive accuracy was largely retained in models trained using only data available at the TOS. Under holdout trial prediction, yield could be predicted with up to R² = 0.80 accuracy under stratified cross-validated BCRFs (Table 1), while methods such as LSVMs (R² = 0.64), XGBMs (R² = 0.64) and PLSRs (R² = 0.58) displayed more limited accuracy (Table 1). Unsurprisingly, ML model accuracy fell under all annual forecast prediction tests (Table 1). For full-season data over rolling annual forecasts, BCRF models retained the highest accuracy when predicting yield (black line, Fig. 4; Table 1 and Supplementary Code 1), followed by the PLSR model (blue line, Fig. 4; R² = 0.74; Table 1) and the XGBM (R² = 0.69; Supplementary Code 1). Similar patterns in accuracy occurred for ML models trained on data available at the TOS, 100 d after sowing (DAS) and 200 DAS, tested in 2018 only rather than over rolling annual forecasts (Table 1). Again, the BCRF and PLSR models had the highest predictive accuracy, with the greatest reduction and lowest absolute accuracy in the LSVMs (from R² = 0.64–0.68 to R² = 0.37–0.45; Table 1).
As expected, ML approaches generally increased in accuracy with the inclusion of more years of data (Fig. 4) and for predictions made at progressively later points in the growing season (Supplementary Fig. 3). However, rolling forecast data, in which models were used to predict variation in the next calendar year, displayed surprising patterns. Concurrent shifts in forecast accuracy occurred across different models, phenotypes and species (Fig. 4a,b), with accuracy increasing (for example, 2011-2012 and 2013-2014) or decreasing (for example, 2015-2016) across diverse models and phenotypes. While observations were limited, changes in model forecast accuracy were correlated across ML models and species (P < 0.02; Supplementary Code 1) despite fixed model training parameters, the relatively constant sample size of target data and the absence of overfitting (as measured in random holdout trials; Fig. 1b and Table 1). Some years were less predictable, across models and species, independent of model construction and sample size.
Annual forecast and random holdout trial prediction models were constructed both with and without partitioning crops into separate datasets for model training (Supplementary Code 1 and Supplementary Table 4). Unsurprisingly, incorporating all species in a single 'omnibus' dataset often led to less accurate models (Supplementary Table 4). However, in some notable cases implementing a unified approach, where species were fit as binary independent variables, generated models of comparable or greater accuracy than models with identical parameters trained on single-species data (Fig. 3, Supplementary Fig. 4 and Supplementary Code 1). For example, when predicting wheat yield in 2018, annual forecast models trained using only wheat data to 100 DAS had R² = 0.66 accuracy under naive BCRFs, R² = 0.38 under LSVMs and R² = 0.59 under PLSR models (Supplementary Code 1 and Supplementary Table 4). When these models were retrained using cross-species data, with identical parameters, the accuracy of 2018 wheat yield predictions improved marginally to R² = 0.69 for BCRFs, R² = 0.40 for LSVMs and R² = 0.65 for PLSRs (Supplementary Code 1 and Supplementary Table 4). In canola, construction of omnibus cross-species models more substantially modified accuracy: at 100 DAS, accuracy increased from R² = 0.28 to 0.60 in naive BCRFs and from R² = 0.19 to 0.37 in PLSRs, with a low-accuracy model of R² = 0.11 falling to R² = −0.008 in the high-prediction-variance LSVM models (Supplementary Code 1 and Supplementary Table 4).
Assessing the features and interactions behind predictively accurate models was approached on several fronts. While 'black box' models are generally impenetrable to further analysis, altering model inputs and examining shifts in model accuracy can provide limited insight into model mechanics. Collectively, for example, remote-sensing data were a key driver of model accuracy (Supplementary Figs. 5 and 6). Under a leave-one-out model testing approach, removal of the remote-sensing variables caused the greatest loss in predictive value (Supplementary Fig. 5 and Supplementary Code 2), compared to the more marginal effect of removing management data, BOM weather-station data or metadata (Supplementary Fig. 5 and Supplementary Code 2). This pattern, in which satellite data held the greatest predictive value for model training, was reinforced across holdout trial and annual forecast model assessment (Supplementary Code 1). For example, under RPRM models the removal of satellite data reduced the accuracy of wheat yield annual forecast models from R² = 0.75 for models containing all input variables to just R² = 0.36 for models trained without satellite data (Supplementary Code 2). In contrast, the impact of removing weather-station data, management data or metadata was, at worst, a marginal reduction in accuracy from R² = 0.75 to 0.71 (Supplementary Code 2).
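The leave-one-domain-out ablation described above can be sketched as a simple retraining loop. This Python/scikit-learn sketch uses synthetic stand-ins for the satellite, weather and management domains (none of the data or groupings are from the NVT pipeline); the drop in holdout R² after removing each block measures that block's contribution, interactions included.

```python
# Leave-one-domain-out ablation: retrain the same model with each predictor
# block removed and compare holdout accuracy. Synthetic toy data only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n = 2000
domains = {
    "satellite": rng.normal(size=(n, 4)),
    "weather": rng.normal(size=(n, 3)),
    "management": rng.normal(size=(n, 3)),
}
# In this toy example yield depends most strongly on the 'satellite' block.
y = (3 * domains["satellite"][:, 0] + domains["weather"][:, 0]
     + 0.5 * domains["management"][:, 0] + rng.normal(0, 0.5, n))

def holdout_r2(blocks):
    X = np.column_stack([domains[b] for b in blocks])
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    m = RandomForestRegressor(n_estimators=100, random_state=0).fit(Xtr, ytr)
    return r2_score(yte, m.predict(Xte))

full = holdout_r2(list(domains))
ablation = {d: full - holdout_r2([b for b in domains if b != d]) for d in domains}
print({d: round(v, 2) for d, v in ablation.items()})  # largest drop = most informative
```

The same loop generalizes to any partition of predictors, but, as noted above, it cannot be pushed down to all smaller variable groupings: the number of subsets grows combinatorially.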
Retraining models using only single domains, such as using only management or satellite data for model training, provided insight into the predictive value of each domain without cross-domain interactions (Supplementary Fig. 6). Again, satellite data made the greatest contribution to accuracy in all ML models. Models constructed exclusively using satellite data predicted wheat yield variation with R² = 0.71 accuracy (Supplementary Fig. 6), exceeding the accuracy of metadata-only (R² = 0.59), management-only (R² = 0.37) and weather-station data-only models (R² = 0.34; Supplementary Fig. 6 and Supplementary Code 1). While such results are valuable and broadly compatible with findings from the extensive remote-sensing literature, interrogating model accuracy in this way was fundamentally constrained: assessing the interactive contributions of smaller variable groupings and individual variables was not combinatorially feasible. Further interrogation of model dynamics using heuristics that rank features by their importance revealed the nominal value of individual variables, yet showed limited concordance between methods (Fig. 5a-c). Some differences probably arose from the diverse scoring methods used (Supplementary Code 1). However, some variables, such as accumulated rainfall, attained high importance ranks across all models (blue points; Fig. 5d). Across PLSR, RPRM and BCRF models, consistently high importance ranks were assigned to cumulative rainfall, latent heat flux (an indicator of transpiration rates 21 and stomatal conductance 22 across a canopy), application rates of sulfur and application of the pesticide Clethodim (Fig. 5a-c). Variable importance scores, however, lack any indication of the direction of effect or of whether orthogonal or interactive effects underpin the predictive value of 'important' variables. For example, importance scores impart very high ranks on two variables for cross-species yield, latent heat flux and Clethodim application (Fig. 5), yet convey no information on whether these variables act independently or additively, affect one or many species, raise or lower yield or behave in a context-dependent or context-independent manner. As such, ML models were interrogated to generate interpretable outputs that capture this finer detail.
Decision trees generated by RPRMs include the direction of observed effects and reveal possible variable interactions via hierarchical dependencies (Supplementary Figs. 7 and 8 and Supplementary Code 1). For example, the yield-predictive RPRM in Supplementary Code 1 reveals that, in TOS-only and full-season prediction models, lower soil sodicity, higher soil carbon, phosphorus and nitrogen and higher total applied doses of nitrogen and phosphorus were uniformly predictive of higher yield. Other interactions were context-dependent: for example, higher rainfall was predictive of higher yield at eight of the 14 decision points in the full-season RPRM model above (Supplementary Code 1). Species-targeted RPRM decision trees also provided insight into the full-season (Supplementary Fig. 7) and pre-sowing predictors of yield (Supplementary Fig. 8 and Supplementary Code 1). Decision points included well-known agronomic interactions, such as gains in wheat yield from the pre-sowing application of nitrogen fertilizers or monoammonium phosphate (Supplementary Fig. 8), as well as previously unknown interactions, such as the discrimination between lower-yielding populations through satellite reflectance bands and latent heat flux (leftmost branches; Supplementary Fig. 7).
Visual inspection of tree-based models does not work at scale, for example when generating a forest of thousands of decision trees using BCRFs. We overcame this problem by reducing BCRFs to their most common and predictively robust decision subtrees (the most common sequences of decisions within a random forest of decision trees) using the inTrees analytical approach v.1.2 (ref. 13). These subtrees proved distinct from existing outputs: unlike variable importance scores, subtrees are independently functioning fragments of the actual ML model and directly captured complex interactive effects. This approach revealed complex yield-predictive dependencies between remote-sensing, environmental and management inputs (Supplementary Tables 5 and 6). For example, the most common predictively robust decision subtree in canola constitutes a complex cross-domain dependency (Supplementary Table 5) in which the effect of the normalized difference vegetation index (NDVI) on canola yield depends on MODIS band 7 reflectance exceeding 0.10 and on total accumulated rainfall to 150 DAS falling below 310 mm. Similar patterns were revealed in wheat: for example, a high enhanced vegetation index at 160 DAS and a high maximum enhanced vegetation index at 150 DAS predicted a high 5.5 t ha⁻¹ yield. Low-yield prediction rules included indicators of vegetation stress, such as low photosynthetic and vegetation indices (Supplementary Tables 5 and 6). These 26 quantitative models, generated via ML, include a mix of well-known and new interactions predictive of yield (Supplementary Tables 5 and 6) that can be reformulated as testable hypotheses with trivial human input. The dominance of foliage-based and cumulative rainfall-based rules suggests that ML is primarily capturing known drivers of plant growth, a desirable output of ML models.
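The subtree-extraction idea can be sketched outside inTrees by harvesting the most frequent short decision paths from a forest directly. This is a simplified, illustrative Python/scikit-learn analogue, not the inTrees algorithm itself; the feature names echo the variables discussed above, but the data are synthetic.

```python
# Minimal sketch in the spirit of rule extraction: count the most common
# root-split -> child-split sequences across a random forest and print them
# as readable rules. Names and data are toy stand-ins, not NVT predictors.
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
names = ["NDVI_160", "rain_to_150", "band7_reflect", "land_surface_temp"]
X = rng.normal(size=(2000, 4))
# High yield only when both vegetation index and rainfall proxies are high.
y = np.where((X[:, 0] > 0) & (X[:, 1] > 0), 5.5, 2.0) + rng.normal(0, 0.3, 2000)

forest = RandomForestRegressor(n_estimators=100, max_depth=2, random_state=0).fit(X, y)

paths, thresholds = Counter(), {}
for est in forest.estimators_:
    t = est.tree_
    for child, op in ((t.children_left[0], "<="), (t.children_right[0], ">")):
        if t.children_left[child] == -1:          # child is a leaf: no second split
            continue
        key = (names[t.feature[0]], op, names[t.feature[child]])
        paths[key] += 1
        thresholds.setdefault(key, []).append(t.threshold[0])

for (feat, op, nxt), count in paths.most_common(3):
    cut = np.mean(thresholds[(feat, op, nxt)])
    print(f"{count} trees: if {feat} {op} {cut:.2f}, next split on {nxt}")
```

The dominant path recovers the planted interaction (one signal variable conditioning on the other), illustrating how frequent subtrees expose interactive structure that a flat importance score cannot.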
While the general importance of these variables in agronomy ranges from obvious to somewhat unclear, the pattern of interactions between these variables remains subject to extensive debate. Computationally reducing a large set of 'obvious' predictors to a small set of logical rules, for example quantitatively linking evapotranspiration rates to cumulative rainfall (Supplementary Table 5), is therefore a non-trivial outcome. In addition, the process also generates rules that incorporate new predictors. For example, the interaction of Diquat application with high land surface temperature and red reflectance values to predict low yield (Supplementary Table 5) produces a new management-environment interaction, between Diquat application and heat stress, that is accessible to further testing. This seeming capacity both to recapitulate human thinking and to generate new and testable ideas is an exciting result that requires careful appraisal.
In line with their high importance rankings and their overrepresentation in RPART decision trees, and despite comprising just 2.2% of predictors, vegetation indices, gross primary productivity and latent heat flux were common across predictively valuable subtrees (Supplementary Tables 5 and 6). For example, NDVI occurs in 13% (11/83) of all decisions in Supplementary Tables 5 and 6, yet constitutes only 2.2% of all input variables. Likewise, cumulative rainfall (0.04% of the initial predictor variables) is present in nine of the 26 prediction rules and constitutes 11% of the prediction rule chains: a 25-fold overrepresentation.

Discussion
Foundational crops such as wheat, canola and oats provide a substantial fraction of total human caloric intake 24. However, the traits underpinning the yield of these species are driven by poorly understood complex interactions. In particular, environments are largely characterized using ground-station data and, increasingly, highly complex but variable-coverage drone data. These approaches overlook the opportunity offered by satellites as consistent, regularly timed, instantaneous and global instruments for capturing environmental variation.
In this study, low-resolution satellite data are used to characterize environmental diversity across sites, providing several insights into crop yield. Even these crude whole-site measures added substantial predictive capacity to our models (Supplementary Figs. 1, 7 and 8). In particular, our findings suggest the importance of latent heat flux, a proxy for water availability 21, stomatal conductance 21,22 and canopy transpiration rates 22, as a predictor of yield variation across sites.
These findings suggest remarkable potential for both existing low-resolution and developing high-resolution satellite resources in agriculture and plant breeding. In contrast with the 250-1,000 m pixel daily resolution data used here, sub-40 cm resolution data will be available multiple times daily from many providers as soon as late 2021. This resolution is sufficient to resolve, for example, individual cotton and canola plants every clear day throughout a field season. In our opinion, capturing environmental patterns at this spatial and temporal scale, along with the potential for direct satellite observation of key agronomic traits such as growth patterns and phenology 25, has the capacity to dramatically alter the conduct of large-scale plant breeding trials. If these data can be meaningfully integrated into plant breeding models and used to inform plant biology, high-resolution satellite data may have a substantial future in agronomy. Integration of satellite, drone and large 'omics data into plant breeding models has largely been constrained by statistical barriers. It is generally easier to generate 'big' data and saturate plant populations with variables than to provide a meaningful analysis of such data. That is not to say that success is impossible: for example, large 'omics datasets may be reduced to indices or trait values without substantial information loss 26. However, this dimension reduction approach is not always appropriate and may reveal little or nothing about the interactions between variables. Integrating data into ML pipelines with comprehensible model mechanics, and carefully interpreting the results, provides a pathway to solve these problems and meaningfully integrate advances in data generation with plant breeding platforms 26.
[Fig. 5 caption, partial] …(in a and b), the absolute value of coefficients in PLSR models (x axes in b and c) or error reduction or shrinkage in RPRM models (x axis in a; y axis in c), showed some concordance between models. Concordance of importance scores is greater between (a) tree-based BCRFs and RPRMs than between PLSRs and (b) BCRFs or (c) RPRMs. Likewise, as shown for BCRF models in d, the predictive importance of time-ordered variables (showing the distribution of n = 1,577 non-zero importance scores from 7,875 predictor variables; data presented as median values, with boxes for the IQR and whiskers for 95% CIs) increased throughout the season to a maximum around flowering and grain filling. Highly ranked variables across all models included cumulative total rainfall (blue), latent heat flux (fuchsia), dosage of the pesticide Clethodim (labelled), total applied sulfur fertilizer (labelled) and crop taxa (labelled). Models trained using full-season data to predict yield in t ha⁻¹ are shown.
A major advantage of ML models is the capacity to generate a single-stage, rapidly retrainable and unified model that captures interactions across multiple domains, such as the management-environment interactions shown in Supplementary Figs. 7 and 8. While our focus has been the capture of environmental interactions, our findings highlight the potential of ML to capture interactions across further data layers: metabolomic, transcriptomic, proteomic and large-scale phenomic data are all suitable for inclusion into a unified ML model.
While the promise of ML models is considerable, gains in predictive accuracy are limited by fundamental challenges. For example, concurrent changes in model accuracy across diverse models and species are unlikely to be a result of overfitting (Fig. 4). Rather, these patterns may arise from fundamental model limitations when extrapolating in complex systems 27–29. For example, while ML models generalize well across observed data, they suffer direct constraints on predictive accuracy under pattern extrapolation. This is epitomized by the 'checkerboard problem', where ML models fail to extrapolate the alternating pattern of black and white squares from a smaller to a larger checkerboard 28. The failure of ML models to extrapolate phenotypes to future years, across a stochastic 'checkerboard' of changing El Niño, neutral and La Niña climates, may therefore represent the failure of ML algorithms to extrapolate complex patterns.
Likewise, non-stationarity, a shift in mean and variance over time, places fundamental limits on the accuracy of all statistical models 29 . Cross-model changes in accuracy, when projecting new fields and years (Fig. 4), may reflect the independent or combined role of non-stationarity of management practices, soil diversity, hidden genetic diversity over time or the role of climatic or environmental non-stationarity over time.
As such, shifts in the predictability of crop behaviour across the NVTs raise important questions about system dynamics. If non-stationarity in the Australian climate can drive collapses in the predictability of a complex system, crop behaviour and yield, within the space of a single year (Fig. 4), this has important ramifications for agronomy under climate change. Climates are becoming increasingly non-stationary 30, including in Australian grain-growing regions 31–33, causing a dramatic global loss in the predictability of rainfall patterns 30 and temperatures 34,35.
Non-stationary rainfall patterns are particularly concerning. Already, a quarter of the global land surface area has non-stationary rainfall patterns 36 and this fraction is rising 36. This increasing climate non-stationarity causes crop breeding targets to become intrinsically less predictable, regardless of advances in models or methodology. Crop breeding pipelines, from initial plant crosses to the release of new varieties, require development times of around 10 years 37. If the degree of climatic non-stationarity increases within this horizon and future climatic patterns become less predictable as targets for plant breeding, climate instability may pose a serious challenge to food production systems 38, given fundamental bounds on model predictability 29.
Beyond these limitations, there remains considerable cause for optimism about ML algorithms. As we have shown, ML models can learn and recall the growing season of millions of plants, integrate these data into interpretable models and accurately forecast phenotypic variation. Furthermore, the promise of 'big data' and ML in agriculture is not constrained to gains in prediction accuracy. While improved accuracy from black-box models has enormous utility for share trading or insurance, such models often have more limited scientific and in-field applications. For the greatest utility for farmers and biologists, ML models need to be made accessible and understandable, even at the cost of predictive accuracy.
Predictions from neural networks or deep learning models are often more accurate, yet, with few exceptions 39,40 , are not comprehensible by examination of model dynamics. Even when black-box models produce variable importance rankings (for example, Fig. 5) it is unclear whether high-ranked predictive variables interact with one another or act independently, or in what context they are important, when generating a predictive outcome. In contrast, approaches such as RPRMs and BCRFs allow the understanding and interrogation of internal model dynamics: it is possible to ask why an algorithm generated a specific prediction.
For example, trusting Cauer's conceptual 'black box' to evaluate our models may have led to a surprising conclusion: that the use of herbicides routinely has an inhibitory effect on yield. In the model shown in Supplementary Fig. 7, reduced yield was predicted by the previous application of common herbicides such as Roundup (1.3 t ha⁻¹ yield loss) and Cadence (1.8 t ha⁻¹ yield loss). However, the conclusion that herbicides are yield-inhibitory is probably incorrect. Herbicide application is confounded with pathogens, environmental stress and degraded soils: as such, herbicide use may predict lower yield because herbicide use is more likely under worse growing conditions. Discriminating between these cases depends on careful analysis of model mechanics, a process that is not possible within true black-box models.
Likewise, across the NVTs, zero-yield and extremely low-yield trials are not missing at random but have been actively removed. This violation of the 'missing completely at random' criterion 41,42 has counterintuitive effects that are independent of the applied predictive model. For example, increasing frost severity predicts higher crop yield across trials (Supplementary Fig. 9). This is not necessarily a result of severe frosts causing higher yield but, more likely, occurs because the removal of low- and zero-yield trials has generated a non-random survival bias: better-conditioned, higher-yield crops are more likely to survive a severe frost than are crops in stressed and suboptimal environments. As a result, higher yields may be positively correlated with a worse environment purely as a statistical artefact. Black-box models are no panacea against such subtle issues.
Interpretation of ML models should not, therefore, be reliant on the simple scoring of variable importance. Rather, our results suggest that detailed assessment of the internal mechanics of ML models is a key analytical challenge for scientists who seek to understand, rather than simply predict, biological systems.

Methods
All data were analysed in R v.3.5.1 (ref. 43). Compilation of the NVT data resulted in the location and measurement of 266,033 variety-trial combinations, aggregated from 780,569 field trial plots in 6,547 successful (non-failed) field trials. Linked to these data were >8,000 variables, including ~10,000 field-years of soil samples and hundreds of thousands of chemical and fertilizer doses. A total of 70 phenotypic traits were available, comprising over one and a half million unique phenotypic measurements 5,7. Phenotypes with a high degree of redundancy with higher-frequency traits (for example, pass-fail yield status versus yield) were excluded, and the remaining target phenotypes with a sample size of >45,000 variety-years were used as training targets. In addition, glucosinolate content for canola and flowering time were included as targets, due to high perceived interest and moderately large (>15,000) sample sizes. As a result, seven diverse, agronomically important traits of sufficient data quality and sample size were selected to train ML algorithms: grain protein percentage, days to 50% flowering, percentage glucosinolate oil content, hectolitre weight, 1,000-grain weight, the fraction of grain sieving below 2.0 mm and yield (Supplementary Table 1 and Supplementary Figs. 3 and 4).
Data used to train PLSRs, XGBMs and LSVMs were dummy-coded for factors, and numeric data were transformed to zero mean and unit variance (database descriptor, ref. 5, provides further details). However, tree-based algorithms partition the input space on the basis of the non-transformed target variable and do not require rescaling to avoid model-fit biases. As such, to preserve the direction and magnitude of effects and facilitate interpretable models, non-transformed data were used to train RPRM, BCRF and xvBCRF models.
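A minimal sketch of these two preprocessing paths, using pandas in place of the study's R tooling (column names and values are invented):

```python
# Two preprocessing paths: dummy-coded, standardized inputs for the
# linear/kernel models versus untransformed inputs for tree-based models.
# Column names and values are illustrative only.
import pandas as pd
from sklearn.preprocessing import StandardScaler

raw = pd.DataFrame({
    "species": ["wheat", "canola", "wheat", "barley"],
    "rain_mm": [310.0, 250.0, 410.0, 180.0],
    "n_dose": [50.0, 0.0, 80.0, 20.0],
})

# PLSR / XGBM / LSVM path: dummy-code factors, then zero-mean, unit-variance scale.
dummied = pd.get_dummies(raw, columns=["species"], dtype=float)
scaled = pd.DataFrame(StandardScaler().fit_transform(dummied), columns=dummied.columns)

# Tree-based path (RPRM / BCRF): factors still need a numeric encoding, but
# magnitudes are untouched, so split thresholds keep their original units
# (for example, mm of rain), which preserves interpretability.
tree_input = dummied

print(scaled["rain_mm"].round(2).tolist())  # → [0.27, -0.44, 1.45, -1.27]
```

The trade-off mirrors the paragraph above: scaling stabilizes coefficient-based and kernel methods but erases the natural units in which a decision rule such as "rainfall < 310 mm" is readable.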
All missing data had previously been imputed using two approaches, described in the associated database descriptor 5, for both untransformed and unit-variance, zero-mean transformed data. Before this imputation, missing data were concentrated in metadata and field-trial comments measured in small subsets of trials, such as disease scores and animal-damage scores. Only 0.05% of all satellite data were missing, generally as a result of a pixel failing the quality-control screening of a NASA data-product algorithm. Broader environmental data, including ground-sensor arrays and weather-station data, were 0.7% missing. Here, the variance in model accuracy arising from noise and error in these existing imputations was evaluated by applying ML models to all imputed datasets, predicting site-mean yield and measuring model accuracy (Supplementary Fig. 10 and Supplementary Code 1).
Models were trained using default tuning and input parameters wherever appropriate. Hyperparameters for model training were subjected to minimal training and optimization, with parameters given in Supplementary Table 7 and package versions in Supplementary Codes 1 and 2. All models were subjected to tenfold internal crossvalidation, with identical training and target data across models, to allow direct comparisons of accuracy.
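Using identical folds for every model is what makes the crossvalidated accuracies directly comparable. A minimal sketch, assuming scikit-learn, synthetic data and two stand-in models (the actual models and data differ):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
x = rng.normal(size=(300, 8))
y = x[:, 0] * 2.0 - x[:, 1] + rng.normal(scale=0.3, size=300)

# One fixed tenfold split shared by every model, so differences in
# crossvalidated accuracy reflect the models, not the folds.
folds = KFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "ridge": Ridge(),
    "gbm": GradientBoostingRegressor(random_state=0),
}
mean_r2 = {
    name: cross_val_score(m, x, y, cv=folds, scoring="r2").mean()
    for name, m in models.items()
}
```

Because `folds` is instantiated once with a fixed seed, each model sees exactly the same train/test partitions.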
Training targets. Yield models were initially trained to predict two holdout samples: a 'holdout trial' set of 100 field trials randomly sampled from the years 2008 to 2017 and an 'annual forecast' sample consisting of all data from 2018 (Fig. 2). Models were trained on data from 2008 to 2017, excluding the random holdout trials, and used to predict both the holdout trials and the 2018 data. Random holdout trials did not overlap with training fields and were separated from training field trials by at least 9 months and an average of 4.5 yr, with an average of 2.05 km separating a test site from the closest trial in the training set. For the rolling forecast models presented in Fig. 4, no holdout trial sample was selected: instead, models were trained on all data before each successive cutoff year and tested on the next year's data.
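The rolling forecast design can be sketched as a simple loop over cutoff years: train on everything strictly before the cutoff, test on the cutoff year itself. Everything below is synthetic and illustrative; the real predictors, model and year range come from the NVT data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Toy trial table with a year column; values are synthetic.
years = rng.integers(2008, 2019, size=500)  # 2008..2018
x = rng.normal(size=(500, 4))
y = (
    x @ np.array([1.0, -0.5, 0.25, 0.0])
    + 0.01 * (years - 2008)          # weak trend across years
    + rng.normal(scale=0.2, size=500)
)

# Rolling forecast: train on all data before each successive cutoff
# year, test on that year's data.
rolling_r2 = {}
for cutoff in range(2012, 2019):
    train = years < cutoff
    test = years == cutoff
    model = LinearRegression().fit(x[train], y[train])
    rolling_r2[cutoff] = model.score(x[test], y[test])
```

This split guarantees the test year is never seen during training, mimicking a true forecast.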
In-season rolling forecasts were developed using all data available before a given time relative to sowing, using these data to predict end-of-season yield in canola and wheat only. The three most functionally distinct methods were used: tree-based naive BCRFs trained in ranger, kernel-based LSVMs and projection-based PLSRs (package versions in Supplementary Code 1). As in previous models, PLSR models were trained using tenfold crossvalidation, in ten segments, using default loss criteria (Supplementary Code 2). Due to greater sample-size constraints, these variety-specific models were subjected to random holdout trial prediction only (Fig. 2).

Algorithms and hyperparameters. Hyperparameters used in model training tasks are given in Supplementary Table 7 and Supplementary Codes 1 and 2.
Random forest models were constructed using three different approaches: naive BCRFs trained using the ranger implementation 44, BCRFs cross-validated by calendar year such that test data were never in the same year as training data 15 (xvBCRFs) and XGBMs. To preserve the magnitude of changes to leaf-node averages caused by decision points (for example, Supplementary Figs. 7 and 8) and the heuristics derived from these decisions (Supplementary Tables 5 and 6), RPRMs, BCRFs and xvBCRFs were trained using non-scaled data. Unlike other ML methods, these tree-based partitioning methods are unaffected in accuracy by scaling.
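The year-wise crossvalidation behind the xvBCRF models can be sketched with a leave-one-group-out splitter, using calendar year as the group, so that no test trial shares a year with any training trial. The actual models were trained in R with ranger; this is an illustrative scikit-learn stand-in on synthetic data, trained on raw, unscaled inputs as described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(4)
years = np.repeat(np.arange(2013, 2018), 60)  # 5 years x 60 trials
x = rng.normal(size=(300, 5))
y = x[:, 0] - 0.5 * x[:, 2] + rng.normal(scale=0.3, size=300)

# Each fold holds out one calendar year, so test data are never in the
# same year as training data.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
year_fold_r2 = cross_val_score(
    forest, x, y, groups=years, cv=LeaveOneGroupOut(), scoring="r2"
)
```

Grouping by year guards against information leaking between trials that share a season's weather.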
XGBMs, LSVMs 19 and PLSRs 20 were trained using data rescaled to zero mean and unit variance, with factors excluded (for example, unstructured crop comment fields and unique trial identifiers) or dummy-coded (for example, varieties observed in >1,000 plots, breeder names, experimental series, trial operators, trial comments common to >1,000 trials and crop species; Supplementary Codes 1 and 2).
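Dummy-coding only factor levels above a frequency threshold, as for varieties observed in >1,000 plots, can be sketched as follows. Variety names and counts are invented for illustration.

```python
import pandas as pd

# Illustrative plot table; variety names and counts are hypothetical.
plots = pd.DataFrame({
    "variety": ["Mace"] * 1500 + ["Scepter"] * 1200 + ["RareLine"] * 30,
})

# Dummy-code only levels observed in >1,000 plots; rarer levels are
# left out rather than given their own column.
counts = plots["variety"].value_counts()
common = counts[counts > 1000].index
for level in common:
    plots[f"variety_{level}"] = (plots["variety"] == level).astype(int)
```

Thresholding keeps the design matrix from exploding with thousands of near-empty indicator columns for rarely observed varieties.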
The LSVM regression models were constructed via a grid-based stepwise search over fixed gamma and lambda values defined in Supplementary Table 7. Each LSVM was subject to tenfold internal crossvalidation, tuned over a 10 × 10 hyperparameter grid, using the liquidSVM package 19 v.1.2.1.
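The grid search with internal tenfold crossvalidation can be sketched as follows. liquidSVM is an R/C++ package, so scikit-learn's SVR is used here as a stand-in, with `C` playing the role of an inverse regularization lambda; the 10 × 10 gamma/lambda grid is shrunk to 3 × 3 to keep the sketch fast, and all data are synthetic.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(5)
x = rng.normal(size=(200, 4))
y = np.sin(x[:, 0]) + 0.5 * x[:, 1] + rng.normal(scale=0.1, size=200)

# Grid of fixed gamma and (inverse-lambda) C values, each configuration
# scored by tenfold internal crossvalidation.
param_grid = {"gamma": [0.01, 0.1, 1.0], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="r2",
)
search.fit(x, y)
best_params = search.best_params_
```

The best configuration is the one with the highest mean crossvalidated R2 over the grid.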
Importance and heuristic rule reduction. Variable importance scores were returned using default heuristics from the RPRM and BCRF models, and variable importance was approximated in PLSR models by dividing coefficients by the sum of the absolute value of the coefficient matrix (Fig. 4 and Supplementary Code 1). To reveal common interactions predictive of phenotypic variation, xvBCRF models were subjected to the analytical pipeline described in Deng 13. This approach aggregates the frequency of all subtrees of decisions within random forests to reveal the most common predictive decision sequences, or paths (Supplementary Code 2). This set of decision paths is then pruned by treating common decision paths as features, retraining a regularized random forest using these and all previous features and ranking the importance of these decision paths in the subsequent model. By treating an interaction as a feature, the importance of complex dependencies between variables may be explicitly stated, in a way that is impossible with black-box models. Therefore, all predictively valuable decision paths were captured for all cross-species and species-specific xvBCRF yield models (Supplementary Code 2).
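The core idea of aggregating subtree frequencies can be sketched by counting two-step decision paths (parent split then child split) across a forest. This is a much-simplified, hypothetical stand-in for Deng's full pipeline, which also prunes paths and reranks them as features in a regularized forest; synthetic data with a built-in interaction are used here.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
x = rng.normal(size=(300, 4))
# Built-in interaction: feature 0 only matters when feature 1 is high.
y = np.where(x[:, 1] > 0, 2.0 * x[:, 0], 0.0) + rng.normal(scale=0.2, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(x, y)

# Count (parent feature, child feature) split pairs across all trees:
# frequently co-occurring split pairs hint at predictive interactions.
path_counts = Counter()
for est in forest.estimators_:
    tree = est.tree_
    for node in range(tree.node_count):
        f_parent = tree.feature[node]
        if f_parent < 0:  # leaf node
            continue
        for child in (tree.children_left[node], tree.children_right[node]):
            f_child = tree.feature[child]
            if f_child >= 0:
                path_counts[(f_parent, f_child)] += 1

most_common_path, _ = path_counts.most_common(1)[0]
```

Treating such a frequent path as a single feature is what allows its importance, and hence the interaction it encodes, to be stated explicitly.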
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All data are available from the Supplementary Information, the linked database descriptor publication 5 uploaded to Scientific Data and the figshare 7 repository, after screening under our own extensive imputations and quality controls, and are freely available for research or non-commercial purposes under a CC-BY-NC 3.0 license. Some data available in this repository 5 are, alternatively, available as a dataset on the Grains Research and Development Corporation website published under a CC-BY-NC 3.0 AU license. The dataset which forms the basis (in whole or part) for this paper is based predominantly on data sourced from the Grains Research & Development Corporation (GRDC) and GRDC's extensive investment in the collection, development and presentation of that dataset is acknowledged. GRDC did not authorise the reproduction, publication or communication of the dataset. The dataset has not been subject to GRDC's quality control processes, and does not include updates and corrections that have been made to the dataset and as such may be unreliable. Results of research based on the dataset should not be relied on for any purpose. Any person wishing to conduct research using National Variety Trials (NVT) data must approach GRDC directly with a research proposal, noting that terms and conditions may apply. Alternatively, the extensively quality-controlled and imputed figshare repository remains freely accessible for research under a creative commons license 7.

Code availability
All code is available in Supplementary Codes 1 and 2.

Fig. 1 | Environmental and vegetation patterns captured by remote sensing. Remote-sensing arrays capture patterns across a season, from sowing (red lines) through anthesis or flowering (blue lines) to harvest (orange lines), in diverse variables such as photosynthesis and leaf area (top row), vegetation indices and land surface temperatures (second row), cloud cover and raw reflectance values (third to fifth rows) and infrared bands, latent heat flux and fire events (fifth row). Variation is shown for a single site (Turretfield 2010; main-season planting); abbreviations as in Supplementary Table 2.