Abstract
But it remains largely unclear to what extent these microbiomes contribute to trait variation for different genotypes and if their inclusion in the genomic selection (GS) protocol can enhance prediction accuracy. To address this, we developed a microbiome-enabled GS (MEGS) model that incorporated host SNPs and ASVs (amplicon sequence variants) from plant root-associated microbiomes in a maize diversity panel under high and low nitrogen (N) field conditions. Our results showed that the MEGS model significantly outperformed the conventional GS model for nearly all time-series traits related to plant growth and N responses, with an average relative improvement of 4%. This improvement was more significant for traits measured near the microbiome data collection date and was more pronounced under low N conditions, as some beneficial microbes can enhance N nutrient uptake, particularly in low N conditions. Our study also identified mediator microbes, such as Massilia putida, which were previously reported to promote plant growth under low N conditions. These large-effect ASVs or microbial agents could be applied as biofertilizers or soil additives to enhance crop performance for sustainable agriculture.
Introduction
An increasing number of studies have suggested that plant-associated microbial communities, especially the microbial species colonizing plant roots, can stimulate plant growth (Saleem et al. 2019), enhance nutrient availability in soils (Gomes et al. 2018; Zhu et al. 2016), and decrease abiotic stress responses (Hussain et al. 2018; Xu et al. 2018). Harnessing these beneficial microbes in crop production provides a promising opportunity for crop improvement to fight against climate challenges, reduce dependency on chemical fertilizers, and boost genetic gain. Indeed, from the beginning of plant domestication, root-associated microbes were reported to be involved in crop performance (Soldan et al. 2021). For example, studies have shown that domesticated plants exhibited distinct microbial compositions compared to their wild ancestors and showed a reduced ability to establish symbiotic relationships with beneficial microbiomes (Abdelfattah et al. 2022; Abdullaeva et al. 2021). Recent crop improvement further reduced the microbial diversity in a number of different crop species, including wheat (Hetrick et al. 1992), maize (Sangabriel-Conde et al. 2014), and soybean (Kiers et al. 2007). Given the importance of beneficial microbiomes to crop production, recent efforts have screened for beneficial microbes as potential seed additives to promote plant performance (Singer et al. 2021; Yee et al. 2021). Though promising results were found in controlled environments (Eida et al. 2017; Kaur et al. 2020; Sessitsch et al. 2019), it is difficult for many microbial inoculants to survive for a long period of time under field conditions (Piromyou et al. 2011). Considering the collective genomes of microbial species colonized in plants as the secondary genome of a plant (Berendsen et al. 2012), a hologenome (i.e., plant genome with its associated endocellular or extracellular microbiome) approach to incorporate the naturally occurring genotype-specific microbiomes into the genomic selection (GS) protocol provides an alternative strategy to improve tomorrow’s crops.
GS, an innovative plant and animal breeding technology, enables the selection of promising individuals (and the associated heritable microbiomes) to advance to the next generation before or without phenotyping, thereby reducing the generation interval and increasing the genetic gain per unit of time. After the landmark GS study by Meuwissen et al. (2001), animal and plant breeders embraced GS, and its adoption accelerated with the development of statistical methods and computational tools (de Koning 2016), including G-BLUP (genomic best linear unbiased prediction), RR-BLUP (ridge regression best linear unbiased prediction), and the Bayesian regression models (BayesA, BayesB, BayesCπ, Bayes LASSO, etc.) (Krishnappa et al. 2021; Wang et al. 2015; Burgueño et al. 2012; Wang et al. 2018; Gianola et al. 2009; Habier et al. 2011). The Bayesian regression models (with different assumptions for prior distributions of marker effects) allow genome-wide markers to have different effects and variances and usually achieve high prediction accuracy (Wang et al. 2018). However, Bayesian regression models might be more sensitive to the number of QTLs than the RR-BLUP and G-BLUP methods (Wang et al. 2015). In addition to these parametric methods, non-parametric methods, such as reproducing kernel Hilbert space (RKHS) (Gianola et al. 2006) and deep learning (DL) based methods (Gianola et al. 2011), are likely to be more efficient at capturing non-additive genetic effects than conventional methods, although they show no clear superiority in prediction power (Montesinos-López et al. 2021). Despite the rapid development of GS methods in the past 20 years, little attention has been paid to incorporating host-associated microbiomes into the prediction protocols (but see a recent simulation study in dairy cattle (Pérez-Enciso et al. 2021)).
For microbiome-enabled GS (MEGS) modeling, it is straightforward to treat different microbes as random variables in a linear mixed model, similar to the SNP markers used for conventional GS. However, unlike SNPs, microbiomes are dynamic and profoundly affected by root exudates, which are composed of organic acids, polysaccharides, and other metabolites (Canarini et al. 2019). The microbiomes differentially recruited by different genotypes may, in turn, affect soil physicochemical characteristics and nutrient bioavailability for plants, e.g., nitrogen (N) availability, and eventually regulate plant physiological processes and lead to phenotypic variation. Under such a scenario, the microbiomes can be modeled as an intermediate process that bridges the host genotype and phenotype. Previously, a GS framework integrating omics data as the intermediate process was developed by extending the conventional linear mixed model to a neural network (Zhao et al. 2022). A similar concept implemented in an associated model allows us to conduct genome-wide mediation analysis to identify large-effect intermediate mediators (Yang et al. 2022a).
In this study, by leveraging the conventional GS methods and our recently developed mediation models, we conducted a MEGS experiment using a published dataset (Meier et al. 2022) collected on the maize diversity panel, a panel representing maize genetic diversity in temperate latitudes (Flint-Garcia et al. 2005). The root-associated microbiomes were collected under high N (HN) and low N (LN) field conditions. From the same fields, phenotypic data were collected using an unmanned aerial vehicle (UAV) in a time-series manner (Rodene et al. 2022). In our MEGS, a linear mixed model was used to predict maize phenotype by including both host genotypes and root-associated microbiomes, where SNP marker effects and microbiome effects were treated as random effects with different variance components. This MEGS method serves as an initial trial to combine genome and microbiome data to improve prediction accuracy and to screen for beneficial microbes that can be used as seed additives in field conditions. Overall, our study highlights the potential of MEGS for improving crop performance and underscores the importance of considering plant-microbe interactions in the development of sustainable agricultural practices.
Materials and methods
Microbiome and phenotype data in the maize diversity panel
The root-associated microbiome data were obtained from our previously published study, in which samples were collected from a subset of the maize diversity panel (n = 230 genotypes) eight weeks after planting in both high N and low N field conditions (Meier et al. 2022). The microbiome data included 3,626 amplicon sequence variants (ASVs) that can be clustered into 150 microbial groups. These microbial groups spanned 19 major classes of rhizosphere microbiota. In summary, the relative abundances of 3,626 ASVs were collected for n = 795 observations.
Meanwhile, high-throughput phenotyping data were collected from the same field in a time-series manner using a UAV (Rodene et al. 2022). After image analysis at the plot level, a number of vegetation indices (VIs) were obtained, some of which were highly correlated with conventional agronomic traits, such as leaf N content and 20-kernel weight. Because environmental conditions differed among the dates on which VIs were collected, data quality varied. Here we selected the data collected 11, 21, and 35 days after microbiome sampling, which were of similarly high quality. More details of the phenotypic data can be found in Rodene et al. (2022).
Genotypic data processing and linkage disequilibrium (LD) pruning
The genotypic data for maize HapMap V3.2.1 (with imputation, AGPv4) were obtained from the Panzea database (https://www.panzea.org/genotypes) (Bukowski et al. 2017). Using the PLINK software (Purcell et al. 2007), we merged the variants on different chromosomes and retained the bi-allelic SNPs only. We then performed SNP filtration by discarding variants with a missing rate > 0.3 across lines and variants with a minor allele frequency (MAF) < 0.05, resulting in a subset of 22.5 million SNPs. Then, LD-based SNP pruning was performed by calculating LD between each pair of SNPs in a window of 10 kb, and one SNP of a pair was removed if the LD was greater than 0.1. We then shifted the window 10 bp forward and repeated the procedure, resulting in a final subset of 770k SNPs.
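The filtering and pruning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the PLINK implementation used in the study: the function names `filter_snps` and `ld_prune` are hypothetical, genotypes are assumed to be coded as allele dosages (0, 1, 2), LD is assumed to be measured as the squared correlation between dosages, and the pruning is a simple greedy pass by physical distance instead of PLINK's sliding-window procedure.

```python
import numpy as np


def filter_snps(G, max_missing=0.3, min_maf=0.05):
    """Boolean mask of SNPs that pass the missing-rate and MAF filters.

    G is a (lines x SNPs) array of allele dosages (0, 1, 2); np.nan marks
    a missing call. The coding scheme is an assumption of this sketch.
    """
    called = ~np.isnan(G)
    n_called = called.sum(axis=0)
    missing_rate = 1.0 - n_called / G.shape[0]
    # Allele frequency from non-missing calls only (dosage sum / 2N).
    dosage_sum = np.where(called, G, 0.0).sum(axis=0)
    freq = np.full(G.shape[1], np.nan)
    np.divide(dosage_sum, 2.0 * n_called, out=freq, where=n_called > 0)
    # SNPs with no calls at all get MAF 0 and therefore fail the filter.
    maf = np.nan_to_num(np.minimum(freq, 1.0 - freq), nan=0.0)
    return (missing_rate <= max_missing) & (maf >= min_maf)


def ld_prune(G, pos, window_bp=10_000, r2_max=0.1):
    """Greedy LD pruning on a complete (no missing values) dosage matrix.

    SNPs must be sorted by position `pos` (bp). For each retained SNP, any
    later SNP within `window_bp` whose squared correlation with it exceeds
    `r2_max` is dropped.
    """
    n_snps = G.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for i in range(n_snps):
        if not keep[i]:
            continue
        j = i + 1
        while j < n_snps and pos[j] - pos[i] <= window_bp:
            if keep[j]:
                r = np.corrcoef(G[:, i], G[:, j])[0, 1]
                if r * r > r2_max:
                    keep[j] = False
            j += 1
    return keep
```

In practice the two masks would be applied in sequence (filter first, then prune the surviving SNPs chromosome by chromosome); at the scale of millions of SNPs a dedicated tool such as PLINK is the realistic choice.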
Linear mixed model for microbiome-enabled genomic prediction
We conducted the MEGS using the rrBLUP software (Endelman 2011), where 50,000 randomly selected SNPs and 3,626 reproducible ASVs were included simultaneously (Meier et al. 2022). Below is the model used for genomic prediction with both maize SNPs and root-associated ASVs:

$$ y_{ijk} = \mu + f_i + b_j + \sum_{l=1}^{n} \alpha_l g_{kl} + \sum_{t=1}^{s} \gamma_t m_{ijkt} + \epsilon_{ijk} $$

where $y_{ijk}$ is the observation of phenotype for the kth genotype in the jth block with the ith N treatment level, and there are 795 observations in total; $\mu$ is the intercept; $f_i$ is the fixed effect of the ith N treatment (i = 1, 2); $b_j$ is the fixed effect of the jth block (j = 1, 2); $\alpha_l$ is the random coefficient of the lth SNP; $g_{kl}$ is the value of the lth SNP for the kth genotype (l = 1, …, n, where n = 50,000 is the total number of SNPs); $\gamma_t$ is the random coefficient of the tth ASV (t = 1, …, s, where s = 3,626 is the total number of ASVs); $m_{ijkt}$ is the value (log relative abundance from 16S sequencing) of the tth ASV for the kth genotype in the jth block with the ith N level; and $\epsilon_{ijk}$ is the residual error.
In the model, we assumed that the random coefficients of the lth SNP ($\alpha_l$), the tth ASV ($\gamma_t$), and the residual error ($\epsilon$) are independent variables following normal distributions with a mean of zero and variances $\sigma^2_{\alpha}$, $\sigma^2_{\gamma}$, and $\sigma^2_{\epsilon}$, respectively, each estimated from the data (e.g., for the VARI trait at 11 days after microbiome sampling). To estimate the marker effect variance $\sigma^2_{\alpha}$, we fitted an RR-BLUP model with only SNPs as random effects, without ASVs. Similarly, to estimate the ASV effect variance $\sigma^2_{\gamma}$, we fitted an RR-BLUP model with only ASVs as random effects and the first three principal components (PCs) of SNPs as fixed effects to control for the genomic background effect.
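To make the two-variance-component structure concrete, the sketch below computes best linear unbiased predictions of SNP and ASV effects when the three variance components are already known. It is an illustration under stated assumptions, not the rrBLUP code used in the study: `megs_blup` and `predict` are hypothetical names, `X` is assumed to hold the fixed-effect design (intercept, N treatment, block), and the variances are passed in as numbers, mirroring the separate estimation described above.

```python
import numpy as np


def megs_blup(y, X, Z, M, var_snp, var_asv, var_e):
    """BLUP of SNP and ASV effects for y = X b + Z alpha + M gamma + e.

    y: (n,) phenotypes; X: (n, p) fixed-effect design matrix;
    Z: (n, q) SNP matrix; M: (n, s) ASV (log relative abundance) matrix.
    var_snp, var_asv, var_e: variance components, assumed known.
    """
    n = len(y)
    # Marginal covariance of y implied by the two random terms.
    V = var_snp * (Z @ Z.T) + var_asv * (M @ M.T) + var_e * np.eye(n)
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    # Generalized least squares estimate of the fixed effects.
    beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)
    # BLUPs: each random effect is its variance times design' V^{-1} (y - X beta).
    r = np.linalg.solve(V, y - X @ beta)
    alpha = var_snp * (Z.T @ r)
    gamma = var_asv * (M.T @ r)
    return beta, alpha, gamma


def predict(X, Z, M, beta, alpha, gamma):
    """Predicted phenotype from fixed effects, SNP effects, and ASV effects."""
    return X @ beta + Z @ alpha + M @ gamma
```

Working with the n × n matrix `V` keeps the linear solves small when the number of observations (795 here) is far below the number of markers (50,000 SNPs plus 3,626 ASVs), which is the usual reason to prefer this form over solving for all marker effects directly.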
Mediation analysis by considering microbes as intermediate variables
Mediation analysis introduces a variable called a mediator to infer the underlying mechanism of the relationship between an independent variable and a dependent variable (Baron and Kenny 1986). Similar to our previous study (Yang et al. 2022b), here we conducted genome-wide mediation analysis using those 50,000 SNPs as exposures, 3,626 ASVs as microbe mediators, and the first three principal components (PCs) as confounders. We chose to use the “MedFix” algorithm, which showed higher power in our previous study (Yang et al. 2022b). The mediation analysis consists of two models: the mediator model and the outcome model. Specifically, the mediator model is shown as follows:

$$ M_j = Q A_j + Z B_j + e_j $$

where $M_j$ (an n × 1 vector) represents the abundance of the jth ASV; $Q$ (an n × s matrix) is the first s PCs of genotypes (s = 3 in our analysis); $A_j$ (an s × 1 vector) is the coefficients of the first s PCs to the jth ASV; $Z$ (an n × q matrix) represents the SNP set of the population composed of n individuals with q SNPs; $B_j$ (a q × 1 vector) is the coefficients of the q SNPs to the jth ASV (j = 1, …, p); and $e_j$ is the vector of the residual errors, assumed to follow a zero-mean normal distribution.
Additionally, we fitted the SNPs and the microbe mediators (i.e., ASVs) in the outcome model, which uses phenotype as the response variable, as shown below:

$$ y = Q v + Z a + M c + e $$

where $y$ (an n × 1 vector) represents the phenotype; $Q$ (an n × s matrix) and $Z$ (an n × q matrix) are the same matrices as in the above mediator model; $v$ (an s × 1 vector) is the coefficients of the first s = 3 PCs; $a$ (a q × 1 vector) denotes the coefficients of the SNPs to the phenotype; $M$ (an n × p matrix) is the abundance of the ASVs (i.e., a matrix combining all the vectors $M_j$); $c$ (a p × 1 vector) is the coefficients of ASVs to the phenotype; and $e$ is the vector of the residual errors, assumed to follow a zero-mean normal distribution.
The SNPs are determined as indirect SNPs for the jth ASV if $p_{B_j}$ (the p-values of the χ2 likelihood ratio test for the corresponding $B_j$) are smaller than 0.05; and the jth ASV is determined as a mediator if max($p_{B_j}$, $p_{c_j}$), where $p_{c_j}$ is the p-value of the z-test for the jth element of $c$, is smaller than 0.05, with a step-down procedure to control for the false discovery proportion (Zhang 2021).
Results
Incorporating microbiome increased prediction accuracy for host phenotype
Plant roots are colonized by tens of thousands of microbial species, many of which are heritable (Peiffer et al. 2013; Meier et al. 2021). To test whether these root-associated microbiomes can be leveraged to predict the plant phenotypic performance, we obtained the composition of microbiomes quantified on a diversity panel of 230 maize inbred lines under high N (HN) and low N (LN) field conditions (Meier et al. 2022). Additionally, we obtained the image-extracted vegetation index traits collected from the same field (Rodene et al. 2022). Because most of the vegetation index traits are highly correlated (see Figure S1), we focused the analysis on one of the most representative vegetation indices — Visible Atmospherically Resistant Index (VARI) — collected on three dates, i.e., 11, 21, and 35 days after microbiome sampling (DAMS) in the field.
We fitted the ASVs (microbiome) and SNPs (host genome) as the explanatory variables with random effects in a linear mixed model (Materials and Methods). After 20 randomized five-fold cross-validations, results suggested that incorporating ASVs into the model significantly increased the GS prediction accuracy compared to the conventional SNP-only model for all three dates (Figure 1A). The overall prediction accuracy decreased across the successive dates, likely due to the low heritability of VARI traits as plants approached senescence. To ensure that the greater prediction accuracy was not simply due to the larger number of explanatory variables, we randomly shuffled the ASVs to create an equally sized set of dummy variables. We found that the extra shuffled dummy variables had no significant effect on prediction accuracy (Figure 1A). Compared with the conventional GS model using only SNPs, the percentages of improvement by adding ASVs were 4.6%, 4.6%, and 3.1% on 11, 21, and 35 DAMS, respectively (Figure 1B). Even compared with results using randomly shuffled ASVs, an average of 4.1% improvement was achieved. The improvement in prediction accuracy was comparatively greater for traits at 11 and 21 days after microbiomes were sampled in the field and became much lower for traits after 35 days, likely because the effects of microbiomes on the plant phenotype were diminishing. Similar results were also observed for other image-extracted traits (see Figure S2).
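The evaluation scheme described above (20 randomized five-fold cross-validations, plus a shuffled-ASV control) can be sketched as follows. Several details are assumptions of this sketch and are not stated in the text: prediction accuracy is taken to be the Pearson correlation between predicted and observed values, the shuffle is a row permutation of the ASV matrix, and a plain ridge regression with a hypothetical penalty `lam` stands in for the mixed model. All function names are illustrative.

```python
import numpy as np


def repeated_kfold_indices(n, k=5, repeats=20, seed=0):
    """Yield (train, test) index arrays for `repeats` randomized k-fold splits."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(n), k)
        for i in range(k):
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            yield train, folds[i]


def shuffle_rows(M, seed=0):
    """Permute ASV profiles across observations (assumed control scheme).

    Each ASV keeps its own distribution, but its link to the phenotype of
    the observation it came from is broken.
    """
    rng = np.random.default_rng(seed)
    return M[rng.permutation(M.shape[0])]


def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Ridge regression in its n x n (dual) form, as a stand-in predictor."""
    mu = y_train.mean()
    K = X_train @ X_train.T
    a = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train - mu)
    return mu + X_test @ (X_train.T @ a)


def cv_accuracy(y, features, fit_predict, **split_kwargs):
    """Mean Pearson correlation between predictions and held-out phenotypes."""
    accs = []
    for train, test in repeated_kfold_indices(len(y), **split_kwargs):
        pred = fit_predict(features[train], y[train], features[test])
        accs.append(np.corrcoef(pred, y[test])[0, 1])
    return float(np.mean(accs))
```

Under these assumptions, the three models compared in Figure 1A would correspond to calling `cv_accuracy` with the SNP matrix alone, with SNPs and ASVs stacked column-wise (`np.hstack`), and with SNPs stacked with `shuffle_rows` applied to the ASV matrix.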
Prediction improvement was predominantly under low N conditions
To investigate the effect of N treatment on MEGS results, we fitted the model separately under high N and low N conditions. In the high N conditions, the MEGS model did not perform better than the conventional SNP-only model, except for a slight improvement on 21 DAMS (Figure 2A). Interestingly, in the low N conditions, significant improvements were observed for all three days. Specifically, the prediction accuracy increased from 56.3% to 61.1% on 11 DAMS, 44.8% to 56.0% on 21 DAMS, and 24.2% to 34.5% on 35 DAMS (Figure 2B). These findings support the notion that microbiomes can enhance N nutrient bioavailability in the soil (Jacoby et al. 2017), and, therefore, may serve as functional markers to improve the prediction of plant phenotypes.
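For comparison with the relative improvements reported for the combined analysis, the low N accuracies above can be converted to relative gains. The short calculation below assumes that relative improvement is defined as (new − baseline) / baseline; that definition and the function name are assumptions of this sketch, and the resulting percentages are derived here from the reported accuracies, not taken from the original figures.

```python
# Prediction accuracies (%) under low N: (SNP-only, SNP + ASV) per DAMS,
# copied from the text above.
low_n = {11: (56.3, 61.1), 21: (44.8, 56.0), 35: (24.2, 34.5)}


def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


for dams, (snp_only, megs) in low_n.items():
    gain = relative_improvement(snp_only, megs)
    print(f"{dams} DAMS: {gain:.1f}% relative improvement")
```

With this definition the gains come to roughly 8.5%, 25.0%, and 42.6% on 11, 21, and 35 DAMS. The large relative gain on 35 DAMS reflects the low SNP-only baseline on that date, so absolute and relative improvements should be read together.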
Microbiomes show larger effects in earlier days after microbiome sampling
In order to better understand the impact of specific microbes on prediction, we evaluated the effect sizes of the ASVs and SNPs included in the prediction model (Figure 3). As expected, the overall SNP effects remained consistent across different dates (Figure 3B). However, the effects of ASVs declined as the number of days between initial microbiome sampling and plant phenotype collection increased (Figure 3A). We identified four ASVs that consistently had relatively large effects across all dates (i.e., they were among the top 1% of absolute coefficients listed in Supplementary Table S1). These ASVs were annotated as Candidatus Udaeobacter copiosus, Massilia niabensis, Pseudomonas parafulva, and Sphingomonas limnosediminicola. Interestingly, Pseudomonas parafulva (Oteino et al. 2015; Preston 2004) and Massilia niabensis (Yu et al. 2021) have previously been reported to promote plant growth. Some of the other top 1% ASVs, which were only detected on certain dates, may also be important candidates for further investigation. For example, Bacillus fumarioli was only detected as one of the top 1% ASVs on 11 DAMS. Previous studies have suggested that Bacillus fumarioli is likely to be under host selection (Meier et al. 2022) and may promote plant growth (Kumar et al. 2012).
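The selection of ASVs that stay in the top 1% of absolute coefficients on every date can be expressed compactly. The sketch below is illustrative only: the function names are hypothetical, ASVs are identified by their column index, and the top-1% cutoff is rounded up (37 of 3,626 ASVs), which may differ from the rounding used for Supplementary Table S1.

```python
import numpy as np


def top_fraction(effects, frac=0.01):
    """Indices of the ASVs with the largest absolute effects (top `frac`)."""
    k = max(1, int(np.ceil(frac * len(effects))))
    order = np.argsort(-np.abs(effects))
    return set(order[:k].tolist())


def consistent_top(effects_by_date, frac=0.01):
    """ASV indices that fall in the top fraction on every date.

    effects_by_date maps a date label (e.g. DAMS) to a 1-D array of ASV
    coefficients, all in the same ASV order.
    """
    sets = [top_fraction(e, frac) for e in effects_by_date.values()]
    return set.intersection(*sets)
```

The same helper also yields the date-specific candidates: the set difference between one date's top fraction and the consistent set gives ASVs, such as Bacillus fumarioli on 11 DAMS, that rank highly on that date only.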
To identify shared features of the ASVs with the largest effects, we compared the top 1% ASVs with all the other ASVs with regard to their heritability and selection scores, as calculated by Meier et al. (2022). The results indicated that although the top 1% ASVs were not more heritable under either high or low N field conditions, they had significantly higher selection scores under low N conditions on 11 and 21 DAMS (Figure S3). These findings are consistent with the greater improvement in prediction accuracy observed in the low N field due to the microbiome. Furthermore, they suggest that plant hosts may have a genetic mechanism for recruiting certain microbes to mitigate low N stress.
Microbiome mediation analysis identified Massilia putida as a promising microbe mediator
In order to establish a causal chain from plant genotype to microbiome to plant phenotype, we sought to model ASVs as the intermediate mediators that are selectively recruited by different plant genotypes and have a significant effect on plant phenotype. Therefore, we supplemented the MEGS with our previously developed genome-wide mediation analysis (Yang et al. 2022a). Through this approach (see Materials and Methods), we identified five unique ASVs that act as mediators for 8 VIs (Supplementary Table S2). Among these, Massilia putida and Burkholderia pseudomallei were the top 1% largest effect ASVs in the MEGS model.
To examine whether there were significant differences in the accumulation of Massilia putida among different genotypes, we divided the population based on SNP genotype at the largest-effect indirect SNP (or iSNP, an SNP that is significantly associated with the relative abundance of an ASV). The iSNP 1-240925424, located in the gene body of GDP-mannose-3’5’-epimerase1 (Zm00001d032950), has a minor allele of T and a major allele of A. Our microbiome mediation analysis revealed significant differences in Massilia putida abundance between the two alleles under both high N and low N conditions (Figure 4A). To investigate the effects of Massilia putida on plant phenotype, we performed Spearman’s rank correlation test to determine if there was a significant correlation between the abundance of Massilia putida and the VARI phenotype. Under high N conditions, the correlations were weak: slightly negative on 11 DAMS (r = −0.16, p-value = $3.2 \times 10^{-3}$) and not significant on 21 (r = −0.026, p-value = 0.62) and 35 DAMS (r = 0.023, p-value = 0.66) (Figure 4B). Under low N conditions, results showed significantly larger positive correlations on 11 (r = 0.19, p-value = $2.9 \times 10^{-4}$) and 21 DAMS (r = 0.25, p-value = $2.1 \times 10^{-6}$), whereas the correlation on 35 DAMS was not significant (r = 0.08, p-value = 0.13) (Figure 4C). These results are consistent with the finding that Massilia putida promotes plant growth under low N conditions.
Discussion
In this study, we developed a microbiome-enabled genomic selection (MEGS) method and provided empirical evidence that incorporating microbiome data led to a significant increase (about 4%) in prediction accuracy for most traits extracted from UAV images across different dates. We acknowledge that there might be correlations between SNPs (host genome) and ASVs (microbiome), leading to multicollinearity, but this will not affect the screening for beneficial ASVs as functional markers for plant breeding. As the cost of obtaining ASV data through 16S rRNA sequencing decreases with advances in sequencing technology, MEGS presents an unprecedented opportunity to predict complex traits, such as N or water usage efficiency, by utilizing ASVs as functional markers.
We analyzed the prediction accuracy under high and low N conditions separately and discovered that microbiome data were more beneficial under the latter. This is consistent with the idea that plants may require specific microbes to facilitate nutrient absorption when soil N is inadequate. Our mediation analysis, which considered the microbes as intermediaries between plant genotype and plant phenotype, pinpointed five microbe mediators. Among these, two mediator microbes, namely Massilia putida and Burkholderia pseudomallei, played crucial roles in predicting the plant phenotype in the MEGS analysis. Notably, Massilia putida was enriched in the rhizosphere and is thought to enhance plant growth and N uptake by inducing lateral root formation under low N conditions (Yu et al. 2021). We also detected several other large-effect ASVs, such as Pseudomonas parafulva and Bacillus fumarioli, some of which have been linked to plant development (Oteino et al. 2015; Preston 2004; Kumar et al. 2012). However, further phenotypic and functional validation of these identified microbe mediators and large-effect microbes is necessary to reveal their effects on the host plants.
Due to computational constraints, we randomly sampled a subset of SNPs in low LD across the whole genome to fit the MEGS models. To determine the influence of the number of SNPs on prediction accuracy, we performed a sensitivity test and found no significant difference in prediction accuracy when randomly sampling 10k, 25k, or 50k SNPs out of the 770k low-LD SNPs. The prediction accuracy was reasonably high, ranging from 52.1% to 76.4% for both SNP-only and SNP-ASV models across different dates (Figure S4). To maintain high fidelity with the currently available computational resources, we chose 50k SNPs for all traits and dates in the study. However, more computationally efficient methods will be necessary to handle large SNP datasets in the future. Although the present study achieved reasonable prediction accuracy with a limited set of SNPs and the whole set of ASVs, more comprehensive datasets may further improve the accuracy of the MEGS models.
In summary, our findings suggest that incorporating ASVs into the GS model enhances prediction accuracy compared to the traditional SNP-only model, particularly for traits collected under low N conditions and at earlier time points after microbiome sampling in the field. However, the overall prediction accuracy for different dates decreases as plants approach senescence, indicating that the microbiome, as a functional marker, is sensitive to both time and environment. To successfully implement MEGS in a breeding program, it is crucial to carefully design experiments for in-field microbiome data collection to ensure the quality of prediction results.
Data and Code Availability
The data and code used for the analyses can be accessed through GitHub (https://github.com/ZhikaiYang/GP_microbiome).
Conflicts of interest
The authors declare no competing interests.
Supplemental Tables
Table S1 Top 1% ASVs detected by the MEGS model. (https://github.com/ZhikaiYang/GP_microbiome/blob/master/data/supplementary/Supplemental_Table_S1_top_one_percent_asvs.txt)
Table S2 Mediators ASVs identified by mediation analysis. (https://github.com/ZhikaiYang/GP_microbiome/blob/master/data/supplementary/Supplemental_Table_S2_mediator_asvs.txt)
Supplemental Figures
Acknowledgements
This work is supported by the Agriculture and Food Research Initiative Grant number 2019-67013-29167 and 2022-67013-3656 from the USDA National Institute of Food and Agriculture. This work is conducted using the Holland Computing Center of the University of Nebraska-Lincoln Start-up, which receives support from the Nebraska Research Initiative.