Genome-wide mediation analysis: bridging the divide between genotype and phenotype via transcriptomic data in maize

Mapping genotype to phenotype is an essential topic in genetics and genomics research. As the Omics data become increasingly available, genome-wide association study (GWAS) has been widely applied to establish the relationship between genotype and phenotype. However, signals detected by GWAS usually span broad genomic regions with many underneath candidate genes, making it challenging to interpret and validate the molecular functions of the candidate genes. Under the context of genetics research, we hypothesized a causal chain from genotype to phenotype partially mediated by intermediate molecular processes, i.e., gene expression. To test this hypothesis, we applied the high dimensional mediation analysis, a class of causal inference method with an assumed causal chain from the exposure to the mediator to the outcome, and implemented it to the maize diversity panel (N=280 lines). Using 40 publicly available agronomic traits, 66 newly generated metabolic traits, and published RNA-seq data from seven different tissues, we detected N=736 unique mediating genes, explaining an average of 12.7% phenotypic variance due to mediation. Noticeably, 83/736 (11%) genes were identified in mediating more than one trait, suggesting the prevalence of pleiotropic mediating effects. Among those pleiotropic mediators, benzox-azinone synthesis 13 (Bx13), a well-characterized gene encoding a 2-oxoglutarate-dependent dioxygenase, was identified mediating 40 agronomic and metabolic traits in different tissues. Further genetic and genomic analyses of the Bx13 and adjacent mediating genes suggested a 3D co-regulation modulation likely affect their expression levels and eventually lead to phenotypic consequences. Our results suggested the genome-wide mediation analysis is a powerful tool to integrate Omics data in providing causal inference to connect genotype to phenotype.


INTRODUCTION
Modeling metabolic and agronomic traits using the high-dimensional genome-wide mediation analysis 48 We conducted the high dimensional mediation analysis, one of the causal inference methods, to model 49 association between genotype (exposure variable, Z) and phenotype (outcome variable, Y) by including a 50 third hypothetical variable, gene expression, as the mediator (M) [12]. In the analysis, the assumed causal 51 relationships are that genomic variant contributes to phenotypic variantion either directly (Z → Y) or indirectly 52 through gene expression mediation (Z → M → Y), or both (Z → Y and Z → M → Y) (Figure 1). 53 To carry out the analysis, we employed the maize diveristy panel [14], because genotypic [15], phenotypic 54 [16], and transcriptomic [17] datasets for this population were publicly available. Additionally, we collected the 55 above-ground tissues from the two-week-old seedlings and conducted gas chromatography-mass spectrometry

Exposure (Z)
A  Figure S2) [14]. To account for the population structure, we fitted the first three principal components 62 computed from the genome-wide markers to the mediation model as confounders (see materials and methods). 63 For the analysis, we tested each normalized agronomic and metabolic trait against each of the seven RNA-seq 64 data obtained from different tissue types as the mediators to conduct a total of N=742 (106 traits × seven   = 0.029).

108
We considered genes mediating for more than one trait as pleiotropic mediating genes. In total, 83/736 109 (11%) pleiotropic mediating genes were identified, nine of which controlled more than five traits ( Figure   110 4A). Interestingly, 4/9 of these highly pleiotropic mediators mapped to the gene models with previously  Table S4 to find detailed traits mediated by these genes).

115
Notably, three of these highly pleiotropic genes SFR2, Bx13, and Zm00001d007725 were colocalized in a 116 physically proximate region on chromosome two ( Figure 4B); and they mediated 7 (1/6), 40 (15/25), and 8 (5/3) 117 phenotypic traits (agronomy/metabolite), respectively. The low linkage disequilibrium (LD) among these three 118 genes contradicted our initial hypothesis that these genes might be blocked in the same haplotypes ( Figure   119 4C). The further study identified interactive loops around the regions using HiChIP data [27], suggesting these 120 three pleiotropic genes might share the same regulatory modulation ( Figure 4B). The correlation coexpression 121 patterns of these three genes were strong in leaf base and tip at 3rd leaf stage tissues (Fig S5). Consistently,

122
we detected significantly different gene expression patterns for SFR2 and Bx13 in leaf base and tip between 123 lines carrying major and minor alleles (see Figure 4D for gene expression differences in other tissues). Taken

150
In this study, we conducted high dimensional mediation analysis and identified a number of mediating genes 151 that were previously reported to be associated with relevant traits. Some of the identified mediating genes For a population composed of n inbred lines, assume each of them contains p genes with expression data and 237 has been genotyped for q SNPs. We fitted each gene separately with gene expression level as the response 238 variable in the mediator model, the first several PCs as covariates, and all the SNPs as explanatory variables.

239
The model is shown as below: represents the whole SNP set of the population composed of n individuals with q number of SNPs; B j (a q × 1 244 vector) is the coefficients of the qth SNPs to the jth gene (j = 1, ..., p); and e j is the vector the residual errors 245 with e j ∼ N(0, σ 2 j I n ).

246
Additionally, we fitted all the SNPs and the mediating genes in the outcome model, which uses phenotype 247 as the response variable, as shown below.

248
In the model, 250 where, y (a n × 1 vector) represents the phenotype (here we used the BLUP value of a given trait); Q (a n × s 251 matrix) and Z (a n × q matrix) are the same matrices as model (1); v (a s × 1 vector) is the coefficients of first s 252 PCs to phenotype; a (a q × 1 vector) denotes the coefficients of the SNPs to phenotype; M (a n × p matrix) is 253 the expression levels of the p genes (i.e., a matrix combines all the M j ); c (a p × 1 vector) is the coefficients of 254 gene expressions to phenotype; e is the vector the residual errors with e ∼ N(0, σ 2 I n ).

255
Proportion of the variance mediated 256 To evaluate the relative contribution of different factors to phenotypic variance, we quantified the variance 257 due to different components as described in Qi's study [12]. Here, we partitioned the variance into two major And then, we obtained the proportion of the variance mediated (PVM) as below.
Genome wide association study 266 We conducted GWAS using the GEMMA software [38] with 22.5 million SNPs. In the analysis, we fitted the first 267 three principle components (PCs) as covariates calculated from genome-wide SNPs using the PLINK software.

268
To account for the genetic relatedness, we fitted the centered identical by state kinship matrix generated by   In the legend, "mixed" denotes the mixed inbreds; "nss" denotes Non-Stiff Stalk; "popcorn" denotes pop-corn; "ss" denotes Stiff Stalk; and "sweet" denotes sweet corn; "ts" denotes Tropical/Subtropical lines.   For each plot, the upper diagonal shows the Pearson correlation coefficient and lower diagonal shows the scatter plots of the two genes' expression levels. On the diagonal shows the histogram of the gene expression distribution. Asterisk denotes the statistical significance with *, p-value < 0.05, **, p-value < 0.01, and ***, p-value < 0.01.