Gene regulatory network inference in soybean upon infection by Phytophthora sojae

Phytophthora sojae is a soil-borne oomycete and the causal agent of Phytophthora root and stem rot (PRR) in soybean (Glycine max [L.] Merrill). Yield losses attributed to P. sojae are devastating in disease-conducive environments, with global estimates surpassing 1.1 million tonnes annually. Historically, management of PRR has entailed host genetic resistance (both vertical and horizontal) complemented by disease-suppressive cultural practices (e.g., oomicide application). However, the vast expansion of complex and/or diverse P. sojae pathotypes necessitates developing novel technologies to attenuate PRR in field environments. Therefore, the objective of the present study was to couple high-throughput sequencing data and deep learning to elucidate molecular features in soybean following infection by P. sojae. In doing so, we generated transcriptomes to identify differentially expressed genes (DEGs) during compatible and incompatible interactions with P. sojae and a mock inoculation. The expression data were then used to select two defense-related transcription factors (TFs) belonging to WRKY and RAV families. DNA Affinity Purification and sequencing (DAP-seq) data were obtained for each TF, providing putative DNA binding sites in the soybean genome. These bound sites were used to train Deep Neural Networks with convolutional and recurrent layers to predict new target sites of WRKY and RAV family members in the DEG set. Moreover, we leveraged publicly available Arabidopsis (Arabidopsis thaliana) DAP-seq data for five TF families enriched in our transcriptome analysis to train similar models. These Arabidopsis data-based models were used for cross-species TF binding site prediction on soybean. Finally, we created a gene regulatory network depicting TF-target gene interactions that orchestrate an immune response against P. sojae. Information herein provides novel insight into molecular plant-pathogen interaction and may prove useful in developing soybean cultivars with more durable resistance to P. sojae. Author Summary Global food security is threatened continually by plant pathogens. One approach to circumvent these disease-causing agents entails understanding how hosts balance primary growth and defense upon pathogen perception. Molecular signatures of perception-rendered defense may be leveraged subsequently to develop resistant/tolerant crop plants. Additionally, evidence suggests that the plant immune system is characterized by tuning primary and secondary metabolic activity via transcription factor-mediated transcriptional reprogramming. Therefore, we investigated transcription factor-target gene interactions in soybean upon infection by compatible and incompatible races of Phytophthora sojae. Through transcriptome analysis, we found that the interactions elicited vast, overlapping transcriptional responses and identified overrepresented, defense-related transcription factor families. We then generated/acquired DNA-protein interactome data for the most represented transcription factor families in the transcriptome analysis and trained deep learning-based models to predict novel transcription factor targets. Transcription factor/target gene metrics were used to construct a gene regulatory network with prioritized components. We identified hub transcription factors belonging to WRKY and ERF families, the majority of which function in response to various biotic and abiotic stressors. These findings propose novel regulators in the soybean defense response to Phytophthora sojae and provide an avenue for the investigation of transcription factor-target gene interactions in plants.


153
In the present study, we coupled transcriptomics, in vitro TF-DNA interaction profiling, 154 and deep learning to construct a GRN underlying the soybean defense response to P. sojae 155 infection (Fig 1). We first inoculated hypocotyls of soybean variety Williams 82 (possesses Rps1k) 156 with mycelial slurries from P. sojae Races 1 or 25, rendering incompatible and compatible 157 interactions, respectively. Transcriptomes were generated from the hypocotyls, and differential 158 gene expression analysis was performed with expression profiles from a mock inoculation serving 159 as a baseline. Following RNA capture-based validation of the experimental design, we assessed 160 TF representation in the differentially expressed gene (DEG) set. The biological significance of 161 overrepresented TF families was inferred by clustering DEGs, assigning functional annotations to 162 each cluster, and observing TF family representation across defense-related clusters. Next, using 163 DAP-seq data for WRKY and RAV TFs differentially expressed in our DEG set, we obtained 164 promoter-localized TFBS to train DNN models for the prediction of novel WRKY and RAV 165 targets. Furthermore, cross-species target prediction was performed for MYB, WRKY, NAC, ERF, 166 and bHLH TF families using DNN models trained with available Arabidopsis DAP-seq data. We 167 observed the representation of predicted targets in our DEG set and used highly confident TF- 178 At 24 hrs post-infection (hpi), the hypocotyls were collected and used to generate 13 RNA-seq 179 libraries (4 Mock, 4 R1, and 5 R25) spanning two independent inoculations, RNA isolations, and 180 sequencing runs. Additional seedlings were maintained seven days post-infection (dpi) to compare 181 disease development across treatments (Fig 2a) [50]. Collectively, the RNA-seq samples 182 comprised over 560 million 100-bp paired-end reads with a mean mapping rate of 95% (Table   183 S1). Principal Component Analysis demonstrated that samples clustered according to treatment 184 and sequencing event. To circumvent the latter, we used ComBat-seq [51], a negative binomial 185 regression model, to correct batch effects. We then performed differential gene expression analysis 186 and removed any genes with expression that had been significantly changed (False Discovery Rate 187 < 0.05) between the two batches. Furthermore, we checked the expression of six genes that 212 found 447 (7.3% of DEGs) distributed across 43 TF families. Among these families, WRKY was 213 the most represented by total abundance (n = 65 encoding DEGs), followed by ERF (n = 56), 214 bHLH (n = 44), MYB (n = 37), and C2H2 (n = 31) (Fig 2c). The RAV TF family was most 215 represented by the percentage of genome-wide proportion (60%) (Fig 2c). Moreover, proportions 216 of WRKY, ERF, HSF, CAMTA, and RAV TF-encoding genes were significantly enriched in the 217 DEG set (hypergeometric p-value < 0.05).

218
To predict functions of the defined TF families in the present pathosystem, K-Means 219 clustering was used to segregate DEGs into nine co-expression clusters. In clusters 3 and 6 220 (hereafter "down-regulated clusters''), both compatible and incompatible interactions showed 221 reduced expression in comparison to Mock, with mean expression higher in the incompatible 222 interaction than in the compatible (Fig 2d). A reciprocal pattern was observed in clusters 1, 2, 4, 223 5, 7, 8, and 9 (hereafter "up-regulated clusters''), wherein the majority of genes were up-regulated 224 compared to the Mock, and the compatible interaction displayed the highest mean expression (Fig   225 2d). We explored these trends by assigning functional annotations to each gene cluster with GO  (Fig 2e). Furthermore, four TF families were represented in more than half of 241 the up-regulated clusters, with ERF present in 7/7, WRKY in 6/7, C2H2 in 4/7, and MYB in 4/7.  (Fig 4), using peak 289 summits of WRKY30_P1 and WRKY30_M1 samples with either 32-or 201-bp peak regions (Fig   290 S2). The CRNN with a 201-bp region outperformed the CRNN with a 32-bp peak region and 291 displayed an 89% validation accuracy, 90% test accuracy, and a false positive rate of less than 3% 292 ( Table 1). We trained a similar model for GmRAV, which had an 89% validation accuracy, 89% 293 test accuracy, and a less than 3.5% false positive rate ( Table 1). Moreover, for both GmWRKY30 294 and GmRAV CRNN models, the area under the receiver operating characteristic (auROC) curve 295 was beyond 0.88, and the area under the precision-recall curve (auPRC) beyond 0.81 (Fig S3). To 296 determine if a CRNN model trained for one TF could generalize to members of the same family 297 intraspecifically, we generated AmpDAP-seq data for GmWRKY2 (homologous to AtWRKY2 298 and encoded by Glyma.06G320700). For this sample, we first observed peak distribution across 299 genomic features (Fig S4; Data S5) and used MEME-ChIP to verify statistical enrichment of the 300 W-box CRE within peak regions. We then used peak regions to test if the GmWRKY30 CRNN 301 could predict GmWRKY2-bound sites. The prediction accuracy was above 82%, with a false 302 positive rate of less than 6% (Fig 5a). Furthermore, we explored the interspecific generalization ). For this analysis, the prediction accuracy was above 306 84% with a false positive rate of less than 7% (Fig 5a). 318 and a mean false positive rate of 1.53% (Fig 5b; Data S6). Similarly, the three AtNAC models 319 posed validation accuracies between 95.5-98%, test accuracies between 95-98%, and a false 320 positive rate ranging between 0.6-1.79% (Table 2). These models were used for predicting binding 321 sites for 15 other AtNAC TFs [66,67] with a mean prediction accuracy of 79.6% and a mean false 322 positive rate of 1.09% (Fig 5b; Data S6). These findings indicated that a model trained using the 323 TFBS of one family member could predict the TFBS of another member with reasonable accuracy.

307
324 Therefore, GmWRKY30 and GmRAV models were used to predict TFBS on promoters of the 325 DEGs from the transcriptome analysis (Fig 5c).  (Table 3). Both AtRAV and AtC2H2 models had training, validation, and 342 testing accuracies of less than 86% with increased false negative rates and were thus not used for 343 subsequent analyses (Table 3). Next, we generated AmpDAP-seq data for GmMYB61 (encoded 344 by Glyma.10G142200) (Fig S4; Data S5) and used AtWRKY and AtMYB CRNNs to perform 345 cross-species predictions on GmWRKY30 DAP-and GmMYB61 AmpDAP-seq data, 346 respectively. Both predictions demonstrated modest accuracies (approximately 61%), yet less than 347 1% false positive rates ( Table 4). We posited that, while Arabidopsis-to-soybean predictions 348 would likely miss some true TFBS, we could have confidence in those classified as bound.
349 Therefore, we used AtMYB, AtERF, AtNAC, and AtbHLH CRNNs to predict TFBS for promoter 350 regions of our DEGs (Fig 5c). Consistent with soybean CRNNs, all of the predicted TFBS were 351 overlapped with FIMO predictions to get a highly confident set of targets.
352  GRN inference underpinning host defense 356 The combination of soybean and Arabidopsis CRNNs allowed the prediction of TFBS 357 corresponding to 5,505 genes in the DEG set (Fig 6a). Global and family-level GRNs were thereby 358 constructed with TFs represented by nodes and target genes by edges (Fig 6b; Fig S5). ) and had to have corresponding genes expressed in the transcriptome analysis (Fig 6d). The 363 118 nodes meeting these criteria were prioritized by degree centrality, TF co-occurrence, and the 364 expression pattern of corresponding genes. Degree centrality was determined from outdegree 365 (number of edges directed to each node) and cumulative indegree (indegree = number of nodes to 366 which an edge is directed; cumulative indegree = combined indegree for all edges of a node). Both 367 measures were scale-free and displayed a power-law distribution (Fig 6d)  and selected cosine association score as 371 the objective similarity measure for co-occurring TF pairs (Fig 6c). Cumulative cosine (total 372 cosine association score across all co-occurrences) was determined for each node. Lastly, we 373 calculated the mean |log 2 FC| for node-corresponding genes across both interactions (R1 vs Mock 374 and R25 vs Mock) for further prioritization. Nodes/node-corresponding genes present in the upper 375 quartile for all four parameters were considered hubs (Fig 6d). A reciprocal approach prioritized 376 edges by indegree, cumulative outdegree (combined outdegree of all nodes to which an edge is 377 directed), sum cumulative cosine (combined cumulative cosine of all nodes to which an edge is 378 directed), and mean |log 2 FC| across both interactions (Fig S6).

379
Hub nodes corresponded to 14 genes encoding 13 TFs, all of which belonged to ERF and 380 WRKY families (Fig 6d,e). Thirteen of these were differentially expressed in the transcriptome 381 analysis and were present in co-expression clusters 1, 4, 5, and 8 (up-regulated clusters with 382 defense-related functional annotations). Furthermore, we assessed KEGG annotations for putative 383 hub node targets (n = 1,664), and 170 targets were annotated to one or more functions.  (Fig 6d)