Construction of transcript regulation mechanism prediction models based on binding motif environment of transcription factor AoXlnR in Aspergillus oryzae

Recent study revealed that there are thousands of genes that remain unaffected by increased AoXlnR expression, despite the presence of one or more AoXlnR-binding motifs in their promoter region. Given this knowledge, we designed this study to construct several predictive models for determining whether a gene can exhibit a differential response to changes in AoXlnR expression. These models were constructed using 3D DNA shape information determined using the sequence around the AoXlnR binding motifs with classification as functional or nonfunctional. These models were created using a support vector machine followed by the evaluations designed to determine whether these DNA shape-based models can correctly classify functional motifs in terms of area under the receiver operating characteristic curve. The results showed that the differential expression levels of genes located downstream of the AoXlnR motif are closely related to specific DNA shape information around the binding motifs. Furthermore, we found that the parameters contributing to differential expressions differed depending on the number of motifs in the promoter region by comparing the prediction models using regions with only one binding DNA motif and those with multiple binding DNA motifs. Author Summary DNA-binding transcription factors (TFs) play a central role in transcriptional regulation mechanisms, mainly through their specific binding to target sites on the genome and regulation of the expression of downstream genes. Therefore, a comprehensive analysis of the function of these TFs will lead to the understanding of various biological mechanisms. However, the functions of TFs in vivo are diverse and complicated, and the identified binding sites on the genome are not necessarily involved in the regulation of downstream gene expression. In this study, we investigated whether DNA structural information around the binding site of transcription factors can be used to predict the involvement of the binding site in the regulation of the expression of genes located downstream of the binding site. Specifically, we calculated the structural parameters based on the DNA shape around the DNA binding motif located upstream of the gene whose expression is directly regulated by the transcription factor AoXlnR from Aspergillus oryzae, and showed that the presence or absence of expression regulation can be predicted from the sequence information with high accuracy by machine learning incorporating these parameters.


Results
114 Construction of AoXlnR binding motif classification models using DNA shape 115 parameters 116 We identified 51 genes likely to be directly regulated by AoXlnR by 117 integrating information of the binding sites in A. oryzae genome obtained from 118 gSELEX-Seq [6] and the differential expression data obtained from the microarray 119 completed in a previous study [18]. The 194 DNA shape parameters (MGW:48, 120 ProT:48, HelT:49, and Roll:49) were used as explanatory valuables for constructing 121 classifying AoXlnR binding motifs as MPD (63 motifs) or MPnD (1506 motifs). Model 122 performances were evaluated with their AUC values (Fig. 2, Supplementary Fig 1). 123 AUC for the model constructed with selected DNA shape parameters was higher than 124 those for the models that used all the parameters. The selection of contributed DNA 125 shape parameters improved the performance of the prediction. When the same 126 procedure was applied to construct models using the antisense sequences of the AoXlnR 127 binding motifs, the obtained performances using selected shape features were 128 significantly higher than that using all shape features was the case with the original 129 AoXlnR-binding motif sequences (Fig. 2, Supplementary Fig. 1).
130 131 Stratification of differential expression prediction models based on the number of 132 binding motifs in promoters 133 The number of recognition sequences in a promoter was suggested to be 134 associated with its responsiveness to that TF previously [6]. Therefore, we stratified 135 MPDs according to the number of AoXlnR motifs present in their promoter regions, 136 separating them into two groups: one motif and two or more motifs. 137 The DNA shape parameters of the classified datasets were calculated and then 138 used to produce the prediction models as described above. When these models were 139 compared, we found that the models constructed using the top parameters with the 140 highest contribution to the expression differences showed significantly higher AUC 141 values than the models constructed using all the parameters, as in the case of using the 142 non-stratified MPDs (Fig. 3, Supplementary Fig. 2). Additionally, the model 143 constructed using the stratified dataset showed higher AUC values than those 144 constructed using the unclassified datasets. In particular, the classification by the 145 number of motifs tended to more highly contribute to increased AUC values when

150
This study was designed to evaluate the underlying mechanisms regulating 151 gene expression in response to A. oryzae TF, AoXlnR, from the viewpoint of DNA 152 conformation around the binding sites. We used datasets containing information on the 153 binding sites of AoXlnR in the A. oryzae genome and their differential expression in 154 response to the overexpression of this TF to construct mathematical models explaining 155 the differential regulation of AoXlnR binding sites [6,18]. 156 We first determined the impact of each of the 194 parameters for each AoXlnR 157 recognition region using 3D DNA shape information. These values were then used to 158 construct individual prediction models, followed by applying to predict responsiveness 159 to this TF. However, the AUCs of the models constructed using SVM were 0.644 and 160 0.522 for the sense and antisense strand sides, respectively, indicating that the accuracy 161 of the constructed models was insufficient to reliably distinguish between DEG and 162 non-DEG promoters (Fig. 2, Supplementary Fig. 1). These results suggest that the 163 inclusion of parameters with small contribution to the differential expressions via 164 AoXlnR prevented the construction of accurate prediction models.

165
Thus, we selected the top parameters in the order of their coefficient 166 contributing to the discrimination of differential expression to reduce the number of 167 explanatory variables used in the prediction models. These models were then 168 reconstructed using two to ten selected parameters. These models displayed AUCs of 169 more than 0.7 marking a significant improvement compared to the above models using 170 all parameters (Fig. 2, Supplementary Fig. 1). It should be noted that certain DNA 171 structural parameters were included in most of the models with higher AUCs. In fact, 172 one of the parameters, down_17_ProT, that is determined by the sequences around 17 b 173 downstream of the AoXlnR binding motifs, was preferentially selected. In the case of 174 model construction using the antisense strand of the motif, the DNA shape parameters 175 related to 15 b upstream of the binding motifs were observed at a relatively high 176 frequency. These preferentially selected parameters were calculated from the structure 177 of sequences outside of the conventional binding motif sequences, 5'-GGCTGA-3' or 5'-178 GGCTAA-3'. This fact strongly suggests the existence of a regulatory mechanism for 179 the expression of AoXlnR, which has not been revealed by conventional analyses based 180 only on TF-binding motif information. We are currently planning to introduce 181 mutations in the promoter region to alter the down_17_ProT parameters to understand 182 how this impacts gene regulation. 183 Next, we further classified the MPDs by the presence of other motifs in the 184 same promoter region, and then subjected the classified dataset to the construction of a 185 new set of models, followed by the evaluation of the performance of resultant models.
186 The results demonstrated that models constructed using the selected parameters which 187 contributed to the model performance showed higher AUC than the models constructed 188 under conditions with no classification by the number of motifs, including those using 189 antisense strands (Figs. 3, Supplementary Fig. 2). We also randomly stratified the In the models based on the DNA shape parameters around the binding motifs in 203 promoters of sense strand including only one motif, down_23_ProT was preferentially 204 selected as in all MPD-based models (Supplementary Fig. 2A), while up_17_Roll was 205 preferentially selected in the models produced for promoter with several binding motifs 206 ( Supplementary Fig. 2C). In contrast, the models produced using the antisense strand 207 showed a preference for up_17_ProT or up_11_HelT where the models were 208 constructed for single or multiple motif entries, respectively (Supplementary Figs. 2B 209 and D). The difference in the preferred DNA shape parameters among these models 210 resulted in improved AUCs, suggesting that the regulatory mechanism of AoXlnR in 211 cells depends on the number of motifs in the promoter region.

212
Interestingly, xynF1 (AO090103000423), which is known to be under the 213 control of AoXlnR [21], was more accurately discriminated as a gene regulated by 214 AoXlnR when the sequence around the MPD in the promoter was evaluated using 215 models constructed from MPDs with multiple motifs (100%) compared to the models 216 without classification (10%). On the other hand, randomly classified models provided 217 around 50% of xynF1's correct discrimination rate, which is much lower than that of the 218 models using dataset of MPDs with multiple motifs (data not shown). Taken together, 219 these results suggest that the mechanisms of expression regulation by AoXlnR in cells 220 can be classified by the number and orientation of the AoXlnR binding motifs.

221
The DNA shape information used in this study was recently reported to play an In our previous study, we divided the set of detected genes of A. oryzae into 256 two groups based on their differential expression in response to over-expression of 257 AoXlnR: 1) genes that retain one or more AoXlnR-binding sites in their promoter 258 regions and are expression-responsive to AoXlnR and 2) genes that retain AoXlnR 259 binding sites in the promoter regions but were not sensitive to AoXlnR overexpression 260    We constructed a set of supervised learning models to predict whether 287 AoXlnR-binding motifs, 5'-GGCT(G/A)A-3' in each gene promoter is in a DEG or non-288 DEG promoter region. The expression profiles of each of the 5'-GGCT(G/A)A-3' motif 289 were determined using unpublished microarray analysis data of AoXlnR overexpression 290 strain TFX2 and AoXlnR disruption strain SK253 conducted in a previous study [18]. 291 In this study, we used a support vector machine (e1071 package; version 1.7-3; 292 https://cran.r-project. org/web/packages/e1071) on R (version 4.0.1; https:// 293 cran.ism.ac.jp/) as the binary classification algorithm.

294
To balance the difference in size between the number of MPD and MPnD, we 295 selected the same number of MPnDs that showed the lowest response to AoXlnR 296 overexpression as MPDs and synthesized a 50 b dataset consisting of the same number 297 of MPDs and MPnDs. The expression variables, the DNA shape parameters were 298 tagged with their labels; 1 for MPD, 0 for MPnD. The performance of prediction model 299 was validated using leave-one-out cross validation. The area under the receiver 300 operating characteristic curve (AUC) was used as the model evaluation criteria. These 301 AUC values were calculated as the average from 10 sampling rounds.

302
The practical flow of building a prediction model is briefly described below 303 (Fig. 1). First, two gene expression profiles were compared to make two categories of 304 genes with AoXlnR-binding sites: the AoXlnR overexpression responsive genes, and 305 AoXlnR overexpression non-responsive genes. Second, from the gene lists, the MPD 306 and MPnD was searched. Third, from each MPD or MPnD, their sequence data of 50 b 307 was converted into the DNA shape parameters using DNAshapeR. Fourth, using these 308 datasets, model for predicting MPD or MPnD was constructed. In the construction of 309 prediction model, the DNA shape parameters that significantly contribute to the 310 differential expression of genes with the AoXlnR were ranked by calculating parameter 311 coefficients from the SVM linear kernel, and selected parameters with a higher 312 coefficient. Those prediction models were constructed using either (a) all DNA structure 313 parameters, or 2 to 10 selected parameters (2 to 10 parameters) in the order of their 314 coefficient (Supplementary Fig 1). Additionally, we constructed and evaluated a set of 315 predictive models using datasets where MPD and MPnD data were further stratified on 316 the number of motifs per promoter, i.e., one or more than one, using the same procedure 317 described above. In parallel, we randomly stratified the dataset including DNA shape 318 parameters of MPD as the same procedure with the preparation of the stratified dataset 319 considering the number of motifs. These randomly stratified dataset was used to 320 construct prediction models (Supplementary Fig. 2).