iEnhancer-CLA: Self-attention-based interpretable model for enhancers and their strength prediction

Enhancers are a class of non-coding DNA cis-acting elements that play a crucial role in transcription during the development of eukaryotes. Computational methods for predicting enhancers have been developed and achieve satisfactory performance. However, existing computational methods suffer from experience-based feature engineering and a lack of interpretability, which not only limits the representation ability of the models to some extent, but also makes it difficult to provide an interpretable analysis of the models' predictions. In this paper, we propose a novel deep-learning-based model, iEnhancer-CLA, for identifying enhancers and their strengths. Specifically, iEnhancer-CLA automatically learns one-dimensional sequence features through multiscale convolutional neural networks (CNNs) and employs a self-attention mechanism to represent global features formed by multiple elements (multibody effects). In particular, the model can provide an interpretable analysis of enhancer motifs and key base signals by decoupling the CNN modules and generating self-attention weights. To avoid the bias of setting hyperparameters manually, we use Bayesian optimization to obtain globally optimal model hyperparameters. The results demonstrate that our method outperforms existing predictors in terms of accuracy for identifying enhancers and their strengths. Importantly, our analyses found that the distribution of bases in enhancers is uneven, with base G more enriched, whereas the distribution of bases in non-enhancers is relatively even. This finding contributes to the improvement of prediction performance and thus helps reveal an in-depth understanding of the potential functional mechanisms of enhancers.

Author summary

Enhancers comprise many subspecies, and the accuracy of existing models is difficult to improve because of the small data sets. Motivated by the need for accurate and efficient methods to predict enhancer types, we developed a self-attention deep learning model, iEnhancer-CLA, which aims to distinguish effectively and quickly between subspecies of enhancers and between enhancers and non-enhancers. The model learns sequence features effectively through the combination of multiscale CNN blocks, BLSTM layers, and self-attention mechanisms, thus improving accuracy. Encouragingly, by decoupling the CNN layer we found that this layer effectively learns the motifs of the sequences, which, in combination with the self-attention weights, provides interpretability for the model. We further performed sequence analysis in conjunction with the model-generated weights and discovered differences between enhancer and non-enhancer sequence characteristics. This observation can guide the construction of subsequent models for identifying enhancer sequences.

Traditional enhancer prediction is mainly conducted through biological experiments, such as comparing enhancers with TFs [15][16][17]. However, identifying enhancers through experimental techniques is time consuming and inefficient. Thus, many computational methods, such as CSI-ANN [18], ChromeGenSVM [19], and RFECS [20], have been proposed in recent years to predict enhancers and mitigate the limitations of experimental methods.

Enhancers are a large set of functional elements consisting of many different subgroups [21], such as strong enhancers, weak enhancers, inhibitory enhancers, and inactive enhancers. Early computational methods, such as EnhancerFinder, GKM-SVM, and DEEP [22][23][24], were unable to identify the strength of enhancers. iEnhancer-2L [25] was the first model proposed to solve the enhancer strength prediction problem, using a two-layer architecture. Its accuracy is difficult to improve effectively because of the small dataset and insufficient optimization of the base classifier. Although many models, such as EnhancerDBN [26], iEnhancer-EL [27], iEnhancer-5Step [28], and iEnhancer-ECNN [29], have been proposed to improve the accuracy of predicting enhancers and their strengths, the accuracy improvement has not been significant. The latest model, iEnhancer-GAN [30], uses generative adversarial networks to compensate for the small datasets and thereby significantly improve enhancer recognition accuracy. However, current machine learning methods are still in the black-box learning stage, and it is difficult to provide an interpretable analysis of the models. Although our previous work iEnhancer-XG [31] provided an interpretable analysis of the model's feature extraction, there is still much room for improving the interpretability of models and sequences.

In this study, a novel deep learning framework called iEnhancer-CLA is proposed to identify enhancers and their strengths. The model provides interpretability by decoupling the CNN layers to extract position weight matrices (PWMs) and by generating self-attention weights. The model could also be extended to predict more enhancer subtypes in future work.

In this framework, a multi-head self-attention mechanism is used to assess the contribution of sequence regions by multiple heads (heads = 6), generating self-attention weights for each sequence during prediction. Stacking deep learning modules at different scales gives the model the ability to fit more complex higher-order functions.

The multi-head self-attention model was first established for natural language processing, proposed by [32]. This approach mainly addresses the problem that the overall semantics of a sentence are composed of multiple components. On the basis of the attention weights, one can analyze which part of a sequence determines whether it is an enhancer and how strong the enhancer is. To obtain these weights, the attention matrix A is first calculated as in Eq (1):
$$A = \mathrm{softmax}\left(W_{s2}\tanh\left(W_{s1}H^{T}\right)\right) \qquad (1)$$

where H is the 50-by-32 matrix of hidden states output by each BLSTM layer; W_s1 is a weight matrix of shape D-by-32, where D is the attention dimension hyperparameter, set to 60; and W_s2 is a parameter matrix of shape heads-by-D, where heads denotes the number of attention heads, set to 6. W_s1 and W_s2 are given the same weight decay value to overcome overfitting and obtain sparse attention scores. The embedding matrix M, calculated as a weighted sum by multiplying A and H as in Eq (2), is retained for further prediction:

$$M = AH \qquad (2)$$
When training the model, the following penalty term P in Eq (3) is introduced:

$$P = \left\| AA^{T} - I \right\|_{F}^{2} \qquad (3)$$
where A is the attention matrix, I is an identity matrix, and ‖·‖_F denotes the Frobenius norm of a matrix. The greater the similarity between these two matrices, the smaller the penalty; this encourages different attention heads to focus on different regions of the sequence.
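To make Eqs (1)-(3) concrete, here is a minimal PyTorch sketch of this structured self-attention computation with the shapes stated above (H: 50-by-32, D = 60, heads = 6); it follows the formulation of [32] and is not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredSelfAttention(nn.Module):
    """Structured self-attention of Lin et al. [32] with the shapes
    stated above (H: 50x32, D = 60, heads = 6); a minimal sketch, not
    the authors' exact implementation."""
    def __init__(self, hidden=32, d_attn=60, heads=6):
        super().__init__()
        self.W_s1 = nn.Linear(hidden, d_attn, bias=False)  # W_s1: D-by-32
        self.W_s2 = nn.Linear(d_attn, heads, bias=False)   # W_s2: heads-by-D

    def forward(self, H):                     # H: (batch, 50, 32)
        # Eq (1): A = softmax(W_s2 tanh(W_s1 H^T)) -> (batch, heads, 50)
        A = F.softmax(self.W_s2(torch.tanh(self.W_s1(H))).transpose(1, 2), dim=-1)
        M = torch.bmm(A, H)                   # Eq (2): M = A H -> (batch, heads, 32)
        I = torch.eye(A.size(1), device=A.device).expand(A.size(0), -1, -1)
        # Eq (3): penalty pushing different heads to attend to different regions
        P = torch.linalg.matrix_norm(torch.bmm(A, A.transpose(1, 2)) - I) ** 2
        return M, A, P.mean()
```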

• The most recent tool, "STREME [36]", was used to discover ungapped sequence motifs. These motifs were then compared, using the "TOMTOM [37]" tool, with the PWMs obtained from the CNN convolution kernels.

The hyperparameters of the sub-model with the highest accuracy are set as the global optimum. The Adam stochastic optimization method [38] is also used, with a learning rate of 0.001 and a learning rate decay of 5e-5.
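For context, the following hedged PyTorch sketch assembles a multiscale-CNN → BLSTM → self-attention stack in the spirit of iEnhancer-CLA and sets up Adam with the stated learning rate and decay. The kernel widths, channel counts, 200-nt one-hot input, and the per-step decay emulation are illustrative assumptions, not the paper's exact configuration; it reuses the StructuredSelfAttention module sketched above.

```python
import torch
import torch.nn as nn

class IEnhancerCLAStyle(nn.Module):
    """Illustrative multiscale CNN -> BLSTM -> self-attention stack
    (assumed layer sizes; reuses StructuredSelfAttention from above)."""
    def __init__(self, n_classes=3, heads=6):
        super().__init__()
        # parallel convolutions of different widths over one-hot DNA (4 channels)
        self.convs = nn.ModuleList(
            nn.Conv1d(4, 16, k, padding=k // 2) for k in (3, 5, 7)
        )
        self.pool = nn.MaxPool1d(4)  # 200 positions -> 50 time steps
        self.blstm = nn.LSTM(48, 16, batch_first=True, bidirectional=True)
        self.attn = StructuredSelfAttention(hidden=32, d_attn=60, heads=heads)
        self.fc = nn.Linear(heads * 32, n_classes)

    def forward(self, x):  # x: (batch, 4, 200) one-hot sequences
        c = torch.cat([torch.relu(conv(x)) for conv in self.convs], dim=1)
        h, _ = self.blstm(self.pool(c).transpose(1, 2))  # (batch, 50, 32)
        M, A, P = self.attn(h)
        return torch.sigmoid(self.fc(M.flatten(1))), A, P

model = IEnhancerCLAStyle()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# emulate a Keras-style per-step decay: lr = lr0 / (1 + 5e-5 * step)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: 1.0 / (1.0 + 5e-5 * step)
)
```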

The iEnhancer-CLA model uses five-fold cross-validation, dividing the dataset into five parts, four of which are used for model training and the remaining one for evaluation. The final iEnhancer-CLA model is an ensemble of five models, and the prediction for an enhancer is the average of the predictions of the five models. For a classification problem, the training goal is to minimize the difference between the prediction vector and the true label vector. An enhancer sample is denoted as x_i, and its corresponding true label vector is denoted as y_i. Each element of y_i is a binary value, denoted as y_ij, j ∈ {1, 2, 3}, indicating whether the sample belongs to a certain category. If y_ij is 1, then x_i belongs to category j; otherwise, it does not. A binary cross-entropy loss function is applied to process each category independently. The loss of sample x_i on category j is defined as follows:

$$L(x_i, j) = -\left[ y_{ij}\log p_{ij} + (1 - y_{ij})\log\left(1 - p_{ij}\right) \right]$$

where p_ij ∈ [0, 1] is an element of p_i, which represents the prediction vector of sample x_i. Given the three categories, the loss of sample x_i is the sum of the losses over the individual categories.
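As a concrete illustration (a minimal PyTorch sketch; the tensor values are toy examples), the per-sample loss can be computed as the sum of independent per-category binary cross-entropies:

```python
import torch
import torch.nn as nn

# Per-category binary cross-entropy, summed over the three categories,
# matching the loss described above. p and y have shape (batch, 3).
bce = nn.BCELoss(reduction="none")

def sample_loss(p, y):
    # sum of the per-category BCE losses for each sample
    return bce(p, y).sum(dim=1)

p = torch.tensor([[0.9, 0.1, 0.2]])  # predicted probabilities p_ij
y = torch.tensor([[1.0, 0.0, 0.0]])  # binary labels y_ij
print(sample_loss(p, y))             # tensor([0.4339])
```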

In "ggseqlog", the difference in the intensity of the base signals of the enhancers and 206 non-enhancers were found at the first and last ends. Some problems from it were also 207 found, as the base signals of the strong and weak enhancers are not obvious. One 208 possible reason is that the same sequence features do not appear at the same positions. 209 Therefore, the binning signals in the middle of the enhancer sequences were analyzed 210 and visualized using the "GLAM2" tool in the MEME suite, a method that could find 211 the same sequence features at different positions in different sequences. In Fig 3, the 212 localization signals of 300 randomly selected non-enhancer, strong-enhancer, and 213 weak-enhancer sequences were visualized. All parameters were used as default values for 214 the "GLAM2" online tool.

The internal signals of the two kinds of enhancers and of the non-enhancers obtained with "GLAM2" are shown in Fig 3. Combining the two methods, the following conclusions can be drawn from Figs 2 and 3.

The distribution of bases across different non-enhancer sequences is similar, with higher intensity of the base A and T signals. The most frequent base combination in the non-enhancers was found to be (RE: …). Using "GLAM2", the most frequent base sequence of the strong enhancers was found to be (RE: … A/T-A/G-A); thus, it is concluded that base G is more enriched in strong enhancers. The different weak enhancer sequences also have a similar base distribution, with higher intensity of the base A and T signals. The most frequent base combination in the weak enhancers found by "GLAM2" was (RE: A/C-A/T-A/CG-A/G-A/G-A/G-A/G-X-A/CG-A/G-A/G-X-A/G-A/G-A/G-A/G-X-A/G-A/G-A/CG-A/T-A/G). Thereby, base A is more enriched in the weak enhancers.
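These compositional claims can be spot-checked by simply counting base frequencies per class; below is a minimal Python sketch in which the toy sequences stand in for the real strong- and weak-enhancer FASTA records (names and sequences are illustrative only).

```python
from collections import Counter

def base_fractions(seqs):
    """Fraction of each base across a set of DNA sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(s.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}

# toy sequences standing in for real strong/weak-enhancer records
strong = ["GGCGTAGGCC", "GCGGATGCGG"]
weak = ["AATGAGAAGA", "AGGAATAAGA"]
print(base_fractions(strong))  # G/C-skewed, cf. the strong-enhancer finding
print(base_fractions(weak))    # A-skewed, cf. the weak-enhancer finding
```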

Analysis of CNN motifs

We first used the "STREME" tool to analyze the three kinds of sequences separately, obtaining 10 ungapped motifs from the non-enhancer sequences and 4 ungapped motifs from the strong enhancer sequences. Finally, we used the TOMTOM tool to map the motifs learned by each convolutional kernel to the motifs obtained in STREME for each of the three kinds of sequences. The matches also show that the CNN layer is able to learn useful sequence features in the model.
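For illustration, the following minimal NumPy sketch shows one common way to "decouple" a convolution kernel into a PWM by aligning its most strongly activating subsequences; the function name and the activation-threshold recipe are assumptions, and the paper's exact procedure may differ. The resulting PWMs can then be written in MEME minimal motif format and compared against the STREME motifs with TOMTOM.

```python
import numpy as np

def kernel_pwm(filter_w, onehot_seqs, frac=0.75):
    """Build a position weight matrix for one convolution kernel by
    aligning the subsequences that activate it most strongly (a common
    'decoupling' recipe; details may differ from the paper's).
    filter_w: (k, 4) kernel weights; onehot_seqs: (n, L, 4) one-hot DNA."""
    k = filter_w.shape[0]
    hits = []
    for seq in onehot_seqs:
        acts = np.array([np.sum(seq[i:i + k] * filter_w)
                         for i in range(seq.shape[0] - k + 1)])
        best_i = int(acts.argmax())
        hits.append((acts[best_i], seq[best_i:best_i + k]))
    top = max(a for a, _ in hits)
    windows = [w for a, w in hits if a >= frac * top]  # strongly activating windows
    counts = np.sum(windows, axis=0) + 1e-3            # base counts + pseudo-count
    return counts / counts.sum(axis=1, keepdims=True)  # row-normalized (k, 4) PWM
```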

The motifs of the strong and weak enhancers also each have matching PWMs.

Combining the above six sets of comparisons, we can conclude that the first CNN layer of this model learns many useful sequence features that contribute to the model's results. At the same time, the ungapped motifs obtained in "STREME" are combined with the results of the previous two tools to further support our conclusions. The high signal intensity of bases A and T in the non-enhancer motifs in Figs 4A and 4B shows that the content of bases A and T in the non-enhancer sequences is higher than that of other bases, and that the combination of bases A and T is characteristic of non-enhancer sequences. Similarly, the motifs obtained for the strong enhancers in "STREME", shown in Figs 4C and 4D, have higher signal intensity for bases G and C than for other bases, indicating that G and C are more enriched in strong enhancers and that the combination of G and C is the sequence characteristic of strong enhancers. The motifs obtained for the weak enhancers, shown in Figs 4E and 4F, have higher signal intensity for bases A and G, indicating that A and G are more enriched in weak enhancers and that the combination of A and G is the sequence characteristic of weak enhancers.

Visualization of attention weights on sequences

In this work, the self-attention layer learns the features of each sequence and generates the sequence's self-attention weights. With these weights, we can find the differences in base combinations between different kinds of sequences. First, sequences were randomly selected from the three kinds of sequences, and their attention weights were visualized separately, as shown in Fig 5. Then, 10 sequences of each kind were selected and their weights averaged, as shown in Fig 5D.

The base sequences covering the largest peak regions were extracted from the randomly selected sequences to determine which combinations of bases give the sequences higher weights (Fig 5A).

The average weights for the three kinds of sequences are shown in Fig 5D. Although the general trends are the same, some differences exist between the kinds of sequences. At positions 1 to 30 nt, the weight of the weak enhancers is significantly higher than that of the others. At 60 to 80 nt, the weight of the non-enhancers is higher than that of the enhancer sequences. At 120 to 140 nt, the weight of the strong enhancers is higher than that of the other kinds of sequences and shows an obvious upward trend, while the weak enhancers show an obvious downward trend in this segment. This result shows that the main differences between strong-enhancer, weak-enhancer, and non-enhancer sequence features are concentrated at these three positions.
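As an illustration of how Fig 5D-style profiles can be produced, here is a minimal matplotlib sketch that averages attention weights over heads and sequences; the random arrays are placeholders for attention matrices actually collected from the model's self-attention layer.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# placeholder attention matrices of shape (n_seqs, heads, seq_len);
# in practice these come from the model's self-attention layer (A above)
weights = {name: rng.random((10, 6, 200))
           for name in ("non-enhancer", "strong enhancer", "weak enhancer")}

for name, A in weights.items():
    # per-position weight: sum over heads, then average over sequences
    plt.plot(A.sum(axis=1).mean(axis=0), label=name)
plt.xlabel("sequence position (nt)")
plt.ylabel("mean attention weight")
plt.legend()
plt.show()
```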

The PWMs extracted from the CNN layer and the sequence self-attention weights generated by the self-attention layer show that the model can learn the sequence features of different kinds of sequences and use them for prediction. We can also clearly see what sequence features are learned in the first layer of the model and what features are learned in the last layer, thus achieving interpretability of the model. In addition, the base sequences corresponding to the highest self-attention weights of each sequence, combined with the results of the previous three methods, demonstrate the differences in base combination and base content among the three kinds of sequences.

Comparison between iEnhancer-CLA and existing methods and tools

Given that the proposed model is the first single-layer multiclass enhancer prediction model, comparing it with existing two-layer models was difficult. Thus, several classical two-layer enhancer models were selected for comparison.

The proposed method was compared with the classical methods on the same independent test dataset to objectively evaluate prediction performance. The following four evaluation metrics were selected: ACC, AUC, SP, and MCC.
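For reference, ACC, SP, and MCC have their standard confusion-matrix definitions in terms of TP, TN, FP, and FN (AUC is the area under the ROC curve):

```latex
\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{SP} = \frac{TN}{TN + FP}, \qquad
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
```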

As shown by the results in Table 1, although this work performs multiclass prediction, the results were still better than those of the classical methods under the chosen evaluation metrics. In particular, for the prediction of strong and weak enhancers, the ACC values were significantly higher than those of the previous classical models. The established model was able to perform multiclass prediction in a single layer and showed better accuracy than the previous classical models.

In the previous two-layer models, one layer was used to distinguish whether a sequence is an enhancer, and the other layer was used to distinguish the strength of the enhancer. These computational methods, after years of research and development, can distinguish enhancers more accurately than before. However, the two-layer design also limits the identification of enhancers: many subtypes of enhancers exist, not only strong and weak enhancers. In future research, the roles of more enhancer subtypes, and computational methods to distinguish and predict them, must be developed and studied. The importance of the proposed multi-label single-layer multiclass prediction model was clearly demonstrated in the present work. Table 2 shows the results for the training set, obtained using five-fold cross-validation. The results on the independent test set, and the results on a dataset with balanced non-enhancer and enhancer sequences, are given in the supplementary file (sec. 3). The model was able to identify enhancers more accurately than the others, as indicated by the results on the independent test set.

The model was able to generate sequence-related attention weights and visualize the sequences using the sequence analysis tools "ggseqlogo", "GLAM2", and "STREME". The differences between the two kinds of enhancers and the non-enhancers were identified, which could be useful for future research on enhancers.