Capsule network for protein ubiquitination site prediction

Ubiquitination is one of the most important protein posttranslational modifications and is involved in many biological processes. Traditional experimental methods for determining ubiquitination sites are expensive and time-consuming, whereas computational prediction methods can identify ubiquitination sites accurately and efficiently. This study combined a convolutional neural network and a capsule network to design a deep learning model, "Caps-Ubi," for multispecies ubiquitination site prediction. Two encoding methods, one-of-K and amino acid continuous encoding, were used to characterize the sequence pattern of ubiquitination sites. The proposed Caps-Ubi predictor achieved an accuracy of 0.91, a sensitivity of 0.93, a specificity of 0.89, a Matthews correlation coefficient of 0.83, and an area under the receiver operating characteristic curve of 0.96, outperforming the other tested predictors.


Introduction
Ubiquitination is an important posttranslational modification of proteins, consisting of the covalent binding of ubiquitin to a variety of cellular proteins. Ubiquitin was discovered in 1975 by Goldstein et al. [1]; it is a small protein composed of 76 amino acids [2]. Ubiquitination is the process in which a lysine of a substrate protein is covalently bound to the small ubiquitin molecule through the action of a series of enzymes. Three classes of enzymes are involved in the process: E1 activating enzymes, E2 conjugating enzymes, and E3 ligases. Ubiquitination plays a very important role in fundamental processes such as signal transduction, cell disease, DNA repair, and transcription regulation [3][4][5][6].
Due to the important biological characteristics of ubiquitination, identifying potential ubiquitination sites helps to understand protein regulation and molecular mechanisms. Determining ubiquitination sites based on traditional biological experimental techniques such as mass spectrometry [7] and antibody recognition [8] is costly and time-consuming. Therefore, it is necessary to develop a calculation method that can accurately and efficiently recognize protein ubiquitination. In recent years, some calculation methods have been developed to predict potential ubiquitination sites.
Huang et al. [9] used amino acid composition (AAC), a position weighting matrix, amino acid pair composition (AAPC), a position-specific scoring matrix (PSSM), and other information to develop a predictor called UbiSite using a support vector machine (SVM). Nguyen et al. [10] used an SVM to combine three kinds of information: AAC, evolution information, and AAPC to develop a predictor. Qiu et al. [11] developed a new predictor called "iUbiq-Lys" to apply to sequence evolution information and a gray system model. Chen et al. [12] also applied SVM to build a UbiProber predictor. Wang et al. [13] introduced physical-chemical attributes into an SVM to develop the ESA-UbiSite predictor. Radivojac et al. [14] developed the predictor UbPred using a random forest algorithm. Lee et al. [15] developed UbSite using efficient radial basis functions. All of those machine learning-based methods and predictors have promoted the development of ubiquitination site prediction research and achieved good prediction performance. However, most of them rely on artificial feature selection, which may lead to imperfect features [16], and their datasets are small despite the large volume of accumulated biomedical data.
Deep learning, the most advanced machine learning technology, can handle large-scale data well. It has multilayer networks and nonlinear mapping operations, which can fit the complex structure of data well. In recent years, deep learning has been developed rapidly [16] and has been successfully applied in various fields of bioinformatics [17,18]. Some methods based on deep learning have been used for ubiquitination site identification. For example, Fu et al. [19] applied one-hot and composition of k-spaced amino acid pairs encoding methods to develop DeepUbi with text-CNN. Liu et al. [20] used deep transfer learning methods to develop the DeepTL-Ubi predictor for multispecies ubiquitination site prediction. He et al. [21] established a multimodel predictor using one-hot, physical-chemical properties of amino acids, and a PSSM.
Although various ubiquitination site predictors and tools have been developed, there are still some limitations, and their accuracy and other performance elements must be further improved. In this paper, a deep learning model, "Caps-Ubi," is proposed that uses a capsule network for protein ubiquitination site prediction. In Caps-Ubi, the protein fragments are first passed through one-of-K and amino acid continuous methods to encode them. Then three convolutional layers and the capsule network layer are used as a feature extractor to obtain the functional domains in the protein fragments and finally to get the prediction result. Relative to existing tools, the prediction performance of Caps-Ubi is a significant improvement. Researchers could use the predictor to select potential ubiquitination candidate sites and do experiments to verify them, which will reduce the range of protein candidates and save time.

Benchmark dataset
The ubiquitination dataset came from the largest online protein lysine modification database. To eliminate errors caused by homologous sequences, we used CD-HIT [22] to filter out homologous sequences with similarities greater than 40%, which yielded 12,100 proteins and 54,586 ubiquitination sites; these sites formed the positive sample set. Based on those annotated sequences, 427,305 nonubiquitinated sites were extracted from the same proteins as a negative sample set, and CD-HIT-2D [23] was used to remove sequences with more than 50% similarity between the two sets. To train a balanced model, we randomly selected negative samples equal in number to the positive sample set, and we used 90% of the data as the training and validation sets and 10% as the independent test set. Finally, 53,999 ubiquitination-site samples and 50,315 nonubiquitination-site samples were obtained. The final data division is shown in Table 1.

Input sequence coding
The coding method directly determines the quality of its prediction results; a good feature can extract the correlation between the ubiquitination feature and the targets from peptide sequences [24]. After encoding the protein sequence, the sequence information is converted into digital information, and then deep learning is done on it. In this study, two methods were used to encode the amino acid sequence around the protein ubiquitination site; namely, one-of-K encoding and amino acid continuous encoding.

One-of-K encoding
The one-of-K encoding method was adopted for protein fragments: each fragment was encoded into an m × k 2D matrix, where m is the number of amino acids in each sequence (that is, the length of the input window) and k is the number of amino acid types. There are 20 common amino acids; when an input sequence did not reach the window length, it was padded with a "-" character on the left or right side of the fragment, which was treated as an additional amino acid, so k = 21.
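As a sketch of this scheme (illustrative code, not the authors' implementation), each residue maps to a length-21 one-hot row, with "-" treated as the 21st symbol:

```python
import numpy as np

# 20 standard amino acids plus '-' for the padding character.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def one_of_k_encode(fragment: str) -> np.ndarray:
    """Encode a peptide fragment as an m x 21 one-hot matrix."""
    mat = np.zeros((len(fragment), len(ALPHABET)), dtype=np.float32)
    for i, aa in enumerate(fragment):
        mat[i, ALPHABET.index(aa)] = 1.0
    return mat

# A 7-residue window centered on lysine (K), padded on the right:
x = one_of_k_encode("MKVAK--")
print(x.shape)  # (7, 21)
```

Each row contains exactly one 1, so the matrix preserves the residue identity at every window position.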

Continuous coding of amino acids
The continuous amino acid coding method [25] was proposed by Venkatarajan et al., who used five principal components to characterize the variation of 237 physical-chemical properties across the 20 amino acids. In this paper, each amino acid is represented by a 6D vector: the first five dimensions are the five principal components given in Table 1 of [25], and the last dimension marks a gap in the input protein fragment of length m. A gap is represented by a dash "-"; when the sequence does not reach the window length, this bit is coded as 1, otherwise 0. Finally, each protein fragment is coded into an m × 6 2D matrix. This continuous coding scheme comprehensively reflects the physical and chemical properties of amino acids and has a smaller dimension than one-of-K coding; the smaller input dimension leads to a relatively simple network structure, which helps avoid overfitting.
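A minimal sketch of this encoding follows; the five principal-component values per amino acid come from Table 1 of [25], so the zeros used here are placeholders only, not the published components:

```python
import numpy as np

# Placeholder table: in practice, fill in the five principal-component
# values per amino acid from Table 1 of Venkatarajan et al. [25].
PC5 = {aa: [0.0] * 5 for aa in "ACDEFGHIKLMNPQRSTVWY"}

def continuous_encode(fragment: str) -> np.ndarray:
    """Encode a peptide fragment as an m x 6 matrix:
    5 principal components + 1 gap flag per residue."""
    rows = []
    for aa in fragment:
        if aa == "-":
            rows.append([0.0] * 5 + [1.0])  # gap: flag bit set to 1
        else:
            rows.append(PC5[aa] + [0.0])    # real residue: flag bit 0
    return np.asarray(rows, dtype=np.float32)

x = continuous_encode("MKV--")
print(x.shape)  # (5, 6)
```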

Capsule network
In a CNN, the pooling layer can extract valuable information from the data, but some location information is lost [26]. A capsule network instead replaces scalar neurons with vector capsules, whose length encodes the probability that an entity is present. The output v_j of capsule j is obtained by squashing its total input s_j:

v_j = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||)    (1)

The total input s_j is a weighted sum over the prediction vectors û_(j|i) from the capsules in the layer below, where each prediction vector is the product of a weight matrix W_ij and the output u_i of capsule i:

s_j = Σ_i c_ij û_(j|i)    (2)

û_(j|i) = W_ij u_i    (3)

where c_ij is the coupling coefficient, which is obtained by a softmax transformation from the logits b_ij; its calculation equation is

c_ij = exp(b_ij) / Σ_k exp(b_ik)    (4)

In Eq. (4), the coupling coefficients between capsule i in the previous layer and all capsules in the next layer sum to 1. The coupling coefficients are obtained through a dynamic routing mechanism; the pseudocode is as follows:

procedure ROUTING(û_(j|i), r, l)
    for all capsules i in layer l and capsules j in layer (l + 1): b_ij ← 0
    for r iterations do
        for all capsules i in layer l: c_i ← softmax(b_i)          ⊳ Eq. (4)
        for all capsules j in layer (l + 1): s_j ← Σ_i c_ij û_(j|i)
        for all capsules j in layer (l + 1): v_j ← squash(s_j)     ⊳ Eq. (1)
        for all capsules i and j: b_ij ← b_ij + û_(j|i) · v_j
    return v_j

The model is trained with a boundary (margin) loss for each category k:

L_k = T_k max(0, m+ − ||v_k||)^2 + λ(1 − T_k) max(0, ||v_k|| − m−)^2    (5)

where k indexes the categories, T_k is the real label (1 for ubiquitinated, 0 for nonubiquitinated), and ||v_k|| is the output length of the kth capsule, which is the predicted probability of the kth class. The upper boundary m+ is 0.9, which penalizes false positives, and the lower boundary m− is 0.1, which penalizes false negatives. λ is a proportional coefficient of 0.5 that controls the loss caused when some categories do not appear, preventing the capsule vector lengths of all categories from shrinking in the early stage of training. The total loss is the sum of the losses over all categories.
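The capsule squashing function and the boundary (margin) loss from the capsule network of [26] can be sketched in a few lines of NumPy (an illustrative sketch, not the authors' code):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink a capsule's input vector to length in (0, 1)
    while preserving its direction."""
    sq = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def margin_loss(T, v_len, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Boundary loss summed over categories.
    T: 0/1 label per class; v_len: output capsule lengths ||v_k||."""
    L = (T * np.maximum(0.0, m_pos - v_len) ** 2
         + lam * (1 - T) * np.maximum(0.0, v_len - m_neg) ** 2)
    return L.sum()

v = squash(np.array([3.0, 4.0]))        # length 25/26, direction kept
loss = margin_loss(np.array([1, 0]), np.array([0.95, 0.05]))
print(loss)  # 0.0: correct class above 0.9, wrong class below 0.1
```

A confident, correct prediction (positive capsule longer than 0.9, negative capsule shorter than 0.1) incurs zero loss, which is exactly the behavior the two boundaries encode.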

Architecture design
As shown in Figure 1, the structure of the proposed model contains two identical subnetworks that process the one-of-21 and amino acid continuous encoding modes. After training in their respective networks, the two subnetworks merge their features to produce the final output. Each subnetwork consists of the same three 1D convolutional layers (Conv1, Conv2, Conv3) and a capsule network layer. The first convolutional layer (Conv1) comprises 256 1D convolution kernels with a size of 1 and a step size of 1 and uses the ReLU activation function. A convolution kernel of length 1 first appeared in Network in Network [29]; it reduces the complexity of the model and allows the network to be made deeper and wider. Here it acts as a feature filter that pools features across the two encoding modes. The second convolutional layer, Conv2, is a conventional convolutional layer with 256 1D convolution kernels with a length of 7 and a step size of 1; it functions as a local feature detector that converts the protein sequence input into corresponding local features, which can be understood as the functional-domain characteristics of the protein. Its output is used as the input of the next layer, Conv3, which has 256 1D convolution kernels with a size of 11 and a step size of 1, uses the ReLU activation function, and applies a dropout mechanism with a random deletion rate of 0.3 to prevent the model from overfitting.
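A quick way to check the sequence lengths flowing through one subnetwork is the standard 1D convolution output formula. The walkthrough below assumes 'valid' (no-padding) convolutions and an illustrative window length of m = 31, which are assumptions for the sketch rather than details stated in the paper:

```python
def conv1d_out_len(n, kernel, stride=1, padding=0):
    """Output length of a 1D convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

m = 31                        # illustrative window length
l1 = conv1d_out_len(m, 1)     # Conv1: 256 kernels of size 1  -> 31
l2 = conv1d_out_len(l1, 7)    # Conv2: 256 kernels of size 7  -> 25
l3 = conv1d_out_len(l2, 11)   # Conv3: 256 kernels of size 11 -> 15
print(l1, l2, l3)  # 31 25 15
```

The size-1 kernels of Conv1 leave the sequence length unchanged (acting purely as a per-position feature filter), while Conv2 and Conv3 shorten it as they aggregate local context.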

Model training
For model training, we used the Adam [30] optimization algorithm, which automatically adjusts the learning rate of the parameters, improves the training speed, and improves the stability of the model. The learning rate was 0.003, the first-moment exponential decay rate was 0.9, and the second-moment exponential decay rate was 0.999. The dynamic routing mechanism was consistent with that in the original paper [26], the number of routing iterations was 3, the boundary loss function shown in Eq. (5) was used as the loss function of the model, and the number of training iterations was 50 epochs. The deep learning framework used by this model was Keras 2.1.4, a highly modular deep learning framework based on Theano, written in Python, that supports both CPU and GPU. The programming language was Python 3.5, and the model was trained and tested on a Windows 10 system equipped with an Nvidia RTX 2060 GPU.
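For reference, a single Adam parameter update with the quoted hyperparameters (learning rate 0.003, decay rates 0.9 and 0.999) can be sketched as follows; this is a textbook illustration of the update rule, not the Keras internals:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.003, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: moving averages of the gradient and its square,
    bias-corrected, then a scaled step against the gradient."""
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(np.array([1.0]), np.array([0.5]),
                        np.zeros(1), np.zeros(1), t=1)
print(theta)  # first step moves theta by ~lr against the gradient sign
```

On the first step the bias-corrected moments cancel the gradient magnitude, so the parameter moves by approximately the learning rate (here 1.0 → 0.997).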

Model evaluation and performance indicators
A confusion matrix is a visual display tool used to evaluate the quality of classification models. Each row of the matrix represents the actual condition of the sample, and each column represents the predicted condition. From its counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), we report sensitivity (Sn), specificity (Sp), accuracy (Acc), the Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic curve (AUC).
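The common indicators derived from a confusion matrix (sensitivity, specificity, accuracy, and the Matthews correlation coefficient, the values reported in this paper) can be computed from its counts as follows; the counts used in the example are illustrative only:

```python
import math

def metrics(tp, fp, tn, fn):
    """Standard binary-classification indicators from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return sn, sp, acc, mcc

print(metrics(90, 10, 90, 10))  # (0.9, 0.9, 0.9, 0.8)
```

Unlike accuracy, MCC stays informative under class imbalance, which matters here because the natural ratio of positive to negative peptides is far from 1:1.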

Experimental results
First, we did many experiments on the selection of the window size of protein fragments.
Because the correlation information between amino acids has a direct effect on the prediction results, we needed to determine an appropriate window size. Previous studies directly used empirical values such as 21, 33, or 49; however, different data, models, and classifiers tend to favor different window sizes [31]. Therefore, window lengths n from 21 to 75 were tested in a series of experiments. For each window length, we encoded all training data in the two input modes and trained the respective subnetworks, and we selected the appropriate window size according to the prediction results on the validation set. Figure 2 shows the performance of the various window sizes in the one-of-21 and amino acid continuous encoding modes. We set the LabelCapsule layer to a 128 × 1 fully connected layer; the comparison results are shown in Table 2 and Figure 3, which show that the predictions were very close to the real situation. Because the model was trained on balanced data, whereas in the experimentally verified ubiquitination and nonubiquitination dataset [19] the ratio of positive to negative peptides is 1:8, we also tested Caps-Ubi on data with this natural distribution. The test results are shown in Table 3.
According to the test results, the performance was slightly worse than that under the balanced data.

Comparison with other methods
In the past 10 years, many researchers have contributed to the prediction and research of protein ubiquitination sites. We compared the proposed model with other sequence-based prediction tools.
The corresponding data and results are shown in Table 4, which shows that the performance of the proposed Caps-Ubi is superior to that of the other tools.

Conclusion and outlook
In this paper, a new deep learning model for predicting protein ubiquitination sites is proposed, using one-of-K and amino acid continuous coding modes. We used the largest available protein ubiquitination site dataset, and the experimental results above verify the effectiveness of this model.
The operation of the model has four main steps: encoding the protein sequences, constructing the convolutional layers, constructing the capsule network layer, and constructing the output layer. The capsule network introduces a new building block for deep learning. Relative to a CNN, the capsule network, which uses a dynamic routing mechanism to update parameters, requires more training time, but the time required for prediction is similar. The capsule network can also characterize the complex relations among amino acids at various sequence positions and can explore the internal data distribution related to biochemical significance. The proposed Caps-Ubi prediction tool will facilitate the sequence analysis of ubiquitination and can also be used to identify other posttranslational modification sites in proteins. In the future, we will study other features that may better extract sample attributes and construct deeper models.