iHyd-ProSite: A Novel Computational Approach for Identifying Hydroxylation Sites in Proline via Mathematical Modeling

Post-translational modifications (PTMs) of proteins play a vital role in many cellular functions. A PTM arises when a functional group is covalently attached to a protein. A number of PTMs have been identified that are closely linked with diseases such as cancer and neurological disorders. Hydroxylation is one such PTM; it modifies proline residues within a polypeptide sequence. Defective hydroxylation of proline, as occurs when ascorbic acid is deficient in humans, produces scurvy and is associated with many other serious health issues. Predicting hydroxylation sites at proline residues therefore remains a challenging problem. Experimental identification of hydroxyproline sites is difficult, expensive and time-consuming. The diversity of protein sequences motivates the development of a computational tool that identifies hydroxylated sites quickly and with high prediction accuracy. In this work, a novel in silico predictor is developed through rigorous mathematical modeling to identify which proline sites are hydroxylated and which are not. The performance of the predictor was verified with three validation tests, namely the self-consistency test, the cross-validation test and the jackknife test, over the benchmark dataset. The jackknife results were compared with those of previous methods, and the proposed tool was more accurate than the existing predictors. The scheme is therefore a useful alternative to previous predictors.

In mammals, collagens are extremely abundant proteins whose proline residues are modified by chemical processes such as hydroxylation to produce hydroxyproline [1]. Collagens are long and fibrous, and they make up almost a quarter of the total protein in mammals [2]. Collagen is used medicinally in wound healing [3], burn treatment and cosmetic surgery [4,5]. Several major human diseases, such as stomach and lung cancer [6,7], are closely related to defects and irregularities in the hydroxylation process. Thus the identification of hydroxyproline (HyP) sites in proteins provides data valuable to both biomedical research and drug development [8].

Hydroxyproline obtained by the conversion of a proline (Pro) residue. The figure shows that hydroxylation attaches an -OH group to proline (Pro), converting a CH group to COH and thereby modifying the proline residue into hydroxyproline.
A number of scientists have contributed [9][10][11][12] to understanding cellular biological processes and to finding medicines for cancer and various other diseases. Identifying hydroxylation sites in the laboratory by mass spectrometry [13] is difficult, expensive and very lengthy. Every day a large number of protein sequences are deposited in data banks, and classifying them according to their functional properties is crucial. It is therefore worthwhile to build an efficient computational predictor that classifies targeted hydroxylation sites within polypeptide sequences with improved prediction accuracy. Researchers have developed several methods in this regard. Still, these previous methods do not incorporate all the feature components hidden in the polypeptide sequence, which makes exact prediction difficult.

Many scientists have shown great interest in the hydroxylation process. Colgrave et al. [1] quantified hydroxyproline using multiple reaction monitoring mass spectrometry. A mathematical model has been developed to understand microbial activity and microbial communities [14]. A system was defined to study the insufficiency of collagen in connective tissues caused by a lack of ascorbic acid [15]. Halme et al. [16] and Kivirikko et al. [17] described the separation and classification of highly purified protocollagen proline hydroxylase, as well as proline hydroxylation in synthetic proteins with pure procollagen hydroxylase. In the human proteome, the functional character of proline and polyproline, based on distribution, frequency and positioning, was investigated by Morgan et al. [18]. Yamauchi et al. [19] elaborated on the hydroxylation of lysine and the cross-linking of collagens.
Using position weights of 8 high-quality amino acid indices and support vector machines, Shi, Shao-Ping, et al. [20] proposed a technique named PredHydroxy for predicting proline and lysine hydroxylation sites. Moreover, the function of proline in variable environments and the metabolism of proline and hydroxyproline were examined in [21,22]. ZR Yang [23] developed a tool for the prediction of hydroxyproline sites using a support vector machine. A sequence-based formulation for identifying hydroxyproline and hydroxylysine was developed by Hu, Le-Le, et al. [24]. By incorporating dipeptide position-specific propensity into pseudo amino acid composition, Xu, Yan, et al. [8] predicted hydroxyproline and hydroxylysine in proteins. Qiu, Wang-Ren, et al. [25]

Acquiring a benchmark dataset is critical, as indicated by Chou's 5-step rule [26], which calls for a robust, diverse and improved dataset. To obtain a stringent benchmark dataset, two resources were used in the current study. One dataset was obtained from the universal protein database (UniProt), http://www.uniprot.org/, while the other was taken from the post-translational modification database dbPTM 3.0 [42]. The stringent benchmark datasets were obtained through the following two steps.

Step-1: The dataset extracted from the UniProt database contains two sets of protein sequences. One set comprises protein sequences hydroxylated at proline sites and is labeled as the positive sample. The other set consists of protein sequences not hydroxylated at proline sites and is labeled as the negative sample. A query was constructed to select polypeptide sequences annotated as hydroxyproline in the PTM/processing field. Only records annotated with an experimental assertion in the Feature Table (FT) were chosen.

After the described query was applied, a stringent benchmark dataset of hydroxyproline was obtained. It contained 816 hydroxylated and 24980 non-hydroxylated sequence records. After duplicates were removed, these were reduced to 782 and 24971 records, respectively.
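The duplicate-removal step can be illustrated with a minimal Python sketch. The source does not state how duplicates were defined, so the sketch assumes that two records are duplicates when their sequences are identical after trimming whitespace and ignoring case. The function name `remove_duplicate_sequences` and the toy sequences are hypothetical.

```python
def remove_duplicate_sequences(sequences):
    """Return the unique sequences in first-seen order.

    Assumption: duplicates are exact sequence matches after trimming
    whitespace and upper-casing; the source does not define the criterion.
    """
    seen = set()
    unique = []
    for seq in sequences:
        key = seq.strip().upper()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Toy records, not taken from UniProt or dbPTM.
positives = ["GPPGAP", "gppgap", "GAPGPP", "GPPGAP "]
unique_positives = remove_duplicate_sequences(positives)
print(len(positives), "->", len(unique_positives))
```

A stricter filter, for example one based on a sequence-identity threshold, would need a clustering tool and is not covered by this sketch.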

Step-2: Likewise, to obtain another stringent benchmark dataset the dbPTM 3.0

To identify target proline sites with hydroxylation, a methodology is proposed in line with the second and third steps of Chou's rule [26]. The technique incorporates all indispensable components of polypeptide sequences that can capture their correlation, so that the sequence is represented effectively.

An alternate formulation was also employed by Ehsan et al. [31,32], yielding a prominent prediction rate in proteomics problems. Consider a protein sample $C$ consisting of $Z$ amino acid residues:

$$C = U_1 U_2 U_3 \cdots U_Z \tag{1}$$
where $U_1$ indicates the first amino acid residue within string $C$, $U_2$ is the second

where the weight factors $T_1, T_2, T_3, \ldots, T_n$ depend upon the repeated terms of the

(2), whereas $i$ depends upon the number of compositions of residues of type $r$ in the concatenation. Moreover, a residue that does not occur is assigned a weight factor of zero, so that weight is neglected and only the weight factors of residues that do occur are considered.

separately assigned in (6). The matrix (6) is governed by constraint (7): when the pair $\chi(U_i, U_j)$ appears, $\omega_{i,j}$ equals 1; otherwise it is set to zero.
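The 0/1 rule for $\omega_{i,j}$ can be sketched in Python as follows. Equations (6) and (7) are not reproduced in this section, so the sketch rests on an assumption: a pair "appears" when residue type $i$ is immediately followed by residue type $j$ somewhere in the sequence. The names `AMINO_ACIDS`, `INDEX` and `pair_indicator` are hypothetical, and this is an illustration of the indicator idea rather than the authors' matrix.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def pair_indicator(sequence):
    """Build a 20 x 20 matrix with omega[i][j] = 1 if the ordered pair
    (residue i, residue j) occurs at adjacent positions, else 0.

    Assumption: 'the pair appears' means adjacency in the sequence.
    """
    omega = [[0] * len(AMINO_ACIDS) for _ in AMINO_ACIDS]
    for left, right in zip(sequence, sequence[1:]):
        if left in INDEX and right in INDEX:
            omega[INDEX[left]][INDEX[right]] = 1
    return omega


# Toy peptide, not taken from the benchmark dataset.
omega = pair_indicator("GPAGPP")
print(omega[INDEX["G"]][INDEX["P"]])  # pair G-P occurs
print(omega[INDEX["A"]][INDEX["P"]])  # pair A-P does not occur
```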
To understand the mechanism of the proposed model, consider the $i$th term $U_i$ of sequence (1), which represents the first amino acid residue in alphabetical order, say "A" (Fig. 2). This process continues until the next $U_j$ occurs at the $j$th position such that

Similarly, the same steps are carried out for $U_j$. The feature component corresponding to residue "A" is given in Eq. (10).

Fig. 2. A mechanism for sequence formulation. The figure gives a graphical demonstration of the feature-vector scheme for the residue "A", showing how "A" forms pairs with its contiguous residues in both directions up to the next residue.
The above set of twenty feature vectors depends upon three properties of amino acids, namely hydrophobicity, hydrophilicity and side-chain mass, which can be calculated using Eqs. (12) to (14). These equations can be extended to amino acid attributes other than these three properties. For $l$ extended properties of amino acids, a compact representation is given in Eq.
February 28, 2020 7/15

Or

where $\aleph^*_{i1}$, $\aleph^*_{i2}$, $\aleph^*_{i3}$ are the values of hydrophobicity, hydrophilicity and side-chain mass of the amino acid residues, normalized using Eq. (16) for the pair $U_i$ and $U_j$. The normalized values of the quantities given in Eqs. (12) to (14) lie in the interval $(-S, S)$, where $S$ is the normalizing count for amino acids.

Here the number 5 is used for normalization. The original values for hydrophobicity and hydrophilicity were taken from the main source employed by Ehsan et al. [31,32], while the values for the side-chain mass of amino acid residues can be found in any biochemistry textbook.
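A minimal Python sketch of normalizing a property scale with $S = 5$ follows. Eq. (16) is not reproduced in this section, so the sketch substitutes a plain linear (min-max) rescaling as an assumption; this maps the extreme values onto the endpoints $-S$ and $S$, whereas the text describes an open interval. The function name `rescale_to_range` and the property values are hypothetical placeholders, not the values used in the study.

```python
def rescale_to_range(values, s=5.0):
    """Linearly map a {residue: value} scale onto [-s, s].

    Assumption: a min-max rescaling stands in for Eq. (16), which is
    not available in this section.
    """
    lo = min(values.values())
    hi = max(values.values())
    if hi == lo:
        # A constant scale carries no information; map it to the centre.
        return {residue: 0.0 for residue in values}
    span = hi - lo
    return {
        residue: -s + 2.0 * s * (value - lo) / span
        for residue, value in values.items()
    }


# Placeholder property values for three residues (not real measurements).
raw_scale = {"G": 0.0, "A": 1.0, "W": 4.0}
scaled = rescale_to_range(raw_scale, s=5.0)
print(scaled)
```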

A neural network is a powerful tool for decision-making problems and for classifying patterns in diverse datasets. It is typically arranged in layers, learns from experience using input data and modifies its weights according to the data provided. Once the training process is finished, the system is able to assign each given input to a class within an acceptable level of

Chou's 5-step rule [26] is a notable guideline for building a useful predictor for a biological system, and it is helpful to follow it when developing a new predictor. A number of researchers [27][28][29][30]

sample may simultaneously belong to several classes), whose existence has become more frequent in systems biology [28,35], systems medicine [36] and biomedicine [37], a completely different set of metrics, as defined in reference [38], is needed.
Let ¶+ and ¶− represent all the correctly predicted positive and negative

predictors [8,25]. In the current study, all of the above test methods were employed to assess the performance of the proposed scheme. Additionally, for validation purposes, the benchmark datasets were taken from two sources, UniProt and dbPTM. The results obtained with both datasets are given in Table 1.
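The jackknife test named among the validation methods can be sketched in Python as a leave-one-out loop: each sample is held out in turn, the model is trained on the rest, and the held-out sample is predicted. The sketch uses a 1-nearest-neighbour rule purely as a lightweight stand-in; it is not the neural network used in the study. The function names and the toy feature vectors are hypothetical.

```python
def jackknife_predictions(samples, labels, train_and_predict):
    """Leave-one-out: predict each sample from a model built on all others."""
    predictions = []
    for k in range(len(samples)):
        train_x = samples[:k] + samples[k + 1:]
        train_y = labels[:k] + labels[k + 1:]
        predictions.append(train_and_predict(train_x, train_y, samples[k]))
    return predictions


def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier (assumption): label of the closest training point."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    best = min(range(len(train_x)),
               key=lambda n: squared_distance(train_x[n], query))
    return train_y[best]


# Toy two-dimensional feature vectors: 1 = hydroxylated, 0 = not.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
predictions = jackknife_predictions(samples, labels, nearest_neighbour)
print(predictions)
```

Because every sample is predicted by a model that never saw it, the jackknife yields a single result for a given dataset, at the cost of one training run per sample.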

Table 2 compares the proposed predictor with the existing techniques "iHyd-PseAAC" [8] and "iHyd-PseCp" [25]. A comparison was also made with the most recent publication, "iHyd-PseAAC (EPSV)" [32], for identifying hydroxyproline sites. The metrics for all these techniques were obtained with the jackknife test. It can be noticed from Table 2

This concept allows the predictor to separate and distinguish each instance carefully. Third, correlation is the principal concept used in computing the feature vector, which was assembled by considering each attribute group. Each expression deals with a specific metric and statistical measure. For convenience, every amino acid property was standardized numerically within a suitable range. In comparison with previously proposed methods, the predicted outcomes are also better than the former prediction rates.

Table 2. A comparison of the proposed predictor with the existing predictors using the jackknife validation test for the metrics given in Eq. (17).

Comparison made for the jackknife test using benchmark datasets obtained from (a) dbPTM and (b) UniProt database sources. It can be seen that the results obtained with the proposed predictor "iHyd-ProSite" are better than those of all previous methodologies.

Table 2 shows that the values of sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC) for the proposed predictor are higher than all the values obtained with former schemes. Sensitivity describes the proportion of correctly predicted hydroxylated sites, which is considerably larger than the values reported for previous methodologies. The stability of the predictor is measured by the MCC, and the MCC values obtained with the proposed scheme are greater than those reported above. The proposed scheme is therefore helpful for addressing such biological problems efficiently.
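The four metrics compared in Table 2 can be computed from confusion-matrix counts, as in the Python sketch below. Eq. (17) is not reproduced in this section, so the sketch assumes the standard definitions of sensitivity, specificity, accuracy and MCC. The function name `binary_metrics` is hypothetical, and the counts are made up for illustration; they are not results from the study.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion counts.

    tp/fn: hydroxylated sites predicted correctly/incorrectly.
    tn/fp: non-hydroxylated sites predicted correctly/incorrectly.
    """
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy, "MCC": mcc}


# Made-up counts for illustration only.
metrics = binary_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC is informative here because the benchmark data are highly imbalanced (far more non-hydroxylated than hydroxylated records), a setting in which accuracy alone can look high even when few positive sites are found.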