Exploring an experiment-split method to estimate generalization ability on new data: DeepKme as an example

A large number of predictors have been built on different data sets for predicting various post-translational modification sites. However, to the best of our knowledge, most of them give an overoptimistic estimate of their generalization ability on new data, because the cross-validation method has an intrinsic limitation: it does not consider the experimental sources of the new data. We therefore propose and explore a new method, the experiment-split method, which imitates a blinded assessment to address this overfitting problem on new data. The experiment-split method splits the training and test data according to the data's experimental sources, since new data can be regarded as data from a different experimental source. To illustrate the experiment-split method concretely, we apply it to a real case, DeepKme, a predictor we built for lysine methylation sites, and demonstrate how it is used in a realistic scenario. We compared the cross-validation method with the experiment-split method. The results suggest that the experiment-split method effectively relieves the overfitting seen with cross-validation and may be widely applicable in identification tasks involving multiple experiments. We believe DeepKme will stimulate deeper thought about the experiment-split method and the overfitting phenomenon among the relevant researchers and, of course, advance the study of lysine methylation and similar fields.

learning were also built for the prediction of lysine methylation sites, even if not specifically targeted at them.
However, the fatal shortcoming of computational prediction is that the false-positive proportion is high and has long been miscalculated because of unrepresentative test sets [26,28-32].
Although some predictors may eventually be vetted thoroughly, many experimentalists would rather obtain right away an actionable predictor that has been strictly assessed under conditions imitating practice than investigate many predictors themselves, which is very troublesome given the large variation across data sets, modeling methods, programming tools, and availability. The core of the overfitting phenomenon lies in how the test set is constructed, and the traditional way is the cross-validation method. To the best of our knowledge, most predictors give an overoptimistic estimate of their generalization ability on new data [29,30]. The immediate cause, the objective reason, the subjective reason, and the root causes are as follows.

1. The immediate cause is that the test set cannot unbiasedly represent the new data; in other words, the test set and the new data have different distributions.
2. The objective reason is that no old data can unbiasedly represent the new data.

Despite all this, it is still worthwhile to propose new methods that try to solve the problem.
Since we cannot remove the objective reason and root cause above, we propose and explore a new method, called the experiment-split method, that tries to solve the problem by changing our subjective factor: cross-validation, which only considers the general key distinguishing one sample from another. In the experiment-split method, we use a higher-level key, the experimental source, to split the data: each time, the data from one experimental source are placed in the test set and all other data in the training set. By contrast, cross-validation splits the data into several equal-size pieces and, each time, places one piece in the test set and the remaining pieces in the training set.

The data on lysine methylation sites were collected from several different sources (Fig 1) (S1 and S2 Tables), but mainly, in terms of quantity, from PhosphoSitePlus. During data cleaning, unreliable data were dropped, and the data were then organized so that all lysine methylation sites were represented in a uniform format (Table 1).
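The two splitting schemes described above can be sketched in a few lines of plain Python. This is a minimal illustration, not our actual pipeline; the sample records and experiment identifiers are hypothetical.

```python
from collections import defaultdict

def experiment_split(samples):
    """Leave-one-experiment-out: each experimental source in turn becomes
    the test set; data from all remaining sources form the training set."""
    by_source = defaultdict(list)
    for sample, source in samples:
        by_source[source].append(sample)
    for held_out in by_source:
        test = by_source[held_out]
        train = [s for src, grp in by_source.items()
                 if src != held_out for s in grp]
        yield train, test

def cross_validation_split(samples, k):
    """Traditional k-fold: split into k same-size pieces, ignoring sources."""
    data = [sample for sample, _ in samples]
    fold = len(data) // k
    for i in range(k):
        test = data[i * fold:(i + 1) * fold]
        train = data[:i * fold] + data[(i + 1) * fold:]
        yield train, test
```

The only difference is the grouping key: the experiment-split method groups by experimental source before holding data out, so a test set never shares an experiment with the training set.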
Table 1. The meaning of each header in the data table.

Column     Meaning
Entry      Primary (citable) accession number in UniProt.
Position   The position of the lysine in the protein, starting from 1.
SeqWin     The peptide sequence with the lysine in the center; its length is fixed at 61.
Evidence   The PubMed or CST ID corresponding to the site.
Sequence   The full protein sequence.
Similar to the practice of many previous researchers, we used sequence windows of length 61 with "K" (lysine) in the center; the remaining positions on either side hold the corresponding amino acids from the protein, or the padding character "_" where no amino acid exists. The last step was redundancy reduction, because the uniformed data obtained in the previous step may contain duplicates, depending on the definition of the unique primary key or id, and duplicated data are meaningless to some extent.
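The window construction described above can be sketched as follows; the function name is our own choice for illustration, not part of DeepKme's released code.

```python
def seq_window(sequence, position, width=61):
    """Return a fixed-width window centered on a 1-based lysine position,
    padding with '_' where the window runs past either end of the protein."""
    half = width // 2
    idx = position - 1                      # convert to 0-based index
    left = sequence[max(0, idx - half):idx]
    right = sequence[idx + 1:idx + 1 + half]
    return ("_" * (half - len(left)) + left
            + sequence[idx]
            + right + "_" * (half - len(right)))
```

For example, a lysine near the start of a short protein yields a window that is mostly padding, with "K" always at the central (31st) position.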
In our computational experiment, the key (id) was defined to be the sequence window, because it is the input of our model, corresponding to exactly one output in the form of a single value or a vector of values, and all inputs within the test set, training set, positive sample set, or negative sample set were unique. Note that we did not reduce redundancy below 100% similarity, although many others have claimed that this operation avoids overfitting or improves model performance. Using Entry and Position as the key of the data table, we can summarize the data size from the different sources (Table 2). PHP [33] is a constantly updated database of several common PTMs such as phosphorylation, acetylation, ubiquitination, methylation, etc. (Fig 2). Although we collected sites from multiple sources, most sites can be found in PHP, and the specific types (mono-, di-, and tri-methylation) are labeled for most of the sites. However, the key consisting of Entry and Position is not suitable for modeling; different keys yield different sample numbers (Table 3) and serve different functions (Table 4). A key consisting of SeqWin is therefore more appropriate to represent the sample set; the other keys can be found in Table 4.
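Redundancy reduction with SeqWin as the unique key amounts to keeping one record per distinct sequence window. A minimal sketch, with a hypothetical record layout:

```python
def dedupe_by_seqwin(records):
    """Keep the first record for each unique sequence window (the model's
    input), discarding exact duplicates of the primary key."""
    seen = set()
    unique = []
    for rec in records:
        key = rec["SeqWin"]
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Only exact duplicates of the key are dropped; windows below 100% similarity are deliberately kept, as stated above.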

There is only one way of splitting the negative samples into training and test sets: we randomly chose 20,000 sites from the 638,805 negative sites as the training set, and another 20,000 sites as the test set. We did not use 10-fold cross-validation on the negative sites because 20,000 is large enough to give stable estimates, and because of the extra computational cost it would incur. Details are summarized in Table 8.
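Drawing two disjoint random subsets of the negatives can be sketched as below; the function and seed are illustrative, and the counts follow the text.

```python
import random

def sample_negatives(negatives, n_train=20000, n_test=20000, seed=0):
    """Draw two disjoint random subsets of the negative sites:
    one for training and one for testing."""
    rng = random.Random(seed)
    picked = rng.sample(negatives, n_train + n_test)
    return picked[:n_train], picked[n_train:]
```

Sampling both subsets in one draw guarantees that no negative site appears in both the training and test sets.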
Table 8. The positive and negative sample sizes in the training set and testing set.

Model

A neural network has far greater fitting power than traditional machine learning models such as SVM or random forest. Loosely speaking, it can fit nearly any functional relation, so it is easy to make the performance on the training set perfect, e.g. 100% accuracy on both negative and positive samples. However, overfitting with respect to the test set is the main, if not the only, challenge we face: for instance, perfect performance on the training set but poor performance on the test set. There are many approaches to alleviate overfitting, such as restricting the weights through regularization, which narrows the decision space to a simpler relation following Occam's razor: entities should not be multiplied unnecessarily. Here, we use multi-task learning from the computer science field to build our model. Because the three cases of lysine methylation (mono-, di-, and tri-methylation) have very different sample volumes, with very few samples for di- and tri-methylation, a multi-task learning model was applied to share some model weights across the three cases so that each benefits, just as a student learning multiple subjects simultaneously can improve the grade in each subject, because some knowledge is more easily learned in one subject yet carries over to the others. An early stopping strategy was applied. The parameters are summarized in Table 9, and a graph representation of the model is shown in Fig 4.
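The early stopping strategy mentioned above is commonly implemented as patience-based monitoring of the validation loss. A minimal sketch, with a hypothetical patience value; the class is illustrative, not DeepKme's actual training code:

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved
    for `patience` consecutive epochs."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Training halts once the counter reaches the patience threshold, keeping the weights associated with the best validation loss seen so far.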

Results and Discussion
For the traditional method, 10 multi-task learning models were trained in each fold with random initialization, using the early stopping strategy. The results showed that mono-methylation has the best average performance, followed, in descending order, by *-, tri-, and di-methylation (Fig 7) (Tables 10, 11 and 12). Note that there were fewer data for Km3 than for Km2, suggesting that the features of Km3 are more evident than those of Km2. For the experiment-split method, the AUC was computed for mono-, di-, tri-, and *-methylation (* represents any of mono-, di-, and tri-) respectively (Table 11) (S6 Table). Considering the imbalance of the test set sizes, we used the sizes as weights; the weighted average AUC is 0.751, 0.652, 0.689, and 0.719 for lysine mono-, di-, tri-, and *-methylation respectively (Table 11). By contrast, with the same model under the traditional testing method, the average AUC is 0.846, 0.719, 0.780, and 0.817 for lysine mono-, di-, tri-, and *-methylation respectively (Table 10). Besides, we obtained nearly the same AUC ranking of the different experiments under the experiment-split testing method when using different models such as RF, indicating that the ranking is determined mainly by the data itself rather than by the model (Fig 8).
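The size-weighted averaging used above to summarize per-experiment AUCs is simply a weighted mean; the values in the test are made up for illustration, not taken from our tables.

```python
def weighted_average(values, weights):
    """Size-weighted mean, used here to average per-experiment AUCs
    while accounting for unequal test-set sizes."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total
```

Weighting by test-set size keeps small experiments from dominating the summary statistic.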