Floating search methodology for combining classification models for site recognition in DNA sequences

Recognition of the functional sites of genes, such as translation initiation sites, donor and acceptor splice sites and stop codons, is a relevant part of many current problems in bioinformatics. It is also a fundamental step of gene structure prediction in the most powerful programs. The best approaches to this type of recognition use sophisticated classifiers, such as support vector machines. However, with the rapid accumulation of sequence data, methods for combining many sources of evidence are necessary, as it is unlikely that a single classifier can solve this type of problem with the best possible performance. A major issue is that the number of possible models to combine is large, and using all of them is impractical. In this paper, we present a framework based on floating search for combining as many classifiers as needed for the recognition of any functional site of a gene. The methodology can be used for the recognition of translation initiation sites, donor and acceptor splice sites and stop codons. Furthermore, we can combine any number of classifiers trained on any species. The method is scalable to large datasets, as shown in experiments that use the whole human genome, and it is also applicable to other recognition tasks. We present experiments on the recognition of these four functional sites in the human genome, which is used as the target genome, with another 20 species as sources of evidence. The proposed methodology shows significant improvement over state-of-the-art methods in a thorough evaluation process. It is also able to improve heuristic selection of the species used as sources of evidence, as the search finds the most useful datasets.

Author summary

In this paper we present a methodology for combining many sources of information to recognize some of the most important functional sites in a genomic sequence.
The functional sites of the sequences, such as translation initiation sites, acceptor and donor splice sites and stop codons, play a very relevant role in many bioinformatics tasks. Their accurate recognition is an important task by itself and also as part of gene structure prediction programs. Our approach uses a methodology usually termed "floating search" in computer science. This is a powerful heuristic that is applicable when the cost of evaluating each possible solution is high. The methodology is applied to the recognition of four different functional sites in the human genome, using the annotated genomes of twenty other species as additional sources of evidence. The results show an advantage of the proposed method and also challenge the standard assumption that only genomes neither very close to nor very far from the human genome should be used to improve the recognition of functional sites in the human genome.

The recognition of functional sites within the genome is one of the most important problems in bioinformatics research. Determining where different functional sites, such as promoters, translation initiation sites (TISs), donors, acceptors and stop codons, are located provides useful information for many tasks [1]. For instance, the recognition of translation initiation sites, donors, acceptors and stop codons [2] is one of the most critical tasks for gene structure prediction. Many of the most successful gene recognizers that are currently in use implement an initial step of site recognition [3], which is followed by a process of combining the sites into meaningful gene structures. Accurate recognition is of the utmost importance for the whole gene structure prediction process. Actual sites that are not found by the classification models likely result in exons not being considered by the remaining steps of the recognition program. Furthermore, many false positives might inundate the second step, thereby making it difficult to predict gene structures accurately. State-of-the-art approaches use powerful classifiers, such as support vector machines (SVMs), and consider moderately large sequences around the functional site of interest [2,4-6]. In recent years, information about the genomes of many species has accumulated. This information can be used to improve the recognition of functional sites. However, the arbitrary selection of species using the widely assumed hypothesis that we must consider moderately distant evolutionary relatives is clearly a suboptimal procedure [7]. In addition, the classifier models are chosen a priori, without considering the possible benefits of combining various models. It would be more efficient to learn all of the available classification models and obtain the best combination using an automatic method.
The problem of finding the best combination can be tackled as a search problem over all possible combinations. An exhaustive search is unfeasible even for a small number of models. Other common search heuristics, such as evolutionary computation and swarm intelligence, are also prohibitively costly in terms of running time. In cases where those heuristics cannot be used, floating search is an inexpensive yet sufficiently powerful methodology that is able to achieve very good solutions. Floating search has been used when the cost of each search step is high [8]. Thus, in this work, we propose using floating search to obtain a near-optimal combination of classification models, in which we can consider as many sources of evidence as are available and use as many classifiers as needed. We use various floating search methods, namely, Sequential Forward Selection, Sequential Backward Selection, Plus-l Minus-r Selection, Sequential Forward Floating Selection, Sequential Backward Floating Selection, Random Sequential Forward Floating Selection and Random Sequential Backward Floating Selection. Although the first two methods are not actually floating search methods but sequential greedy approaches, we include them for completeness. To evaluate the proposed method, we show results for the recognition of the four functional sites cited above in five chromosomes of the human genome.

As stated in the introduction, our major aim is to develop a combination method for obtaining optimal, or near-optimal, subsets of classification models that are trained for site recognition in DNA sequences. Given a set of N trained classifiers, an exhaustive search would require the evaluation of 2^N − 1 combinations of models. This type of search is infeasible even for a small value of N. Therefore, we must use a search algorithm to find the best possible model combination efficiently. Many powerful metaheuristics are available in the machine learning literature, such as evolutionary computation [9], particle swarm optimization [10], ant colonies [11] and differential evolution [12]. However, all of these methodologies require the repetitive evaluation of many solutions to achieve their optimization goal. In the problem of site recognition, the evaluation of a possible solution is a costly process due to the large datasets that are involved. Thus, these metaheuristics are not feasible. Instead, we propose a simpler approach, namely, floating search, which has obtained successful results in other research fields, such as feature selection [13-16]. Floating search, which will be described in depth in the following section, is a set of stepwise search methods that are fast and efficient at solving problems in which the evaluation of many possible candidate solutions is too computationally expensive. The process for obtaining the best combination of classifiers for various species is composed of two steps: a training step and a validation step. Before starting the learning process, we need to obtain the training, testing and validation datasets. Without loss of generality, and to provide the necessary focus for our description, we use the same setup as in the experiments that are reported below. We address the problem of site recognition in the human genome.
To solve this problem, we use a test set of sites of a specific chromosome, which we denote as T.

The training set includes all of the remaining human chromosomes and the genomes of all of the species we choose to evaluate. For validation, we use one of the human chromosomes in the training set, which we denote as V, and remove it from the training set.

Floating search

As stated above, the use of complex heuristics for combining tens or hundreds of models would incur an infeasible computational cost. Thus, we propose the use of simpler, yet still powerful, heuristics. We state our problem as a search problem to enable the application of those heuristics. We have N trained classifiers C = {c1, c2, ..., cN}, which are trained using any type of sequence that could be useful and any genome that we consider interesting. Our aim is to obtain the subset of classifiers C′ ⊂ C that is the best possible combination. Evaluation of the combination of models is carried out using cross-validation. Thus, our objective function for maximization is the accuracy of the combination of classifiers over a validation set V, which is denoted as J(V).
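As a concrete illustration, the objective J(V) can be sketched as the validation accuracy of a sum-rule combination of the selected classifiers. The data layout (a dict of per-classifier output arrays) and all names below are our assumptions for the sketch, not the paper's code:

```python
import numpy as np

def J(subset, outputs, y_val):
    """Validation accuracy J(V) of a combination of classifiers.

    outputs: dict mapping classifier id -> array of scaled outputs
             in [-1, 1] over the validation set V (assumed layout).
    y_val:   array of true labels in {-1, +1} for V.
    """
    if not subset:
        return 0.0
    # Sum-rule combination: add the scaled outputs of the chosen models
    # and classify by the sign of the sum.
    combined = np.sum([outputs[c] for c in subset], axis=0)
    predictions = np.where(combined >= 0, 1, -1)
    return float(np.mean(predictions == y_val))
```

Any of the search methods described below only needs to call such a function on candidate subsets.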

Among the simplest methods, Sequential Forward Selection (SFS) [17] (see Algorithm 1) starts from an empty set of classifiers and, in every iteration, adds the classifier whose inclusion most improves J(V), stopping when no addition improves it. Conversely, Sequential Backward Selection (SBS) (see Algorithm 2) starts from the whole set of classifiers and, in every iteration, removes the classifier whose removal keeps J(V) highest, as long as J(V) does not decrease. These two methods can be generalized to add or remove r ≥ 1 classifiers in every iteration. These methods are fast and can obtain good results, but they have two major problems: they easily become trapped in local minima, and they suffer from the "nesting effect" [19]. The nesting effect means that the obtained solution of size M must contain the obtained solution of size M − 1, whereas the optimal solution of size M often does not contain the optimal solution of size M − 1.
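A minimal sketch of SFS and SBS over a generic objective (`score` stands in for J(V); the code is our illustration, not the authors' implementation):

```python
def sfs(candidates, score):
    """Sequential Forward Selection: greedily add the classifier that
    most improves the objective; stop when no addition improves it."""
    selected, best = [], score([])
    while True:
        gains = [(score(selected + [c]), c)
                 for c in candidates if c not in selected]
        if not gains:
            break
        top, c = max(gains)
        if top <= best:          # no improvement: stop (local optimum)
            break
        best, selected = top, selected + [c]
    return selected, best

def sbs(candidates, score):
    """Sequential Backward Selection: start from the full set and drop
    the classifier whose removal keeps the objective highest."""
    selected, best = list(candidates), score(list(candidates))
    while len(selected) > 1:
        options = [(score([x for x in selected if x != c]), c)
                   for c in selected]
        top, c = max(options)
        if top < best:           # removal would hurt: stop
            break
        best = top
        selected = [x for x in selected if x != c]
    return selected, best
```

Both routines evaluate the objective once per candidate move, which is what makes them affordable when each evaluation is expensive.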

Algorithm 1: Sequential Forward Selection (SFS).
Data: a set of trained classifiers C = {c1, c2, ..., cN} and a validation set V.
Result: the selected subset of classifiers Copt ⊂ C.

Algorithm 2: Sequential Backward Selection (SBS).
Data: a set of trained classifiers C = {c1, c2, ..., cN} and a validation set V.
Result: the selected subset of classifiers Copt ⊂ C.
The nesting problem can be avoided using the Plus-l Minus-r Selection (LRS) search method [20]. LRS adds backtracking capabilities by using SFS to add l models and SBS to remove r models. However, one major problem is that there is no rule for choosing the best values of l and r. The LRS method is shown as Algorithm 3.
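LRS can be sketched by alternating l forward steps with r backward steps; the bounded outer loop and the default values l = 2, r = 1 are our choices for illustration (`score` again stands in for J(V)):

```python
def lrs(candidates, score, l=2, r=1):
    """Plus-l Minus-r Selection: repeatedly add the l best classifiers
    (forward steps), then remove the r least useful ones (backward
    steps), tracking the best subset ever seen."""
    selected = []
    best_subset, best = [], score([])
    for _ in range(len(candidates)):             # bounded number of rounds
        for _ in range(l):                       # plus-l forward steps
            pool = [c for c in candidates if c not in selected]
            if not pool:
                break
            selected = selected + [max(pool,
                                       key=lambda c: score(selected + [c]))]
        for _ in range(r):                       # minus-r backward steps
            if len(selected) <= 1:
                break
            worst = max(selected,
                        key=lambda c: score([x for x in selected if x != c]))
            selected = [x for x in selected if x != worst]
        if score(selected) > best:
            best, best_subset = score(selected), list(selected)
    return best_subset, best
```

With l > r the selected set grows over the rounds while the backward steps undo unhelpful additions, which is the backtracking ability that plain SFS lacks.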

A more advanced approach is floating search. In floating search, we let the size of the solution "float" and adapt to the problem using a backtracking mechanism. In this way, Sequential Forward Floating Selection (SFFS) and Sequential Backward Floating Selection (SBFS) [8] avoid the nesting effect: after every forward (respectively, backward) step, conditional backward (respectively, forward) steps are applied while they improve J(V). Somol et al. [13] proposed an adaptive version for feature selection in which the number of models to add or remove is incremented as the search approaches the desired solution size.

Algorithm 4: Sequential Forward Floating Selection (SFFS).

Data: a set of trained classifiers C = {c1, c2, ..., cN} and a validation set V.
Result: the selected subset of classifiers Copt ⊂ C.
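SFFS interleaves forward steps with conditional backward (backtracking) steps. The sketch below is a simplified illustration of that mechanism (for instance, it never backtracks below two models), not the paper's exact pseudocode; `score` stands in for J(V):

```python
def sffs(candidates, score):
    """Sequential Forward Floating Selection: after each forward step,
    keep removing classifiers while doing so improves the objective."""
    selected, best_subset, best = [], [], score([])
    while True:
        # Forward step: add the classifier with the highest resulting score.
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        top, c = max((score(selected + [c]), c) for c in remaining)
        if top <= best:
            break                     # no addition helps: stop
        selected = selected + [c]
        best, best_subset = top, list(selected)
        # Conditional backward steps: backtrack while removal improves J.
        while len(selected) > 2:
            top, c = max((score([x for x in selected if x != c]), c)
                         for c in selected)
            if top <= best:
                break
            selected = [x for x in selected if x != c]
            best, best_subset = top, list(selected)
    return best_subset, best
```

The test below uses a score table where the best pair is not nested inside the best singleton's greedy path, so plain SFS would stop at an inferior subset while SFFS backtracks out of it.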

As we are combining various models, there are many ways of combining their outputs. We use three simple combination methods, as our major aim is efficient execution; although more complex approaches exist [22], their advantage is not large due to over-fitting problems. These methods are: i) the sum of the outputs of the classifiers; ii) majority voting; and iii) the maximum output, in which the sequence is classified using only the model with the highest output. In the machine learning literature, combining different sources of evidence for a classification problem is a common task [23]. Although various sophisticated methods have been developed for combining many classifiers [24-27], in practice none of them is able to significantly outperform the simpler methods on a regular basis.

Two problems arise when combining many different classification models that are trained on different datasets: their outputs may not lie in the same range, and the optimal classification threshold may differ for each model. The problem of the different ranges is solved by scaling all of the outputs to the interval [−1, 1].

Regarding the threshold, we obtain the optimal threshold for each model, which is denoted as Θopt, using the validation set; this threshold is applied whenever the model is included in a combination. For the training stage, we can select as many species as we deem useful for our problem. We need not select only the most appropriate species, because the floating search will discard the useless classifiers. Once we have selected the set of species whose genomes we are going to use, we train as many classifiers as we want from those species. For every organism, we can train various classifiers, such as support vector machines (SVMs), neural networks (NNs), decision trees (DTs) and the k-Nearest Neighbor (k-NN) rule, as well as the same classifiers with different parameters. Because the validation stage can consider hundreds of classifiers, any method of potential interest can be used. Again, the floating search process will remove unneeded classifiers. (In our experiment, the initial subset for the random floating search methods was obtained by selecting each classifier with a probability of 0.5.) The species were selected to consider a wide variety of organisms whose genomes are fully annotated.
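Output scaling and per-model threshold selection can be sketched as below; min-max scaling and the grid of candidate thresholds are our assumptions, as the paper does not specify them:

```python
import numpy as np

def scale_outputs(raw):
    """Linearly map a model's raw outputs to the interval [-1, 1]."""
    lo, hi = raw.min(), raw.max()
    if hi == lo:                      # constant output: map everything to 0
        return np.zeros_like(raw, dtype=float)
    return 2.0 * (raw - lo) / (hi - lo) - 1.0

def optimal_threshold(scores, y_val, grid=np.linspace(-1, 1, 201)):
    """Pick the threshold Theta_opt that maximizes validation accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in grid:
        acc = np.mean(np.where(scores >= t, 1, -1) == y_val)
        if acc > best_acc:
            best_t, best_acc = float(t), float(acc)
    return best_t, best_acc
```

In practice the threshold would be tuned on the validation chromosome V, so that each model enters a combination with its own operating point.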

Five classifiers were trained from every dataset for the four functional sites: a 168 decision tree, a k-nearest neighbor rule, a positional weight matrix, a support vector 169 machine with a string kernel and a support vector machine with a spectrum kernel.

Additionally, for TIS and stop codon recognition, we used the stop codon method [28]. 171 The parameters for every classifier were obtained using 10-fold cross-validation.

To evaluate our approach, we used five human chromosomes for testing purposes. One of the key aspects of the evaluation of any newly proposed method is the set of previous methods that are considered in the comparison. Among the many methods that have been proposed, we considered SVMs with the weighted degree (WD) kernel, the WD kernel with shifts [33] (WDS) and the spectrum kernel [34]. SVMs with WD kernels consistently provided the best results; thus, we chose this method to be compared with our proposed method. WDS provided marginally better results than WD, but with a far higher computational complexity. To ensure a fair comparison, we considered not only these methods but also all of the others that were used as classifiers. Then, for every experiment, we compared our approach to the best performing method in terms of the validation performance. The SVM with the WD kernel was always the best individual classifier.

Another key parameter of the learning process is the window around the functional site that is used to train the classifiers. An additional advantage of our approach is that it allows the use of a suitable window for each dataset and even the combination of models that are trained using different windows. To train the models, we used random undersampling [35], because previous studies have demonstrated its usefulness for TIS recognition [31]. For random undersampling, we used a ratio of 1, which means that the majority class was randomly undersampled until both classes had the same number of instances. To avoid any contamination of the experiments, for every training set, regardless of the species, we removed the genes that were shared with the test chromosome.
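Random undersampling with a ratio of 1 can be sketched as follows (labels assumed to be in {−1, +1}):

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class instances until both classes have
    the same number of instances (undersampling ratio of 1)."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    if len(pos) < len(neg):
        neg = rng.choice(neg, size=len(pos), replace=False)
    else:
        pos = rng.choice(pos, size=len(neg), replace=False)
    keep = np.sort(np.concatenate([pos, neg]))
    return X[keep], y[keep]
```

In site recognition the negative class (non-sites) vastly outnumbers the positives, so in practice this step discards most of the negative sequences before training.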

We measured the sensitivity, Sn = TP/(TP + FN), and the specificity, Sp = TN/(TN + FP), of the classifiers. The geometric mean of these two measures, G-mean = √(Sp · Sn), was our first classification metric. As a second measure, we used the area under the receiver operating characteristic (ROC) curve (auROC). However, auROC is independent of the class ratios and can be less meaningful when we have very unbalanced datasets [6]. In such cases, the area under the precision-recall curve (auPRC) can be used. The recall measure is equivalent to the sensitivity measure defined above, and the precision (P) is given by P = TP/(TP + FP). The auPRC measure is especially relevant if we are mainly interested in the positive class. However, auPRC can be very sensitive to subsampling. In our results, we use all of the positive and negative instances of each of the five tested chromosomes; thus, no subsampling is used. This also yields small auPRC values. We use these three metrics because they provide complementary views of the performance of the classifiers. The recognition of sites is usually a first step within a larger task, such as a gene structure prediction program. Therefore, depending on the subsequent steps, our focus was centered on obtaining models that perform well in terms of each of these measures. [...] The low auPRC values for all methods are a consequence of this evaluation setup. For G-mean, the results also showed a clear advantage of our method, with an improvement of over 5% in the worst case.
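These classification measures reduce to simple confusion-matrix arithmetic (assuming, as the text implies, that Sp denotes the true negative rate and is distinct from the precision P):

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, their geometric mean and precision."""
    sn = tp / (tp + fn)               # sensitivity (= recall)
    sp = tn / (tn + fp)               # specificity
    g_mean = math.sqrt(sp * sn)       # G-mean = sqrt(Sp * Sn)
    precision = tp / (tp + fp)        # P = TP / (TP + FP)
    return sn, sp, g_mean, precision
```

The geometric mean penalizes any imbalance between the two rates: a classifier that labels everything negative gets Sn = 0 and therefore G-mean = 0, however high its specificity.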

The reported reduction is relevant because most current gene recognizers rely heavily on the classification of sites as a basic step; therefore, it is very likely that genes whose TIS is not recognized would be completely missed by any gene recognizer. Our approach thus has the potential to significantly improve the accuracy of any gene recognizer. [...] Ornithorhynchus anatinus. Second, G-mean and auROC optimization required more models, whereas auPRC used significantly fewer models for TIS prediction. The models that were selected for every optimized measure showed a large variety, thereby supporting the claim of our work that as many genomes as available should be used instead of selecting some of them a priori. [...] The models that were selected for every chromosome are shown in [...]. Our approach improved the results by more than 4% and, in the best case, by more than 6%. These improvements were also achieved for the auPRC and G-mean measures.

The numbers of selected models, classifiers and genomes for every case are shown in Table 9. G-mean, as in the previous results, was the measure that required the fewest models, from 2 for chromosome 13 to 6 for chromosomes 1, 3 and 21. auROC selected from 7 to 15 models. Again, auPRC required a comparatively large number of models, from 31 to 58. Caenorhabditis elegans was the only genome that was never used. The use of the genomes was more balanced for the recognition of stop codons, even using genomes that are far removed from the human genome, such as those of Takifugu rubripes and Danio rerio. The use of classifiers was also more equally distributed among the six methods, with the exception of PWM, which was never used.

With respect to the three objectives, optimizing G-mean required the fewest models, from 2 to 6. For the five chromosomes, the SVM method for Macaca mulatta and Pan troglodytes was always selected; Callithrix jacchus and Canis lupus familiaris were also selected in most chromosomes. For auROC, more models were selected, from 7 to 15.

The SVM method for Macaca mulatta and Pan troglodytes was always chosen, but the remaining methods depended on the chromosome. This is another interesting result, because most stop codon recognition programs rely on common models for any task.

Finally, for auPRC, significantly more models were selected, from 31 to 58, with significant variation among the chromosomes. [...] We used a simple gene predictor that searched for exons using the sites that were found by the recognition program, which was either the standard approach or our proposed method, and constructed a gene from these exons. This simple program is not intended for gene structure prediction but only to test the ability of our proposed method to improve gene recognition.

To evaluate gene predictor performance over a test sequence, the predicted gene structure is compared with the annotated gene structure of the target sequence. At the nucleotide level, we use the correlation coefficient (CC):

CC = (TP · TN − FP · FN) / √(PP · PN · AP · AN),

where PP is the number of predicted positives, AP the actual positives, PN the predicted negatives and AN the actual negatives. We also calculate the Average Conditional Probability (ACP) measure:

ACP = (1/4) (TP/(TP + FN) + TP/(TP + FP) + TN/(TN + FP) + TN/(TN + FN)).

At the exon level, an exon is considered to have been correctly predicted when both boundaries are correctly predicted. If a predicted exon contains at least one actual base, it is considered a partially correct exon. At the exon level, we show Sp, Sn and the numbers of missed exons (ME), which are exons that are not found by the program, and wrong exons (WE), which are predicted exons that do not correspond to any actual exon. As a representative of our proposed method, we used the model that was obtained when optimizing G-mean, as the previous section showed that it achieved the best overall behavior.

In this paper, we presented a floating-search-based strategy for functional site recognition in genomic sequences. The use of floating search enables an efficient search for the best combination among more than a hundred classification models that are trained on the genomes of many species. The presented approach can also be used for other combination tasks.
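Both nucleotide-level measures can be computed directly from the confusion matrix; this sketch omits zero-division guards for degenerate test sets:

```python
import math

def cc_and_acp(tp, fp, tn, fn):
    """Correlation coefficient (CC) and Average Conditional
    Probability (ACP) at the nucleotide level."""
    pp, pn = tp + fp, tn + fn         # predicted positives / negatives
    ap, an = tp + fn, tn + fp         # actual positives / negatives
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * ap * an)
    acp = 0.25 * (tp / ap + tp / pp + tn / an + tn / pn)
    return cc, acp
```

Both measures equal 1 only for a perfect prediction, and ACP averages the four conditional probabilities so that no single class or prediction direction dominates the score.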

The proposed method also enabled the optimization of various performance 527 measures. In the reported experiments, we showed results on searching for the best 528 combination that optimizes three measures: auROC, auPRC and G-mean. The