Neural network and random forest models in protein function prediction

Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system that automatically assigns Gene Ontology (GO) terms to a given input protein sequence. The system combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both the RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with an NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models, together with the data pre-processing and feature generation tools, are publicly available as open source software at https://github.com/TurkuNLP/CAFA3

Author summary

Understanding the role and function of proteins in biological processes is fundamental for new biological discoveries. Whereas modern sequencing methods have led to a rapid growth of protein databases, the function of these sequences is often unknown and expensive to determine experimentally. This has spurred a lot of interest in the predictive modelling of protein functions. We develop a machine learning system for annotating protein sequences with functional definitions selected from a vast set of predefined functions. The approach is based on a combination of neural network and random forest classifiers, with features covering structural and taxonomic properties and sequence similarity. The system is thoroughly evaluated on a large set of manually curated functional annotations and shows competitive performance in comparison to other suggested approaches. We also analyze the predictions for different functional annotation and taxonomy categories and measure the importance of different features for the task. This analysis reveals that the system is particularly efficient for bacterial protein sequences.

Proteins play a pivotal role in many processes of living organisms, including, but not limited to, signal transduction, transmembrane transport and structural support. Determining protein functions experimentally is an expensive and labor-intensive undertaking. With the increasing number of sequences produced by high-throughput sequencing methods, there is an urgent need for computational methods to assist in protein function annotation. Over the past decade, a research community focusing on automated function prediction (AFP) has formed, resulting in a number of AFP systems and the regular Critical Assessment of Functional Annotation (CAFA) challenge. CAFA is a shared task organized by the AFP Special Interest Group, aiming to establish a common platform and evaluation methods for measuring the performance of automated systems for the AFP task [1,2]. In CAFA, each participating research group is asked to predict the functions for a large set of proteins, with more than 100,000 individual sequences, under a tight time limit. Subsequently, the organizers contract laboratories to experimentally verify functional annotations for a subset of the sequences, usually within half a year after the predictions were submitted. The resulting verified annotations constitute a new test set, against which the predictions made by the participants can be evaluated.

In CAFA, the protein functions are assigned from the controlled vocabulary of terms defined in the Gene Ontology (GO), i.e. the task is to annotate the given protein sequences with the relevant GO terms that describe their function. In addition to molecular functions (MF), GO includes terms related to the cellular components (CC) in which the proteins are active, and the biological processes (BP), such as pathways, in which the proteins participate [3,4]. Altogether, over 40,000 terms, organized in a hierarchy, exist in the current version of GO. This translates into a multi-class, multi-label classification task, where each protein sequence can be annotated with multiple terms from this large vocabulary, with strong statistical dependencies between the terms. Further, the distribution of GO terms is highly skewed in several distinct ways, as demonstrated in Swiss-Prot, a subset of the UniProt protein database [5] manually curated with GO terms. While, for instance, human proteins are densely annotated with 20 GO terms on average, a full 36% of GO terms do not have a single annotated example, and 18% have only one. This means that a proportionally small number of unique GO terms accounts for a large proportion of the annotations. All these factors combined make AFP a very challenging task from the machine learning perspective.

The first two CAFA challenges have seen a variety of approaches applied to the problem [1,2]. Among the top-performing systems, the most common approach was annotation transfer by homology, combined with a statistical or machine-learned scoring function. Cozzetto et al. [6] use a scoring function to combine and rank the predictions from various biological data analyses, including PSI-BLAST [7], profile-profile similarity search and the HMMER [11] tools. In CAFA2, GO-FDR [12], one of the best-performing systems on all three GO ontologies, calculates the probability of a protein being associated with a target GO term, using predictions from the PSI-BLAST tool. Recently, You et al. [13] suggested an approach based on an ensemble of logistic regression models, which resulted in the best overall performance among the participating teams in the CAFA3 challenge. For each GO term, a set of three logistic regression models is independently trained based on structural information from InterPro [14], biophysical attributes from ProFET [15] and amino acid n-gram features. Sequence alignment and GO annotation frequencies are used as additional features. All this information is aggregated by a separate machine learning (ML) model in a learning-to-rank setting.

Deep neural network architectures have been successfully applied to bioinformatics problems, such as fold recognition, functional classification and protein design [16,17].

For the protein function prediction task, several architectures have been developed, aiming to replace hand-crafted features with ones directly extracted from the sequences. DeepGO [18] uses a deep convolutional neural network (CNN) architecture composed of 10 hidden layers, with inputs formed from amino acid tri-grams and the embedding (continuous vector representation) of the protein induced using its neighborhood in the STRING [19] database graph. ProLanGO [20] uses a neural machine translation system based on a recurrent neural network to "translate" proteins into the corresponding GO terms. Here the protein sequences are first converted to non-overlapping amino acid k-mers of length 3 to 5, forming the input sequence for the model. Rui et al. [21] use multi-task deep neural networks with features directly derived from the amino acid sequences as well as external analysis tools, focusing on human proteins only. In general, these deep neural network models have demonstrated competitive performance, on par with other machine learning approaches, while requiring less feature engineering effort.

Random forests, ensembles of decision trees, have been used in a wide array of prediction tasks, including post-translational modification site prediction [22], fold recognition [23], SCOP structural classification [24], protein-protein interaction site prediction [25], and enzyme function classification [26]. Most models derive protein-related features, such as hydrophobicity, secondary structure, and amino acid composition, both by directly extracting the features from the sequence and by relying on established sequence analysis tools. Random forests have been a popular machine learning method in bioinformatics due to their modeling simplicity and the interpretability of their results through feature importance analysis.

In this paper, we introduce our system based on neural network and random forest classifiers, both relying on a rich set of features and, individually, achieving competitive performance. Further, through an evaluation on our internal test dataset, we show that these individual approaches strengthen each other's performance, resulting in an ensemble that outperforms its individual components. The official CAFA3 evaluation places our ensemble system in the top 10 overall, out of 68 participating teams and 144 submitted systems, with particularly strong performance on the molecular function and cellular component categories of prokaryotic proteins, where the system placed 3rd and 2nd, respectively [27].

In this section we describe the protein datasets, the sequence analyses used for generating features, and the details of the neural network and random forest classifiers and the ensemble systems.

Protein dataset

As the training data, we selected only those Swiss-Prot proteins whose function annotation is based on reliable experimental evidence, which we define as the evidence codes EXP, IDA, IPI, IMP, IGI and IEP, the author statement code TAS, or the curatorial statement evidence code IC being assigned to the annotation. This restriction aims to discard annotations stemming from high-throughput and other noise-prone annotation methods. In addition to the exact GO terms manually assigned to a particular protein, we also include their ancestor terms in the GO hierarchy. In the following, we summarize the sequence-based features as used by the various classifiers throughout the paper.

BLAST features

We use the Protein-Protein BLAST (BLASTP) program from a locally installed NCBI-BLAST+ version 2.5.0 [28] to search for similar proteins in the full Swiss-Prot database. An E-value of 0.001 is used as the inclusion threshold. We query with both the BLOSUM45 and BLOSUM62 [29] scoring matrices, resulting in two distinct sets of similar proteins for each query sequence. These are referred to as blast45 and blast62 hereafter. We use the default gap scores for each of the scoring matrices.

We also use DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST) for searching distantly related protein sequences [30]. DELTA-BLAST increases the sensitivity of BLAST-based sequence similarity search by constructing position-specific score matrices from the Conserved Domain Database (CDD) [31]. BLOSUM62 is used as the scoring matrix with 0.001 as the cut-off E-value.
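As a rough illustration of this setup, the following sketch shows how the two BLASTP searches could be invoked with NCBI-BLAST+; the database name, file paths and tabular output format are our own assumptions, not taken from the original pipeline.

```python
# Hypothetical invocation of the two BLASTP searches described above, assuming
# a local NCBI-BLAST+ installation and a pre-built Swiss-Prot database named
# "swissprot". Output naming and format are illustrative choices.
import subprocess

def run_blastp(query_fasta, matrix, out_tsv):
    """Run BLASTP against Swiss-Prot with the given scoring matrix."""
    subprocess.run([
        "blastp",
        "-query", query_fasta,
        "-db", "swissprot",
        "-matrix", matrix,                      # BLOSUM45 or BLOSUM62
        "-evalue", "0.001",                     # inclusion threshold
        "-outfmt", "6 qseqid sseqid bitscore",  # tabular: query, hit, HSP score
        "-out", out_tsv,
    ], check=True)

# Two distinct hit sets per query sequence: blast45 and blast62.
run_blastp("queries.fasta", "BLOSUM45", "blast45.tsv")
run_blastp("queries.fasta", "BLOSUM62", "blast62.tsv")
```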

All BLAST data was converted to feature vectors with the following approach: the UniProt ID of the matched protein is used as the feature name and the HSP (High-scoring Segment Pair) score as its value.

InterProScan features

We use the InterProScan tool to map the protein sequences to InterPro profiles and, subsequently, to GO terms for those cases where a mapping between an InterPro profile and GO is established in the database. These mappings are, however, neither up-to-date nor complete. To increase the coverage of the mappings, we trace the patterns and signatures to InterPro's upstream databases and recover the GO terms from there. Features are generated from the matching GO terms (or a special feature is produced, signalling the absence of such a mapping), such that scores given by InterProScan are converted to numerical features, while the GO terms and profile accession identifiers are converted to binary features.
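The following is a minimal sketch of the BLAST-to-feature-vector conversion described above (matched UniProt IDs as feature names, HSP scores as values); the input format follows the hypothetical BLASTP invocation sketched earlier, and the use of scikit-learn's DictVectorizer is our own choice.

```python
# Each query protein is mapped to a {hit_uniprot_id: hsp_score} dictionary,
# then vectorized into a sparse matrix. Input is the assumed three-column
# tabular output (query id, hit id, HSP bit score) from the earlier sketch.
import csv
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer

def blast_hits_to_features(tsv_path):
    features = defaultdict(dict)
    with open(tsv_path) as f:
        for query_id, hit_id, score in csv.reader(f, delimiter="\t"):
            # Keep the best-scoring HSP per (query, hit) pair.
            prev = features[query_id].get(hit_id, 0.0)
            features[query_id][hit_id] = max(prev, float(score))
    return features

features = blast_hits_to_features("blast62.tsv")
vectorizer = DictVectorizer(sparse=True)
X_blast62 = vectorizer.fit_transform(features.values())  # sparse feature matrix
```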

Taxonomy features
The NCBI Taxonomy is a manually curated database of names and taxonomic lineages for organisms within the scope of the International Nucleotide Sequence Database Collaboration (INSDC) [33]. A binary feature is generated for each node in the NCBI Taxonomy and subsequently assigned to each protein based on its organism of origin.
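The sketch below illustrates, under our own assumptions, how such binary lineage features could be generated; the lineage mapping itself would be parsed from the NCBI Taxonomy dump files, which is omitted here.

```python
# One binary feature per NCBI Taxonomy node on the lineage of the protein's
# source organism. `lineage` (taxid -> list of ancestor taxids, including the
# taxid itself) is assumed to be built from the NCBI Taxonomy dump files.
def taxonomy_features(taxid, lineage):
    return {f"taxonomy:{node}": 1 for node in lineage[taxid]}

# Example with the lineage of Escherichia coli (taxid 562) up to the root.
lineage = {562: [562, 561, 543, 91347, 1236, 1224, 2, 131567, 1]}
print(taxonomy_features(562, lineage))
```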
Sequence features

We use several additional tools to analyze the protein sequence and provide features potentially relevant to its function.

For nuclear localization, we use NucPred [34] to predict whether a protein enters the nuclear compartment. This analysis applies to eukaryotic sequences only, as prokaryotes do not have a nucleus. For post-translational modifications, we use NetAcet [35] to predict proteins which are acetylated by N-acetyltransferase A (NatA). For GPI-anchored proteins, we use PredGPI [36], which is based on a support vector machine and a hidden Markov model predicting the anchoring signal and the most probable omega-site. The predictions of these three tools are encoded as numeric features.

Amino acid index

The amino acid index is a set of numerical values characterizing the physicochemical and biochemical properties of each of the 20 amino acids. It has been used for numerous structure-function prediction tasks, e.g. human protein sub-cellular localization [37]. In this work, we obtain 544 amino acid indices from the Amino Acid Index database [38].

In all experiments, the input data for each protein is converted into numerical feature vectors and the GO terms into binary label vectors. We subsequently randomly divide the training data into three parts: the training set used for training the classifiers, the validation set used for hyperparameter optimization, and the test set used for the final performance estimation. The validation set is also used during system development and for experiments, including feature selection, in order to avoid overfitting the test set. The training, validation and test subsets contain 60%, 20% and 20% of the whole data, respectively.
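A small sketch of this data preparation step using scikit-learn follows; the toy feature matrix and annotation lists are placeholders for the real data.

```python
# Binarize per-protein GO annotations into label vectors and split the data
# 60/20/20 into training, validation and test sets. X and go_annotations are
# toy stand-ins for the real feature matrix and annotations.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

X = np.random.rand(10, 5)
go_annotations = [["GO:0005524", "GO:0016301"], ["GO:0005634"]] * 5

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(go_annotations)  # binary label matrix, one column per term

# First split off 40% of the data, then halve it into validation and test.
X_train, X_rest, Y_train, Y_rest = train_test_split(X, Y, test_size=0.4, random_state=0)
X_val, X_test, Y_val, Y_test = train_test_split(X_rest, Y_rest, test_size=0.5, random_state=0)
```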
Different feature groups were tested in combination to select the ones that gave the best classification performance. Finally, for computational reasons, the classifiers are trained to predict only the 5,000 most common terms, a subset of GO which covers over 94% of all GO annotations in Swiss-Prot. Note that in testing, the classifiers are naturally evaluated on the full set of GO terms, so the restriction to 5,000 terms does not artificially overestimate the performance.

We evaluate the performance of the systems using the F-score, unlike in the official CAFA evaluation, where the maximal F-score, calculated from precision-recall curves, is used as the primary metric. Thus, our internal evaluation is more strict and acts as a conservative lower bound for the official CAFA evaluation.

Random forest

The random forest classifier was chosen for its good classification performance and its support for multi-label classification on thousands of labels. The classifier was implemented using the scikit-learn library, version 0.18.1 [41].
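As an illustrative sketch continuing the preparation example above (hyperparameters and variable names are our assumptions), the label matrix can be restricted to the most common terms and a multi-label random forest trained and scored with scikit-learn:

```python
# Restrict training to the k most common GO terms and fit a multi-label random
# forest; scikit-learn's RandomForestClassifier accepts a binary indicator
# matrix as the target. Hyperparameters here are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

k = 5000  # 5,000 in the paper; capped at the label count of the toy data
top_terms = np.argsort(Y_train.sum(axis=0))[::-1][:k]

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf.fit(X_train, Y_train[:, top_terms])

Y_pred = rf.predict(X_val)  # binary predictions over the selected terms
print(f1_score(Y_val[:, top_terms], Y_pred, average="micro"))
```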

Convolutional neural network

While the previous two methods can be seen as traditional classifiers relying on a carefully selected set of features, the third method, the convolutional neural network, departs from this paradigm in not being presented with any complex features. Rather, the neural network receives the sequence itself as its sole input.

Convolutional neural networks (CNNs) have been successfully applied to sequential and multidimensional prediction tasks, especially in computer vision and text classification [42,43]. In biological sequence analysis, they have been applied specifically to protein secondary structure prediction [44]. The strength of CNNs arises from the possibility of detecting fuzzy patterns locally in the input data; in our case, we expect the convolutional kernels to learn to detect amino acid n-grams relevant to a certain protein function. Many suggested systems for AFP rely on structural information, such as the presence of a certain domain or motif. However, this type of information is mostly gathered from other tools such as InterProScan, or by simply looking at the amino acid n-grams present in a given sequence, resulting in extremely sparse feature representations [18]. To overcome this issue, our neural network learns a latent feature vector (embedding) for each unique amino acid, and a protein is represented as the sequence of these vectors, with similar amino acids leading to similar embeddings. To ease the task of learning these embeddings, we attempted to initialize the weights with the Amino Acid Index properties, but this did not have a significant influence on the model performance.

We use convolution window sizes of 3, 9, 27 and 81 amino acids and learn 50 kernels for each window size. The shortest window size of 3 is a common length used in amino acid n-gram features, whereas size 9 approximates the length of local segments analyzed in protein secondary structure prediction [44]. The larger window sizes of 27 and 81 should in turn be able to detect motifs and shorter domains [45].
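A minimal sketch of such an architecture in Keras follows; this is not the authors' exact implementation. The sequence length, embedding dimension and dense layer size are our assumptions, while the window sizes and kernel counts follow the description above.

```python
# Sequence-only CNN: learned amino acid embeddings, parallel convolutions with
# window sizes 3, 9, 27 and 81 (50 kernels each), global max pooling, and a
# sigmoid output for multi-label GO prediction.
from tensorflow.keras import layers, models

MAX_LEN = 2000  # proteins padded/truncated to this length (assumption)
VOCAB = 26      # amino acid alphabet plus padding symbol (assumption)
N_TERMS = 5000  # the 5,000 most common GO terms

inp = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB, 50)(inp)  # latent vector per amino acid

# Global max pooling keeps the strongest activation of each kernel,
# regardless of where in the sequence the detected pattern occurs.
pooled = []
for window in (3, 9, 27, 81):
    conv = layers.Conv1D(filters=50, kernel_size=window, activation="relu")(emb)
    pooled.append(layers.GlobalMaxPooling1D()(conv))

hidden = layers.Dense(600, activation="relu")(layers.Concatenate()(pooled))
out = layers.Dense(N_TERMS, activation="sigmoid")(hidden)  # multi-label output

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```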

Ensemble

As the final, combined output of the abovementioned methods, we take the union of their predictions. We evaluate all the model combinations and use the subset of models that leads to the best performance, measured in terms of F-score, as the final system. For completeness, we also report results obtained using the intersection of the predictions.
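Operationally, the union and intersection ensembles reduce to elementwise boolean operations over the binary prediction matrices, as in this sketch (matrix shapes and toy values are illustrative):

```python
# Union (OR) and intersection (AND) of binary prediction matrices of equal
# shape (proteins x GO terms) from the individual classifiers.
import numpy as np

def ensemble_or(*preds):
    """A protein-term pair is predicted if any member predicts it."""
    return np.logical_or.reduce(preds).astype(int)

def ensemble_and(*preds):
    """A protein-term pair is predicted only if all members predict it."""
    return np.logical_and.reduce(preds).astype(int)

rf_pred = np.array([[1, 0, 0], [0, 1, 0]])
ffn_pred = np.array([[1, 1, 0], [0, 0, 0]])
print(ensemble_or(rf_pred, ffn_pred))   # favors recall
print(ensemble_and(rf_pred, ffn_pred))  # favors precision
```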
In this section, we evaluate the methods from several distinct angles as well as report on 275 the ranking of the ensemble system in the CAFA3 challenge.

Feature group selection
Feature selection is the process of selecting a subset of relevant features, or removing irrelevant features, in order to simplify the system, reduce the training time and improve the generalization of the models [47]. We performed feature selection by repeatedly removing one feature group at a time, so as to increase the overall performance on the development dataset, stopping when the performance of the classifier no longer increased (see S1 Table for detailed results). The optimal performance of the random forest classifier on the development data was achieved by removing the blast62, netacet and nucpred feature groups. The same feature subset is also used for the neural model. Among the remaining features, taxonomy contributes the most to the system performance, i.e. removing the taxonomy features results in a substantial drop in F-score.
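The following sketch outlines this greedy backward elimination; `train_and_score` is a hypothetical helper that trains a classifier on the given feature groups and returns the micro-averaged F-score on the validation set, and the group names are illustrative labels for the feature groups described in Methods.

```python
# Greedy backward elimination over feature groups: in each round, drop the
# group whose removal most improves the validation F-score; stop when no
# removal helps. `train_and_score(groups)` is a hypothetical helper.
def select_feature_groups(all_groups, train_and_score):
    groups = set(all_groups)
    best = train_and_score(groups)
    while len(groups) > 1:
        drop = None
        for g in sorted(groups):
            score = train_and_score(groups - {g})
            if score > best:
                best, drop = score, g
        if drop is None:  # no removal improved the score
            break
        groups.discard(drop)
    return groups, best

feature_groups = ["blast45", "blast62", "deltablast", "interproscan",
                  "taxonomy", "netacet", "nucpred", "predgpi", "aaindex"]
```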
Table 1. Classifier and ensemble performance. All ensemble combinations of the three methods: the random forest classifier (RF), the feedforward neural network (FFN) and homology transfer (HT), using either the intersection (AND) or the union (OR) of the predictions. Performance is evaluated as micro-averaged F-score (F), precision (P) and recall (R).

The performances of all the approaches are shown in Table 1. All classifiers, except for the CNN, surpass HT, which is based solely on inferring known protein functions from similar sequences. FFN is the best-performing method with an F-score of 0.480, followed by RF with 0.424.

The union of the predictions of the individual methods outperforms each method individually. The best result in terms of F-score is achieved by combining the predictions of FFN and RF. This demonstrates that the classifiers, even though provided with the same features, learn different aspects of the task. Also, as the RF classifier operates at a high-precision, low-recall point, it adds a small number of predictions which are, on average, more likely to be correct than those of FFN, thereby benefiting the overall numerically stronger FFN method. Even though adding the predictions from HT improves the F-score of both the RF (+4.7pp) and FFN (+0.7pp) methods in isolation, it no longer improves the F-score of the RF-FFN ensemble. As expected, the intersection-based ensemble produces a numerically inferior F-score, as it drastically decreases the already low recall of the models. Nevertheless, this low recall is matched with a comparatively high precision, which could be a desirable property in some applications.

Finally, the CNN method has the lowest performance of the four methods in isolation, and also decreases the performance of all tested ensembles. It is therefore excluded from the final system.

The ultimate goal of automated function prediction is to develop universal and reliable algorithms capable of predicting the function of any protein sequence. This is of course a very difficult task. As shown in all CAFA challenges [1,2,27], the accuracy of the predictions varies greatly among organisms and the different ontologies. To look beyond overall system performance, we next focus on the results of our methods along these two aspects: ontologies and organisms.

There are 10 organisms with over 1,000 manually annotated sequences.

Considering the different ontologies, predicting biological processes remains a challenge for the methods, compared to cellular components and molecular functions. As shown in Fig 3, all of the ensembles exhibit the highest performance when predicting cellular components, followed by molecular function and biological process, a trend common to many AFP methods [2,13,18]. The difficulty of predicting biological process terms is probably due to the fact that these terms are the least correlated with sequence similarity [48,49]. Thus, using only sequence similarity to infer functions can be insufficient or misleading; e.g. paralogs, which arise from evolutionary gene duplication, are often recruited to different pathways [50].

We have presented methods for automated protein function prediction, evaluated both on Swiss-Prot and through the CAFA3 challenge. These methods have demonstrated competitive performance among more than 100 CAFA3 entries, especially in the prediction of prokaryotic molecular functions and cellular components.

Nevertheless, the absolute performance of the AFP methods shows that the task remains a challenge.

We can chart two main directions for further development. Firstly, features derived from sequences related in other ways than solely through homology, e.g. through co-expression or binding, can be potentially beneficial, especially for the prediction of biological processes, as demonstrated for instance by Piovesan et al. [51] and Kulmanov et al. [18]. Of the three GO ontologies, biological process currently exhibits the lowest absolute performance, and is therefore the most impactful target for further development.

Secondly, structure- and sequence-based features without doubt play an important role in determining the function of a protein. However, having to employ the large number of external tools needed to obtain the relevant features is a surprisingly tedious task. As a potential remedy to this practical problem, but also as a research task in its own right, we experimented with using a CNN to derive features directly from the amino acid sequence. Our results show, however, that training a well-performing neural model in this manner is non-trivial. As future work, we plan to improve our methods by testing other neural network architectures [18,21] and by pretraining the protein sequence encoders in a similar fashion as is common in neural computer vision and natural language processing systems [52,53].

S1 Table. Feature group selection. Feature groups are selected by removing one more group at each round (RM). Performance is shown as micro-averaged F-score (F), precision (P) and recall (R), evaluated for the top 5,000 terms. After removing three groups, performance no longer increases. Performance is measured on the optimization set. Performance on the test set also increases from 0.411, indicating that the selection is not over-fitted to the optimization set.

S2 Table. Ensemble system performance. The ensemble system is made by combining the predictions of the two machine learning systems (RF = random forest classifier and FFN = feedforward neural network) and the homology transfer (HT) with either boolean AND or boolean OR. Performance is shown as micro-averaged F-score (F), precision (P) and recall (R). FΔ shows the difference to the best-performing system combination, measured in percentage points.