Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis

Data-centric approaches have been utilized to develop predictive methods for elucidating uncharacterized aspects of proteins such as their functions, biophysical properties, subcellular locations and interactions. However, studies indicate that the performance of these methods should be further improved to effectively solve complex problems in biomedicine and biotechnology. A data representation method can be defined as an algorithm that calculates numerical feature vectors for samples in a dataset, to be later used in quantitative modelling tasks. Data representation learning methods do this by training and using a model that employs statistical and machine/deep learning algorithms. These novel methods mostly take inspiration from the data-driven language models that have yielded ground-breaking improvements in the field of natural language processing. Lately, these learned data representations have been applied to the field of protein informatics and have displayed highly promising results in terms of extracting complex traits of proteins regarding sequence-structure-function relations. In this study, we conducted a detailed investigation over protein representation learning methods, by first categorizing and explaining each approach, and then conducting benchmark analyses on; (i) inferring semantic similarities between proteins, (ii) predicting ontology-based protein functions, and (iii) classifying drug target protein families. We examine the advantages and disadvantages of each representation approach over the benchmark results. Finally, we discuss current challenges and suggest future directions. We believe the conclusions of this study will help researchers in applying machine/deep learning-based representation techniques on protein data for various types of predictive tasks. Furthermore, we hope it will demonstrate the potential of machine learning-based data representations for protein science and inspire the development of novel methods/tools to be utilized in the fields of biomedicine and biotechnology.


Introduction
Protein informatics, which can be defined as the use of molecular modelling and data-driven computational methods (e.g., machine learning, statistical modelling) on the proteome to create efficient and scalable solutions, is increasingly becoming an active field of research. The computational methods to come out of such research have the potential to impact daily life through the fields of biomedicine and biotechnology, which have a current market size of 417 Billion USD and is expected to reach 729 Billion USD in 2025 † . Functional annotation of proteins is critical for protein informatics, as they are the primary inputs to said computational methods. As of October 2020, there are around 189 million protein entries in the UniProt knowledgebase; however, only 0.56 million (around 0.3%) of them are manually reviewed and annotated by expert curators, indicating a large gap between the current sequencing and annotation capacities. This gap is mainly due to the cost and time intensive nature of in vitro and in vivo experiments and the manual curation of their results. To supplement experimental and curation-based annotation, automated in silico approaches are being used. In this context, many research groups have been working on developing new computational methods to predict proteins' enzymatic activities 1,2,3 , biophysical properties 4,5,6 , interactions 7 , 3-D structures 8,9,10 , and ultimately, their functions 11,12,13 .
Protein function prediction (PFP) is the assignment of semantic meaning (i.e., functional definitions) to proteins, automatically or semi-automatically. The primary terminology for the functions of biomolecules are codified in the Gene Ontology (GO), a hierarchical network of concepts that annotate molecular functions of genes and proteins, as well as their subcellular localizations and the biological processes in which they are involved 14 . The most comprehensive benchmark project for PFP is the Critical Assessment of Functional Annotation (CAFA) challenge 15 , in which participants predict GO-based functional associations for target proteins.
CAFA challenges so far indicate that PFP is still an open problem.
Protein representation learning method development is a new but highly active area of research, and it mostly gets its technical inspiration from approaches proposed for natural language processing (NLP). It is shown in the literature that various protein representation learning methods, especially the ones that incorporate deep learning, have been successful at extracting relevant inherent features of proteins (Table 1). However, there is no comparative study to systematically evaluate the performance of these methods via quantitative benchmarks, in the context of artificially learning the functional aspects/properties of proteins. Nevertheless, gaining knowledge on complex relationships between sequences, structures, interactions and functions of proteins is especially critical for proposing novel solutions to current biomedical and biotechnological problems.
In this study, we conduct an investigation of the available protein representation learning methods that were proposed since 2015, with a detailed benchmark analysis regarding the potential of these methods to capture the functional properties of proteins. We explain both classical and learning-based methods to provide insight into their respective approaches to represent proteins, and we classify these methods according to their technical aspects and objectives (please see Methods section and the supplementary information document). Aiming to evaluate how much each representation model captures different facets of functional information, we constructed and applied benchmarks based on; (i) semantic similarity inference between proteins, (ii) ontologybased protein function prediction, and (iii) drug-target protein family classification (Results section). Finally, we discuss the results and current issues, and provide a perspective on the future of learned protein representations (Discussion section). The whole study is summarized in Fig. 1a. We expect that the discussion and conclusions of this study will inform researchers who would like to apply machine/deep learning-based representation techniques on biomolecular data for predictive modelling. We believe our investigation will be valuable in gaining insight about the potential of machine learning-based protein representation models in retrieving complex functional relationships of proteins, since previous studies mostly evaluated a few methods over tasks related to the structural features 23,26 . Finally, we hope this study will inspire new ideas for the development of novel, sophisticated and robust approaches to solve open problems in protein informatics.

Results
In this section, we focus on protein representation benchmark analyses. The review of the literature, including the construction and application of protein representations (Fig. 1b), and their technical and objective-based classification and evaluation (Fig. 2) are given in the Methods section and in the supplementary information document.
Nine different representation learning methods have been selected for our functional propertybased benchmarks, according to their previously reported success in predictive tasks, and subject to their availability as open access tools or ready to use pre-constructed feature vectors.
During the selection process, we also considered the source protein features/attributes used to train these methods (e.g., sequence, PPIs, etc.) and the algorithmic approaches, with the aim of covering a wide variety of methodologies. The methods included in the benchmark are thus;  Table 1.

Semantic Similarity Inference
In this analysis, we aim to measure how much information about biomolecular functional similarity is captured by the representation models. In this context, Gene ontology (GO) annotations are utilized, which signify the molecular functions, large-scale biological roles, and subcellular localizations of proteins. We first calculated vector similarities, which are defined in terms of pairwise quantitative similarities (e.g., cosine, Manhattan and Euclidean) between representation vectors of proteins in our dataset. These similarities were then compared to the ground truth best performer in the GO biological process (BP) except for the "all proteins" dataset, and TCGA_Embedding achieved the highest correlation score in the GO cellular component (CC) category. Mut2Vec is also among the top three methods in MF and CC categories.
TCGA_Embedding is also in the top three in the GO BP category. BERT, Gene2Vec, and LearnedEmbeddingVec are other notable methods but did not achieve the top performance in any of the categories.
UniRep 23 is based on multiplicative LSTM 35 . We propose two reasons to explain the higher performance of UniRep. The first reason can be that, UniRep constructs and uses a training sequence dataset with low bias (consisting of 24 million UniRef50 36 protein entries, which are filtered by a 50% similarity threshold from the from UniProtKB, instead of using all available protein sequences in the data source), which might be providing better generalization capability.
The second reason may be the size and the information content of the model. In our benchmarks, we employed the "UniRep Fusion'' model since this version had the highest performance according to the original UniRep study. This model was built with the concatenation of the "final hidden state", "final cell state", and "average hidden state" of the LSTM model, each of which has a size of 1x1900, providing a total vector size of 5700. In our opinion, the concatenation of the different states might have enhanced the protein representation vector with different levels of semantic information, since the level and type of information learnt at each layer is claimed to be distinct 16 .
Mut2Vec 29 was originally developed to predict the effects of mutations, but surprisingly it performed very well in our analysis considering the BP based semantic similarities. The model was developed using patient mutation profiles, biomedical literature and protein-protein interactions. The last two datasets may include information considering the role of the proteins in BPs. For example, considering that two proteins are interacting, then observing them as a part of the same BP is highly probable. Similarly, supposing two proteins had a role in the same BP, they may frequently be observed together in the same text (e.g., article). As a result, those proteins would probably be embedded proximally in the vectorial semantic space. We are suggesting that the top performance of Mut2Vec probably depends on these factors. TCGA_Embedding 31 exploited gene expression data and a simple learning system inspired by non-negative matrix factorization. With this approach, the authors constructed a representation with a vector size of 50, which is one of the smallest representations in our benchmark.
TCGA_Embedding scored the best performance in the prediction of CC-based semantic similarities, together with a notable performance considering the BP-based similarities. Similar to TCGA_Embedding, the Gene2Vec 30 model utilizes gene co-expression data with the skip-gram algorithm. The model performed well in both BP and CC based semantic similarity inference tasks. Gene (co)expression profiles are one of the least studied data types for developing protein representations as only a few studies exist; however, it is found to be quite informative to infer the similarities between proteins in terms of the BPs they take part in and the CCs that they localize to. It was reported in the literature that there is a correlation between the expression profiles and the subcellular locations of genes/proteins, which was evaluated in the context of machine learning-based prediction of protein localizations 37,38 . It was also discussed in the results of the CAFA Pi challenge (over the bacterial motility and biofilm formation biological processes) that gene expression is a critical input data type for predicting the biological roles of proteins 15 . TAPE-BERT 23 had the second place in the MF-based similarity inference task. TAPE-BERT (a bidirectional transformer) is the only method that uses the self-attention mechanism in this analysis.
Technical details of BERT and the self-attention are discussed elsewhere 39,40 . Similar to the success of the BERT model in word sequences (sentences) 41 , TAPE-BERT model can represent protein sequences with high accuracy. It should also be noted that the implementation we used here was directly obtained from the TAPE benchmark study 23 without any fine-tuning. TAPE-BERT may perform better in protein informatics tasks with further optimization.
Finally, LearnedEmbeddingVec 27 , which is a simple model that takes protein sequences at the input level to process them using doc2vec 42 , scored as good as the TAPE-BERT 23 models in our semantic similarity based analysis. The size of the TAPE-BERT models were notably larger compared to LearnedEmbeddingVec 27 (i.e., 12 hidden layers with 768 neurons for each layer, as opposed to 1 hidden layer with 64 neurons). We argue that some of the shallow models still preserve their significance, especially in MF-based semantic similarity inference.

Ontology-based Protein Function Prediction
As the second benchmark of our study, we aimed to assess the success of protein representation models in terms of automated protein function prediction (PFP). In this analysis, Gene Ontology 14 (GO) term annotations of proteins were used to train and test the same 11 protein representation models via supervised machine learning based classification. In this benchmark, we preferred to use a linear classifier (i.e., linear support vector classification from scikit-learn 43 ) in order to prevent non-linear transformations on the protein representation vectors, since the relevant information hidden in proteins should have been captured and extracted by the representation model beforehand, if the model is successful. This way, we could evaluate the protein representation models in terms of their success in extracting this information without additional factors.
Here, we also discussed a critical topic that was mostly overlooked in previous PFP studies, the assessment of the performance in terms of annotated GO term specificity. This is important since there is a relation between the specificity of a GO term (i.e., its location of the graph of GO) and its informativeness. For example, considering an annotation with the GO term "negative regulation of molecular function" the provided information is too general. These GO terms are generally located near the root of the GO graph and called "shallow terms". If the same protein was annotated with the GO term "negative regulation of double-stranded telomeric DNA binding", which is a descendent term of the previous one, the annotation would have been more informative. In order to take this phenomenon into account in our analysis, we simply grouped GO terms under three categories as shallow, normal, and specific; according to their level on the directed acyclic graph of GO (see Methods).
One key problem in applying deep learning to protein informatics is the requirement for high amounts of training data 11 . To examine this issue in our benchmark, we considered a grouping among GO terms regarding the number of proteins they are annotated to. This approach is expected to uncover the performance of representation learning models in terms of learning with only a few training examples, which also is the case for a considerable number of informative GO terms. Furthermore, some of the protein functions are studied well and some others are understudied, which creates a discrepancy in terms of the number of annotated proteins. We expect that this approach will be useful to assess the representations considering their ability to learn under-studied functional properties. For this, we created three categories that point out the number of the proteins annotated to a GO term as; low, middle, and high (see Methods).
PFP performance results are given for 9 different GO groups using F1-score based heat maps in Fig. 4. The overall GO term prediction performance results (averaged over 9 different GO groups) in terms of recall, precision, F1-score, accuracy, and Hamming distance are given in Table S4. It is important to mention that these performances are higher compared to the results of CAFA challenges, due to the way we modelled the experiment. We only run a test sample on the model that contains its true label as one of the 5 tasks (i.e., GO terms), instead of running all test samples on all prediction models. The reason behind this experimental design choice was to prevent the accumulation of the scores of all benchmarked methods in low performance regions (especially for hard-to-predict ontologies such as BP), which would prevent the observation of the performance differences in-between.
It is shown in both Fig. 4 and Table S4 that, in the MF prediction task, top methods showed similar performances across almost all GO groups (e.g., low, high, specific, shallow, etc.), among which SeqVec 28 got the top place, and k-sep-bigrams 33 , TAPE-BERT 23 models and UniRep 44 came right after. For BP and CC prediction tasks, SeqVec was the best performer together with k-sep-bigrams, and runner ups were again TAPE-BERT models and UniRep. We also observed that these methods are clustered together in all three heat maps, considering their performances (Fig. 4). These four representation learning methods share common characteristics that can explain their similar performance in the PFP benchmark. First of all, they are all based on large state-of-the-art sequence modelling algorithms: LSTM (SeqVec and UniRep with 93M and 18.2M parameters, respectively) and a Transformer (TAPE-BERT with 110M parameters). Also, they share the same model training objective, prediction of the next amino acid on the sequence.
Finally, they were all trained with large datasets (24M sequences for UniRep, 33M for SeqVec and 31M domain sequences for TAPE-BERT).
We also calculated model performances averaged for each GO group (Table S5), especially to observe the scores for challenging groups; the "low" group which contain GO terms that have low effective number of proteins (during dataset preparation, we eliminated highly similar proteins by filtering through UniRef clusters, to be able to take the effective number of samples into account while grouping GO terms), and the "specific" group consisting of GO terms that are leaf nodes, or are close to the leaf nodes, in the GO hierarchy. In Table S5, mean F1-score results (considering the average of all methods) indicate that low number of samples is a problem for BP and CC categories, but not so much for MF category, where there is generally an explicit relation between the input (i.e., sequence) and the label. On the other hand, we could not observe a trend in performance change in terms of GO term specificity. As a result, it can be stated that prediction success for informative specific terms may be solely related to the effective number of training proteins. For the MF-low category, k-sep-bigrams (F1:0.916) and SeqVec (F1:0.914) achieved the best performances. In the BP-low and CC-low categories, k-sep-bigrams had the best F1scores with 0.548 and 0.556, respectively. For the MF-specific and BP-specific categories, the SeqVec model got the top scores with F1:0.945 and F1:0.732, respectively. Considering the CCspecific category, the k-sep-bigrams model again got first place with F1:0.552. These results showed that, for the tasks where the number of labelled data points are low, the representation capability of classical (model-driven) methods is still higher compared to learning-based (datadriven) models.
Overall, the best performing method in the protein function prediction benchmark, considering all GO categories, was SeqVec 28 , whereas, k-sep-bigrams 33 , a classical protein representation based on evolutionary relationships, got second place with scores close to the top performer.

Drug Target Protein Family Classification
In our third benchmark analysis, we aimed to measure the performance of protein representations in the framework of drug discovery, with the prediction of drug target proteins' main families (i.e., enzymes, membrane receptors, transcription factors, ion channels and others). Since these families are made up of proteins with distinct structural characteristics, this benchmark analysis will also reflect the ability of these representations in learning structural properties. Furthermore, by using a data source other than GO annotations, we seek to diversify our benchmark and to evaluate the representations from a different perspective. Similar to the ontology-based protein function prediction benchmark, here we preferred to use a multi-task linear SVM classifier in order to solely measure the ability of learned protein representations in extracting the complex protein attributes/properties.
The number of samples in this analysis was lower compared to the previous benchmarks (Table   S6), as we only used human target proteins listed in the ChEMBL database 45 . Since there is an imbalance in terms of the number samples for each task (i.e., protein family), MCC was taken as the most reliable indicator in comparing the representation methods. According to the average cross-validation results of our multi-task classification model (Fig. 5), SeqVec 28 model is the best method in all metrics, whereas TAPE-BERT_Avg 23 came a close second. UniRep 44 , TAPE-BERT_Pool 23 , and k-sep-bigrams 33 models also provided remarkable performances. Family specific prediction scores (Fig. S5) showed that SeqVec provided the best performance in terms of enzymes. For membrane receptors, UniRep provided the best performance, followed by SeqVec. For transcription factors, TAPE-BERT_Avg took the top place. For ion channels and others, SeqVec was again the best performer.
Given that SeqVec was also very successful in the protein function prediction benchmark, especially in MF prediction, and that protein family information is related to functions, there is a plausible correlation between these results. SeqVec uses the ELMO model 46 , a bi-directional LSTM with 93M parameters capable of learning long sequential patterns, which is stated to be highly efficient for language modeling 41 . The most evident difference of SeqVec from the other successful state of the art models in our study (e.g., UniRep and TAPE-BERT) in terms of the model architecture is that SeqVec contains a CNN layer to embed the amino acids in the sequence onto a latent space, before the LSTM layers. In the original ELMO model, the same approach, charCNN 47 , was mainly used to obtain word vectors of fixed size. It is also important to mention that SeqVec displayed a moderate performance on the semantic similarity inference benchmark. This observation indicates that, although protein function prediction and semantic similarity inference can be seen as correlated tasks due to sharing the same information source, specialized solutions are required for each one. The moderate performance of SeqVec on semantic similarity inference might be explained by the noise on the original representation vectors that SeqVec produced. This noise may be filtered out by a simple feature selection during our supervised training in the PFP and drug target protein family classification benchmarks, as a result, SeqVec was successful. However, there was no supervised training in semantic similarity inference. This phenomenon was also observed in the original SeqVec study (see Fig. 2 of the SeqVec paper 28 ). In the first t-SNE plot of this figure, protein classes are distributed heterogeneously when the unsupervised model was directly used to generate protein representation vectors. Contrarily, when the representation vectors generated by the supervised model were used, the classes were successfully clustered.

Discussion
The volume of AI-based protein informatics studies has been growing lately to further the understanding of the complex relations between sequence, structure and function 48 . In this study, we evaluated protein representation learning methods in terms of their ability to capture functional properties of proteins, to be utilized, ultimately, to overcome critical challenges in protein informatics, biotechnology, and biomedicine domains. These models, with their modest resource requirements and high representation performance, can be (re)used for a variety of tasks. Thus, we argue that learned representations will play an essential role in protein research and development in the near future.
In the PFP benchmark, performances observed in CC and BP GO term prediction tasks were lower compared to the MF prediction. This observation is plausible since most of the learningbased methods use protein sequence data as input, and sequence is not a direct marker for localization (without the cleaved signal peptides) or the biological role of proteins. Also, we observed that the success rate in CC term prediction decreases with the decreasing number of annotated proteins. A similar observation was valid for MF and BP categories as well; however, the effect was less pronounced. On the other hand, we did not observe a similar trend in performance change with increasing or decreasing term specificities. Nevertheless, it is possible to state that the issue at hand is still critical since many of the informative "specific" GO terms have "low" number of annotated proteins.
In semantic similarity inference and drug target protein family prediction benchmarks, we observed that some of the learned representations are superior to the classical ones in terms of predictive performance, thus justifying the benefit of the data-driven approach to represent the functional properties of biomolecules. On the other hand, k-sep-bigrams, a classical protein representation method that does not need any training, could compete with deep learning-based protein representation methods in the PFP benchmark. These results indicate that evolutionary relationships are correlated with functional properties of biomolecules to such a degree that a simple representation that utilizes this feature can perform as good as the most complex sequence modelling methods. In the light of these results, we claim that the explicit incorporation of evolutionary information into the training of representation learning models would lead to significant improvements considering predictive performances in protein informatics.
Nevertheless, we still claim that learned protein representations, in their current state, are essential for different reasons. First of all, learning-based models produce reusable vectors, which can be optimized towards increasing predictive performance in challenging tasks (e.g., BP GO term or 3-D structure prediction) with further training on selected supervised tasks. Second, studies indicate that protein representation learning models can also be employed for designing new proteins using the learned probability distributions of the proteins in the training set 49,50 . This is not possible with classical representations. The topic of protein design is discussed further, below. Third, the results of our benchmarks indicate that the approach of learning protein representations is an excellent application field for transfer learning, as it is possible to train a model that could generate a representation vector, and use this vector for various related predictive tasks (e.g., training the model via the prediction of the next amino acid in the sequence and later using the trained model for predicting functions, interactions, or structures).
In benchmark studies, the possibility of data leak from training to test is a critical issue that should be considered during performance testing. Data leak can be defined as the accidental share of knowledge between train and test, leading to overoptimistic performance measurements. In our analyses, we observed that certain representation models performed well in tasks that are biologically related to the tasks that these models were trained on; although the data and the actual tasks were different from each other. For example, Gene2Vec utilized a hyperparameter optimization task, which aims to maximize the clustering of genes within MSigDB 51 functional pathways. This task is assumed to have latent knowledge about BP and CC based protein semantic similarity inference and ontology-based PFP benchmarks, where Gene2Vec showed a notable performance. Likewise, Mut2Vec 29 uses protein-protein interaction data for training, and we found that this model is successful in predicting BP and CC related tasks. It is highly probable that two interacting proteins are localized to the same cellular compartment or have a role in the same biological process. Finally, TAPE-BERT, one of the best performers especially in the MFbased PFP and drug-target protein family classification, is trained on protein domain sequences provided by the Pfam database. The selection of the sequence fragments based on protein families can be stipulated to have led to knowledge transfer from Pfam to the protein representation vectors calculated by TAPE-BERT. In our opinion, these cases of knowledge transfer are unlikely to be counted as examples of data leak from training to test, since the data and the tasks used in train and test were completely independent. Hence, these protein representation models should be considered successful in terms of inferring relevant information from the input data, in the scope of this study. Nonetheless, particulars of such knowledge transfer is an interesting topic to be further investigated in future studies.
There are several challenges within the field of protein representations. First of them is related to the assessment of newly proposed methods, as the proper evaluation of stability and robustness of models is critical. So far in the literature, protein representation models are tested only with small-scale datasets and limited tasks. On the other hand, there are studies (unrelated to protein informatics) in which authors proposed new approaches for rigorously evaluating the properties of data representation models [52][53][54] . These studies can be exploited and adapted for the evaluation of learned protein representations. Another key challenge is associated with model sizes. In the NLP domain, the number of parameters is steadily increasing with every new high-performance model (e.g., state-of-the-art GPT-3 model has 175 billion parameters). As most of the successful protein representation learning approaches are based on NLP models, this trend is also observed in the protein representation field 55 . This may pose a critical problem in terms of increasing the computational cost to extreme scales, especially for embedding large samples 56 using sequencebased protein representations, where each amino acid in a sequence is modelled as a word in a sentence 26 . To make a simple comparison, the average size of a sentence in English is 21.7 words 57 ; however, the median number of amino acids in human proteins is 361 58 , which makes the problem even more pronounced for protein informatics. There are potential solutions for this issue in the literature 56,59-61 , mainly proposed for NLP-related tasks. These solutions may also be exploited for protein sequence representations. It should also be noted that model sizes (e.g., number of hidden layers, total number of parameters) are not necessarily correlated with performance in protein representation models 26 . We observed this phenomenon in our benchmarks as well. For example, the UniRep model has 18.2M parameters but could compete with much larger models such as SeqVec (93M parameters) and TAPE-BERT (110M parameters). Therefore, constructing larger and more complex models may not always be the solution for better representations. Instead, investing time and resources on the incorporation of diverse types of biological data into the models would be a better choice.
Model interpretability is a critical topic to understand why a model behaves the way it does. In an interpretable (i.e., explainable) representation, all features are encoded in a distributed form, which means that the feature(s) corresponding to each dimension on the vector is known.
However, most of the learned protein representations are not interpretable/explainable. In other words, the meaning of a feature encoded in a dimension of the output vector is not known. For example, presence of a TIM barrel structure in a protein might be encoded in the 5th dimension of its representation vector, whereas, the molecular weight information may be shared between the 3rd and 4th dimensions. In the general field of data science, disentanglement studies try to associate the real properties of input samples with individual dimensions of the output vectors 62 .
The disentanglement of protein representations is a new subject, and only a few representation model developers have explored this issue so far 44,49 , as a result, there is yet to be a systematic approach. Therefore, systematic benchmarking platforms are required for the standardized evaluation of protein representation model interpretabilities.
Most of the protein representation models so far are trained using only one type of data (e.g., protein sequence). However, protein knowledge is associated with multiple types of biological data, such as protein-protein interactions (PPI), post-translational modifications, gene/protein (co)expressions, etc., along with sequences. According to the best of our knowledge, only a few of the available protein representation models are trained with more than two types of data 29,63 .
Among the methods in our benchmark study, Mut2Vec 29,63 harmonized multiple types of data (i.e., PPIs, mutations and biomedical texts), and produced more accurate results than many of the solely sequence based representations in GO BP and CC based PFP. We propose that the integration of additional types of protein related data may further augment the accuracy in predictive tasks. Furthermore, there is a clear requirement in the literature for holistic protein vectors that can effectively represent proteins from a generalized point of view, to be used for various different proteins informatics-related purposes. In our opinion, it may be possible to create these holistic representations by concatenating multiple representation vectors that were previously constructed using different types of biological data (as a means of pre-training), and training a new model using the integrated version of these vectors over high-level supervised tasks such as predicting biological processes and/or complex structural features (Fig. S6).
Protein design is one of the key challenges in biotechnology 64  We believe protein representation learning approaches will have an influence on various fields of the protein science with real-world applications, in the near future, thanks to their flexibility to integrate heterogeneous protein data at the input level (i.e., physical and chemical properties/attributes, functional annotations, etc.), and their ability to efficiently extract complex hidden features.

Methods
In this section, we first explain different approaches in representing proteins, along with an evaluation of representation learning methods from a technical point of view. A section specifically dedicated to classical representation methods is provided under supplementary information section 1. We then summarize representation learning methods according to their training objectives and provide detailed information about those we included in our benchmark analyses (the rest of the protein representation methods are detailed in supplementary information section 3). Finally, we present methodological details regarding the datasets, modelling approaches, training and test procedures and performance metrics for each benchmark task (i.e., semantic similarity inference, ontology-based protein function prediction and drug target protein family classification). We shared the source code, models and datasets of this study in our repository (https://github.com/serbulent/TrainableRepresentationAnalysis) so that the data can be used by other groups for benchmarking new representation models and to compare the results with the ones provided in this study.

Different Approaches for Representing Proteins
Feature vectors should ideally represent relevant properties of the data at hand (e.g., physical, chemical, or biological properties while representing proteins). For example, within the classical protein representation approach, a protein can be represented as a 2-dimensional numeric vector where the first dimension corresponds to the mean hydrophobicity value, and the second is the mean net charge 88 . Using these simple vectors as input, a classifier can be trained (Fig. S1a). In another example shown in Fig. S1b, the representation learning approach is employed, where Gene Ontology (GO) based functional annotations of proteins are used as the input data. An initial binary matrix that displays the associations between proteins and GO terms, is decomposed into latent protein and GO term matrices using matrix factorization. Afterwards, a predicted protein vs. GO term matrix is calculated with the dot product of the first, and the transpose of the second latent matrices. The error between the original and predicted matrices is used to update the parameters of the model during training. When the training is finished, each row of the finalized latent matrices is used as a feature vector that represent the respective protein or GO term. These feature vectors can then be used as input to other classification or clustering models for different predictive tasks.
In the domain of natural language processing (NLP), one of the first contemporary word representation learning methods, word2vec, was developed by Mikolov et al. 89 . Word2Vec is an unsupervised learning network that calculates a vector representation for each word in a text. In word2vec-like NLP models, the learning process is based on the co-occurrence of words. During the training of a word2vec model with the skip-gram architecture, the vector that represents the current word is optimized considering the correct prediction of surrounding words. In a successful representation model, e.g., the words "protein" and "gene" will be proximally located in the feature hyperspace (i.e., they are found to be semantically related to each other), since they are frequently observed together. Additionally, the model may also discover relationships between these words transitively, if, for example, the word pairs "protein -mRNA" and "mRNA -gene" are observed in the input text samples. Word2vec is an example of shallow data representation approaches, in which only one effective data processing layer (i.e., the hidden layer in an artificial neural network) is presented. Word2vec laid the foundation of many widely-used data representation learning methods available today, including protein representations. Later, deeper models, in which there are more than one effective data processing layer 90

Summary of the Evaluated Protein Representation Learning Methods
We group protein representation learning methods' technical approaches (Fig. 2a), and objectives and applications reported in their respective publications (Fig. 2b). Here, we formed five main categories according to the application domains; (i) protein interaction prediction (essential for understanding molecular mechanisms and pathways), (ii) physicochemical feature prediction (important for protein engineering and drug discovery related tasks), (iii) genetic feature prediction, (iv) protein function prediction, and (v) structural feature prediction. Fig. 2b categorizes the main domains and specific application fields under each one. Methods with more than one objective were classified according to their major objective. The methods that we included in our benchmark study (i.e., LearnedEmbeddingVec 27 , SeqVec 28 , Mut2Vec 29 , Gene2Vec 30 , TCGA_Embedding 31 , ProtVec 4 , TAPE-BERT_Avg 23 , TAPE-BERT_Pool 23 , UniRep 23 , and the two classical representations, APAAC 32 and k-sep-bigrams 33 ) are described below, the rest of the methods are explained in supplementary information document section 3. All of these protein representation methods are summarized in terms of their technical aspects (e.g., learning approach, algorithm, etc.), input data types, vector sizes, objectives, applications, importance and available data repositories in Table 1.
Asgari et al. 4 were one of the first to create a protein representation learning model to represent proteins by applying word2vec 89 in a method they named ProtVec. The authors treated each protein sequence as a sentence, and each k-mer (i.e., k length amino acid sequence) as a word.
Authors claimed that the method could be employed for different problems in protein biology including protein function prediction and protein interaction prediction. They evaluated the performance of ProtVec in predicting the mass, volume, polarity, hydrophobicity and charge of proteins, as well as its accuracy in disordered protein classification.
In the study by Yang et al. 27 , learned protein embeddings (i.e., representation vectors) with sizes ranging from 4 to 128 dimensions are constructed using the doc2vec algorithm 42 on nonoverlapping k-mers. We will call this method "LearnedEmbeddingVec" in the rest of this study, as In the study conducted by Du et al. 30 , the method Gene2Vec is proposed, where 200-dimensional vectors are calculated to represent genes, using skip-gram 89 . Hyperparameter tuning (e.g., vector size and window size optimization) was applied with the objective of maximizing the clusterdness of genes within MSigDB 51 functional pathways. The input data, gene co-expression profiles, were gathered from the GEO database 116 . The major objective of the study is predicting gene-gene interactions (i.e., the genes acting in the same biological process), in which Gene2Vec was reported to be successful. Additionally, it was indicated that the model could summarize latent semantic information about genes by accurately representing functional similarities over tissue specific gene clusters. We employed gene representation vectors of Gene2Vec in our benchmarks by mapping them to canonical forms of their respective gene products (proteins).
Kim et al. 29 trained a mutation representation model named Mut2Vec. The aim of the proposed model was the classification of mutations according to their disease-causing effects. In Mut2Vec, mutation co-occurrence information, protein-protein interaction (PPI) networks (from BioGRID), and biomedical literature abstracts (from PubMed) were used to construct the representation model. Among alternatives, the model that utilizes the co-occurrence information via skip-gram 89 was chosen as the finalized representation model. In the Mut2Vec workflow, first mutation cooccurrences and PubMed texts were used to calculate representation vectors. PPI data was integrated at the post-processing phase, using a retrofitting process similar to WordNet 117 . The authors stated that Mut2Vec could separate passenger and driver mutations successfully, and that, it produces promising results in the detection of new cancerous mutation candidates.
Heinzinger et al. 28 used Embeddings from Language Models (ELMO), a bi-directional LSTM which is popular in the NLP domain 46 , to represent proteins using unlabelled protein sequence data. The authors aimed to solve the problem of global representation methods' shortcomings in inferring information from the local context, with their method, SeqVec, using a smaller version of ELMO with 244k parameters, resulting in a significant speed advantage over up-to-date language models such as BERT 91 . However, their results show that the smaller model could not surpass the state-of-the-art methods, especially on sequence level tasks such as secondary structure prediction. On the other hand, the model produced competitive results for protein level tasks such as subcellular localization prediction. In our opinion, this can be attributed to the extensive training dataset and the successful detection of conserved sequence patterns by the LSTM.  31 . In the proposed method, first, a gene expression matrix of cancer samples was prepared using data from the TCGA database. The authors then applied a matrix decomposition with a fully connected neural network layer. Next, through matrix multiplication on these decomposed matrices, they created a predicted version of the original matrix. The error between the original and predicted matrices was used for backpropagation. The decomposed matrices consisted of gene-features and samples-features as dimensions. The authors showed that semantic relationships between samples and genes are conserved in their model. Even though the gene expression levels were not correlated, functionally related genes are observed in adjacent locations, when the multidimensional distance was calculated on the representation vectors. Additionally, when the representation vectors were inspected, it was seen that similar cancer types were clustered in the representation space to the extent that the authors claim that molecular subtyping of cancer was possible using the representations. We refer to this method as "TCGA_Embedding'' throughout this paper, as the authors did not provide a specific name for their model in the original study.
Finally, two widely used classical representations (APAAC and k-sep-bigrams) were included to our benchmark, as baseline models. Amphiphilic Pseudo-Amino Acid Composition (APAAC) utilizes the physicochemical properties of amino acids together with amino acid compositions 32 , and k-sep-bigrams makes use of the evolutionary relationships between proteins 33 . We evaluate the performance of learned representations in comparison to these baselines, to assess the value added by the newer methods. Below, we look at these baseline methods in more detail. Evolutionary information is widely used in classical protein representations. In k-separatedbigrams method, row-type matrix transformations on position specific scoring matrices (PSSM), which are constructed using multiple sequence alignments generated from the query sequence and its homologs, are utilized for calculating the bigram transition probabilities between residues that are "k" positions apart from each other. The final representation vectors have the size of 400x1, each dimension representing a specific transition probability from one amino acid to another (20x20). The method was reported to be successful in predicting type IV secretion effectors 33 .

Semantic Similarity Inference Benchmark
To construct the dataset of semantic similarity inference benchmark, we downloaded all human protein entries in the UniProtKB/Swiss-Prot database and their GO term annotations from the UniProt-GOA database in the 2019_11 release. The electronically inferred annotations, labelled with the "IEA" evidence code, were excluded from the dataset; leaving only the annotations reviewed by human experts. After that, we enriched the dataset by propagating the annotations to the parent terms of the asserted GO terms on the directed acyclic graph (DAG) of GO, according to the true path rule. Our finalized annotation dataset contained 14,625 distinct GO terms (3,374 of them belonged to molecular function -MF, 9,820 belonged to biological process -BP, and 1,431 belonged to cellular component -CC categories) and 326,009 annotations (75,884 of them belonged to MF, 154,532 belonged to BP, and 95,593 belonged to CC categories).
To be used as the ground-truth/reference data in this benchmark, we calculated the true GObased semantic similarities between all proteins in our dataset independently for all GO aspects (i.e., MF, BP and CC) using Lin similarity in the GoSemSim package 121 . Lin similarity 34 is based on Shannon's information theory, which states that the information content (IC) of an event is negatively proportional to the observation probability (P) of the event. Information content (IC) is formulated as; !"($) = ()*(1/$) (1) Another concept used in Lin similarity is the least common subsumer (LCS). LCS is the first common ancestor of the two GO terms when traveling to the root in the hierarchical GO graph.
Hence, Lin similarity is defined as; More information about the semantic similarity measures can be found in the literature 122 .
Next, we prepared four protein semantic similarity datasets (i.e., "all proteins", "well annotated This final dataset contains 247 similarity scores between 40 different proteins (labelled as "sparse uniform" in the related figures). Thus, among our 4 datasets, "sparse uniform" is the most trivial one to predict and "all proteins" is the most challenging.
In the benchmark phase, we compiled the protein representation vectors for the human protein entries in our dataset using the selected representation learning methods, which are Gene2Vec 30 , At this point, we had two pairwise similarity arrays at hand; the first one was calculated by taking the GO-derived semantic similarities between the proteins in our dataset into account (i.e., true semantic similarities), and the second one consisted of pairwise similarities calculated directly from the representation vectors.
Finally, to observe and to compare the performance of protein representation models for inferring these semantic similarities, we calculated the Spearman rank-order correlation 123 values (explained below under "Performance metrics" sub-section) between the ranked lists of representation vector similarities and true semantic similarities.

Ontology-based Protein Function Prediction Benchmark
The details of the dataset preparation procedure for the protein function prediction benchmark is explained below in six steps. For each GO category (i.e., MF, BP, CC); 1) We obtained human proteins and their GO term annotations from the "2019_10" version of UniProtKB/Swiss-Prot and UniProtGOA databases, respectively.
2) We excluded all electronically made annotations (evidence code: IEA) from the list of GO term annotations with the aim increasing the reliability of annotations and to prevent error propagation during prediction.
3) For each GO term, we created an individual list that includes the accessions of the annotated proteins, to be used in model training and testing via cross-validation. We filtered each protein list using the UniRef clusters 36 by only selecting the representative protein entry from each cluster. UniRef provides protein clusters that are formed based on sequence similarity.
We used UniRef50 clusters, to ensure that there are no protein sequences with more than 50% sequence similarity in each list. Here, the aim is to create train/test datasets without similar proteins that could otherwise introduce a bias to the analysis. 4) GO terms were grouped as either "low", "middle", or "high" according to the number of annotated proteins. GO terms with 2 to 30 annotated proteins were placed in the "low" group, terms with 100 to 500 annotated were placed in the "middle" group and terms with more than 1000 annotated proteins were placed in the "high" group. We deliberately left margins between groups to obtain a clear separation.
5) The specificity of the GO terms was determined as either "shallow", "normal", and "specific".
In the directed acyclic graph of GO, terms in the first ⅓ of the max depth of that branch were considered as "shallow", terms in the second ⅓ of the max depth of that branch were categorized to "normal", and the deepest ⅓ were placed to "specific" group. It should be noted that the max depth varies according to GO category.
6) Based on the combinations of groups constructed in steps 4 and 5; a total of 9 GO term groups (3x3) were formed for each GO category (i.e., MF-low-specific, BP-high-shallow and etc.), making a total of 27 groups (9x3). There were no GO terms that correspond to two of these groups (e.g., MF-high-specific and CC-high-specific), and therefore, these groups are left out of this analysis. Since most of the remaining 25 groups were highly crowded, we selected 5 terms from each group for further evaluation (4 groups already had less than five GO terms, thus, they were directly incorporated without further selection). We intended to select dissimilar GO terms to be able to generalize the results over the whole functional spectrum, as much as possible. For this, we calculated pairwise semantic similarities between GO terms using Lin similarity, and 5 most dissimilar terms were chosen for each group. The statistics of the finalized datasets are given in Table S2 and the identifiers of the selected GO terms are given in Table S3.
Using these datasets, prediction models were constructed (one for each group, mostly made up of 5 GO terms) for each protein representation model using the "Linear Support Vector Classification" module of the scikit-learn library 43 within a multi-task modelling approach, making a total number of 275 prediction models (25 GO groups x 11 representation models). A 5-fold cross-validation was used to evaluate performance for each model. The hyperparameters of the SVM, for all models, were selected based on the default values; the regularization parameter (C) was set to 1.0, L2 norm was selected for error penalty, and the squared hinge was chosen for the loss function. Since the linear classification model is simple, we assumed that the effect of the hyper-parameter selection would be minimal.

Drug Target Protein Family Classification Benchmark
To construct our family classification benchmark dataset, we employed ChEMBL database (v.25) 45 , which contains curated collections of drug/compound-target protein interaction data (i.e., bioactivities) to be utilized for experimental and computational research in drug discovery and development. Taking into account the hierarchical target protein categorization system presented in ChEMBL, we use 4 broad target protein families and grouped the rest of the targets as a fifth category (i.e., enzymes, membrane receptors, transcription factors, ion channels, and others).
The number of proteins for each family and representation method is shown in Table S6. Small differences between the dataset sizes of different representation methods was due to the availability of vectors, and assumed to be negligible (the largest difference was around 3%).
UniRef filtering was not used in this benchmark due to small sizes of original datasets (e.g., transcription factors class was composed of 82 human target proteins). The family information is used as class labels for the multi-task training of the target protein family classification model.
The stochastic gradient descent classifier -SGDClassifier-(i.e., a linear SVM) from the scikit-learn library 43 was used with "OneVsRestClassifier" option for the multi-task classification. The classifier was used with default parameters; SVM for fitting the SGDClassifier, hinge as the loss function, and L2 norm as the error penalty. The model was trained and tested with 10-fold crossvalidation. The whole process was repeated 100 times, and the average results are reported.

Performance metrics
In our semantic similarity inference benchmark, we used Spearman rank correlation 123 . For a sample with size n and the ranks of variables !" ! ! and !" " ! , Spearman rank correlation (! # ) can be defined as: where difference between ranks for observations is defined by: For ontology-based protein function prediction and drug-target protein family classification benchmarks, we mainly used recall, precision, F1-score, accuracy, Matthews correlation where u and v are 1-dimensional arrays of real and predicted class labels, respectively, $is the Kronecker delta function, and k is the vector dimension.
In ontology-based protein function prediction benchmark, F1-score and its components precision and recall are weighted inversely proportional to the class sizes. Since the classes were highly imbalanced, the weighting operation was required for an unbiased analysis. In both ontologybased protein function prediction and drug target protein family classification benchmarks, models were designed as multi-task (i.e., 5 GO terms are predicted by one function prediction model, and 5 proteins families are predicted by one family classification model). In ontologybased protein function prediction benchmark, the models were also designed as multi-label, where more than one GO term can be predicted to a test protein (since a protein can have more than one function). In this setting, a random predictor would produce a correct prediction once out of 32 cases (i.e., 2 5 different combinations exist for a label vector of size 5x1, one of which is the true label vector). Whereas, models are designed as single-label in the drug target protein family prediction benchmark (since each protein can only belong one of the main families), meaning that a random predictor would produce a correct prediction once out of 5 cases (i.e., only 5 different combinations exist for a label vector of size 5x1, one of which is the true label vector).

Availability
All of the datasets, results and the source code of this project are available for download at: https://github.com/serbulent/TrainableRepresentationAnalysis.      Tables   Table 1. A near-comprehensive list of protein representation learning methods including names, references, learning approach, depth of the system, utilized machine learning algorithm, input data type, vectors sizes, objectives and applications of systems, the importance of studies, and the availability as tools or source code (vector sizes vary for some of the methods, in those cases, we indicate the vector sizes that yield the best predictive performance).