PPPred: Classifying Protein-phenotype Co-mentions Extracted from Biomedical Literature

The MEDLINE database provides an extensive source of scientific articles and heterogeneous biomedical information in the form of unstructured text. Among the most important knowledge contained in these articles are the relations between human proteins and their phenotypes, which can remain hidden due to the exponential growth of publications. This has presented a range of opportunities for the development of computational methods to extract these biomedical relations from the articles. However, no such method currently exists for the automated extraction of relations involving human proteins and Human Phenotype Ontology (HPO) terms. In our previous work, we developed a comprehensive database composed of all co-mentions of proteins and phenotypes. In this study, we present a supervised machine learning approach called PPPred (Protein-Phenotype Predictor) for classifying the validity of a given sentence-level co-mention. Using an in-house developed gold-standard dataset, we demonstrate that PPPred significantly outperforms several baseline methods. This two-step approach of co-mention extraction and classification constitutes a complete biomedical relation extraction pipeline for extracting protein-phenotype relations.


INTRODUCTION
Proteins are one of the most critical biomolecules for the development and maintenance of life [3]. A cell's full complement of expressed proteins, the proteome, is both dynamic and multidimensional with many proteins operating in a complex network ensuring the integrity of cellular structure and function [20]. Changes in critical regions of a protein's structure often caused by errors in the underlying genetic sequence of the protein or in its regulation can alter the protein's function-specific 3D structure, resulting in an alteration of phenotype [11]. In the medical context, a phenotype can be characterized as a deviation from normal morphology or behavior [32]. Well known alterations in phenotype brought about by changes in one or more proteins or their regulation involved in important biological pathways include Alzheimer's disease, Parkinson's disease, Huntington's disease, cancer, cystic fibrosis and type II diabetes [3,12,25]. Uncovering novel changes in protein structure, function and regulation, and understanding how these alterations lead to human disorders is a very active area of research in the biological community [3,5,11,12,20,25,31,35].
The Human Phenotype Ontology (HPO) is a standardized vocabulary that includes a wide range of phenotypic abnormalities observed in human diseases [17]. HPO is composed of five sub-ontologies, among which Phenotypic abnormality is the main sub-ontology and describes clinical abnormalities. Each sub-ontology includes HPO terms and associated HPO identifiers (IDs), e.g. Parkinsonism, HP:0001300. Each sub-ontology is organized in a hierarchical structure where more general terms are closer to the top and more specific terms are closer to the bottom. Each pair of linked terms in the hierarchy is connected by an is-a relationship. In this paper, we use "phenotypes" and "HPO terms" interchangeably. The HPO website provides gold-standard annotations for a large collection of human proteins through biocuration, which is the process of extracting knowledge from unstructured text and storing the data in knowledge bases. However, only a small portion of known human proteins currently have HPO annotations [17]. It is believed that many other human proteins are associated with diseases, and hence continuing to expand knowledge bases such as the HPO database through biocuration is of utmost importance for potential downstream applications in medicine and healthcare. However, biocuration, which is usually performed manually with the help of computational tools [9], is generally considered tedious and resource-consuming. Hence, efficient and accurate computational tools are required to expedite the process and bridge the gap between the typically slower rate of human annotation and the vast and exponentially increasing amount of literature on the subject [9]. As a result, developing computational models to extract relations between proteins and phenotypes has gained recent interest among scientists working in the field of biomedical natural language processing [10,16,18,37].
However, to the best of our knowledge, no such computational methods exist for automatically extracting human protein-HPO term relations from biomedical literature. In this paper, we use the terms "relations" and "relationships", interchangeably.
As a solution to the above, we propose a two-step approach for extracting human protein-HPO term relations. The first step is to extract protein-HPO co-mentions, which are co-occurrences of protein names and phenotype names in a certain span of text, e.g. a sentence or a paragraph [16]. In our previous work, we developed ProPheno, an online and publicly accessible dataset composed of proteins, phenotypes (HPO terms), and their co-occurrences (co-mentions) in text, which are extracted from Medline abstracts and PubMed Central (PMC) Open Access full-text articles using a sophisticated in-house developed text mining pipeline [29]. This dataset covers all terms in the Phenotypic abnormality sub-ontology. However, a knowledge-free Natural Language Processing (NLP) pipeline extracts every co-mention of proteins and phenotypes, and a co-mention alone does not imply that there is a relationship between the two entities (see Figure 1 for an example).
Therefore, in the second step, extracted co-mentions are filtered using a co-mention classifier that can distinguish between good and bad co-mentions. We define a co-mention as a good co-mention if there is enough evidence conveyed in the corresponding span of text indicating a relationship between the protein and the phenotype. In other words, a good co-mention expresses a valid relationship between the two entities according to the meaning of the context text. Figure 2 depicts an example of a good co-mention of a protein and a phenotype in a sentence. The combination of a co-mention extractor and a co-mention classifier/filter constitutes a complete relation extraction pipeline.
The development of PPPred (Protein-Phenotype Predictor), a novel co-mention classifier for classifying protein-phenotype co-mentions, is the primary focus of the work presented in this paper. We first randomly select a subset of co-mentions from the ProPheno database and have them curated by two biologists. This gold-standard dataset is composed of 809 human protein-HPO term co-mentions annotated with binary labels of good/bad. We then use this gold-standard dataset to develop predictive models using machine learning techniques. Our machine learning models employ a large collection of both syntactic and semantic features. Finally, we demonstrate that PPPred significantly outperforms other baseline methods on the task of protein-HPO term co-mention classification.
The main contributions of the paper are as follows. This is the first analysis of the problem of human protein-HPO term relation extraction from biomedical literature. We model this relation extraction task as a two-step process composed of co-mention extraction and classification. We formulate the co-mention classification problem as a supervised learning problem using the gold-standard data generated by biologists. This is also the first such gold-standard data for human protein-HPO term relation extraction, and it is made publicly available. A filter or a classifier that can identify good co-mentions can be used by annotators to significantly speed up the biocuration process. In addition, it can be used to provide much higher quality co-mentions as input to downstream applications such as human protein-HPO term prediction [28], which would likely lead to better predictions.
The rest of the paper is organized as follows. Section 2 provides a brief background on the related work in this area. The proposed method is discussed in Section 3. Section 4 discusses the results of running this method and compares the results with other methods and provides a discussion on the results. Finally, Section 5 concludes the study and discusses future work and open problems.

RELATED WORK
The main approaches for biomedical relation extraction include co-occurrence-based methods, rule-based methods, and machine learning-based methods. Co-occurrence methods simply look for any co-mention of the two entities of interest in a particular span of text, e.g. sentence, paragraph, etc., and usually provide low precision and high recall values [4]. Rule-based methods define linguistic patterns and extract the relations using the patterns [1,13,22,26,30]. The rules can be derived from manually annotated corpora using machine learning algorithms or defined manually by a domain expert. Several studies focus on employing lexical analyzers and parsers to identify the relations between entities [7,34,38,39]. For example, Yakushiji et al. introduce a full parser for analyzing biomedical text using a general-purpose parser [39].
Figure 2: An example of a sentence-level protein-phenotype co-mention which is extracted from the article PMID: 798461.
Machine learning-based approaches are also employed for relation extraction from biomedical text [16,19,21,36]. The machine learning category includes methods based on feature engineering, graph kernels, and deep learning. Support Vector Machines (SVMs) have shown high performance in biomedical relation extraction, but they need feature engineering, which is a skill-dependent task [40]. Kernel-based methods also require designing suitable kernel functions. Deep neural network-based methods eliminate the need for feature extraction and rule definition, and provide state-of-the-art performance on various tasks in biomedical relation extraction [27,40]. However, they typically require much larger datasets than traditional machine learning models.
Craven presents a machine learning method for mapping information from Medline abstracts to knowledge bases [8]. Katrenko and Adriaans propose a method that uses syntactic information and can be used with various machine learning methods [15]. Supervised and unsupervised methods have also been employed in different studies that show improvement in the relation extraction task [2,18,23,33].
Khordad and Mercer introduce a machine learning method for identifying genotype-phenotype relations which uses a semi automatic approach for annotating more sentences to enlarge the training set [16]. Extracting the relations between genotypes and phenotypes can also be performed by combining molecular and phenotypic information [10].
Despite a large number of studies conducted on extracting entity relations from the biomedical literature (including a handful of methods for extracting relations between genes/proteins and phenotypes), no methods exist specifically for human protein-HPO term relation extraction. Therefore, to the best of our knowledge, this is the first study on the problem of protein-HPO term relation extraction from biomedical literature, and PPPred is the first such method. We note that GenePheno [14] is the only related method that uses an ontology-based approach to extract gene-phenotype associations from the literature. It first recognizes all mentions of gene and HPO terms within sentences in the whole corpus and then uses a co-occurrence-based metric for ranking those pairs. The highest-ranked pairs are predicted as gene-phenotype associations. While GenePheno ranks gene-phenotype pairs rather than classifying individual co-mentions (i.e. sentences), we still use it as one of the baseline methods because the problem it solves is close to the task of protein-phenotype co-mention classification addressed by PPPred.

METHODOLOGY

Approach
In this work, we formulate the task of co-mention classification as a supervised learning problem as described below.
Given a context $C = w_1 w_2 \ldots e_1 \ldots w_3 \ldots e_2 \ldots w_{n-1} w_n$ composed of words $w_i$ and the two entities $e_1$ and $e_2$, we define a mapping $f_R(\cdot)$ as:

$$
f_R(T(C)) =
\begin{cases}
1 & \text{if } e_1 \text{ and } e_2 \text{ are related according to } R \\
0 & \text{otherwise,}
\end{cases}
$$

where $T(C)$ is a high-level feature representation of the context, $e_1$ and $e_2$ are the entities representing the protein and the phenotype, and $R$ is the relation that represents the protein-phenotype relationship between the two. An example is considered a positive example if the meaning of the context suggests that the protein and the phenotype are related (i.e. a good co-mention). Otherwise, it is labeled as a negative example. In this work, the context $C$ is a single sentence (i.e., the sentence containing the mentions of the two entities). Figure 2 depicts a sentence which is labeled as a positive example (i.e., $f_R = 1$) because it provides evidence for the relationship between the two entities "Insulin" (protein) and "Atherosclerosis" (phenotype). We model this problem as a supervised learning problem and use binary classifiers for learning $f_R$.

Figure 3 depicts the overview of the PPPred pipeline, which is capable of classifying sentence-level co-mentions of proteins and phenotypes from biomedical literature. The input is a set of sentences that contain co-mentions of proteins and phenotypes. The preprocessing step comprises tokenization, removal of punctuation and stop words, and stemming. In the next step, we extract features from the input sentences and train the model, which is then able to classify the co-mentions. The steps are discussed in detail in the following sections.

Dataset
The first step in building a co-mention classifier is to create a manually annotated gold-standard dataset of co-mentions of proteins and phenotypes. For this purpose, we use ProPheno 1.0 [29], which is a dataset of protein-phenotype co-mentions extracted from the entire biomedical literature. This dataset maps the proteins and phenotypes to the corresponding UniProt IDs and HPO IDs. We randomly select 809 sentence-level co-mentions of proteins and phenotypes from ProPheno. This subset is then annotated by two biologists to generate the gold-standard dataset. The annotators were instructed to label a co-mention as good/positive if the sentence conveys that the protein and the phenotype have a relationship. Otherwise, the co-mention was labeled bad/negative.
Table 1 shows the distribution of co-mention types in the gold-standard dataset. According to Table 1, 39% of the sentences are extracted from abstracts and 61% from full-text articles. Among the sentences from abstracts, 53% are labeled as "good" and 47% as "bad". For the sentences from full-text articles, the distribution is 70% "good" and 30% "bad". The overall class distribution is 64% "good" and 36% "bad". The inter-annotator agreement is calculated using Cohen's Kappa statistic [24]; the corresponding value is 0.64, which indicates substantial agreement.
Tables 2 and 3 show the most frequent phenotypes and proteins in the dataset, respectively. According to the tables, 15% of the sentences mention the protein "Receptor tyrosine-protein kinase erbB-2" (P04626) and 43% of the sentences discuss the HPO term "Neoplasm" (HP:0002664) (other names: "Cancer" or "Tumour"). Table 4 shows the most frequent protein-phenotype pairs mentioned in the dataset. We observe that 10% of the co-mentions in the dataset mention the above protein-phenotype pair, which shows that this pair is well studied.
Figure 4 depicts the distribution of the depths of HPO terms in the gold-standard.
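The inter-annotator agreement reported above can be made concrete with a small sketch of Cohen's Kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The function name and the toy labels below are illustrative assumptions and are not the gold-standard annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy labels invented for illustration only (not the paper's data).
annotator_1 = ["good", "good", "bad", "good", "bad", "bad"]
annotator_2 = ["good", "bad", "bad", "good", "bad", "good"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

On real annotations, the two lists would hold each biologist's good/bad label for the same ordered set of co-mentions.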

Preprocessing
In the next step, we preprocess the sentences by tokenizing them, removing highly frequent words (stop words), and performing lemmatization. In this step, we also replace the protein and phenotype entities with PROT and PHENO, respectively. This replacement helps us keep track of the actual entities when the sentence contains more than one entity with the same name, and avoids confusion when an entity name contains more than one word.
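The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stop-word list, the function names, and the example sentence are hypothetical, entity positions are assumed to be given as non-overlapping character spans, and lemmatization is omitted because it requires an external NLP library.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"the", "of", "in", "a", "an", "is", "are", "and", "to", "with", "for"}

def mask_entities(sentence, protein_span, phenotype_span):
    """Replace the (start, end) character spans with PROT and PHENO."""
    spans = sorted(
        [(protein_span, "PROT"), (phenotype_span, "PHENO")],
        key=lambda item: item[0][0],
        reverse=True,  # replace right-to-left so earlier offsets stay valid
    )
    for (start, end), placeholder in spans:
        sentence = sentence[:start] + placeholder + sentence[end:]
    return sentence

def preprocess(masked_sentence):
    """Tokenize, drop punctuation and stop words, lowercase ordinary tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", masked_sentence)
    kept = []
    for token in tokens:
        if token in ("PROT", "PHENO"):
            kept.append(token)  # keep placeholders exactly as they are
        elif token.lower() not in STOP_WORDS:
            kept.append(token.lower())
    return kept

# Invented example sentence, used only to show the transformation.
sentence = "Mutations in the BRCA1 gene are associated with breast cancer."
p_start = sentence.index("BRCA1")
h_start = sentence.index("breast cancer")
masked = mask_entities(
    sentence,
    (p_start, p_start + len("BRCA1")),
    (h_start, h_start + len("breast cancer")),
)
tokens = preprocess(masked)
```

Masking before tokenization means a multi-word entity such as "breast cancer" becomes the single token PHENO.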

Feature Extraction
We define the following items as the features for classification. These features are categorized into three major types, i.e. bag-of-words, engineered features, and distantly supervised (DS) features.

Bag-of-words (BoW) Feature.
Here each feature is a token from the context sentence, and the feature value is the token's frequency in that sentence.

Engineered Features.
[...] (2) informative features used with similar relation extraction problems [21]. The full list of engineered features and their value types (within parentheses) is as follows:
(1) Shortest dependency path between PROT and PHENO in the dependency graph of the sentence (integer).
[...]
(4) The number of tokens in the sentence (integer).
(5) Existence of interaction words acquired from a study by Chowdhary et al. [6] (boolean).
(6) Existence of seven trigger words provided by biologists, e.g. [...]
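To illustrate the bag-of-words feature and the simpler engineered features listed above, here is a minimal sketch. The two word lists are hypothetical stand-ins: the actual interaction words from [6] and the seven trigger words provided by the biologists are not reproduced here. The shortest-dependency-path feature is left out because it requires a dependency parser.

```python
from collections import Counter

# Hypothetical word lists for illustration only.
INTERACTION_WORDS = {"activates", "inhibits", "binds"}
TRIGGER_WORDS = {"associated", "causes", "mutation"}

def bow_features(tokens):
    """Bag-of-words: map each token to its frequency in the sentence."""
    return dict(Counter(tokens))

def engineered_features(tokens):
    """A subset of the engineered features, computed from the token list."""
    token_set = set(tokens)
    return {
        "num_tokens": len(tokens),                                  # integer
        "has_interaction_word": bool(token_set & INTERACTION_WORDS),  # boolean
        "has_trigger_word": bool(token_set & TRIGGER_WORDS),          # boolean
    }

tokens = ["mutation", "PROT", "gene", "associated", "PHENO", "mutation"]
bow = bow_features(tokens)
eng = engineered_features(tokens)
```

In a full pipeline, these dictionaries would be merged and converted into fixed-length vectors before training.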

DS Features.
We obtained the DS features by utilizing (1) the full set of co-mentions (i.e. unlabeled) available in ProPheno, and (2) the annotations available in the HPO database, which we call the silver-standard (SS). These features are listed in detail as follows: (1) Number of co-mentions containing the protein name (integer). We normalize the number of co-mentions containing a protein name, a phenotype name, or a protein-phenotype pair by dividing its frequency by the number of unique articles that contain that specific protein, phenotype, or pair, respectively. We also propagate the HPO annotations upward toward the root nodes using the true path rule, which states that if an HPO term is annotated with a specific protein, all of its ancestors are also annotated with that protein.
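The normalization and the true path rule described above can be sketched as below. The function names, the term identifiers, and the protein identifier are placeholders chosen for illustration; they do not correspond to real HPO terms or to the authors' implementation.

```python
def normalized_frequency(num_comentions, num_unique_articles):
    """Divide a co-mention count by the number of unique articles containing the item."""
    if num_unique_articles <= 0:
        return 0.0
    return num_comentions / num_unique_articles

def propagate_annotations(annotations, parents):
    """Apply the true path rule.

    annotations: dict mapping a protein to the set of terms it is annotated with.
    parents: dict mapping a term to the set of its direct is-a parents.
    Returns a dict in which every protein is also annotated with all ancestors.
    """
    propagated = {}
    for protein, terms in annotations.items():
        closed = set()
        stack = list(terms)
        while stack:
            term = stack.pop()
            if term in closed:
                continue  # already visited via another path in the hierarchy
            closed.add(term)
            stack.extend(parents.get(term, ()))
        propagated[protein] = closed
    return propagated

# Toy hierarchy with placeholder identifiers: TERM_C is-a TERM_B is-a TERM_A.
parents = {"TERM_C": {"TERM_B"}, "TERM_B": {"TERM_A"}, "TERM_A": set()}
annotations = {"PROT_X": {"TERM_C"}}
propagated = propagate_annotations(annotations, parents)
ratio = normalized_frequency(6, 4)
```

Because a term may have several parents, the traversal tracks visited terms so each ancestor is added only once.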

Experimental Setup
The scikit-learn package is used for implementing the classifiers. We normalize the feature vectors using the L2 norm. In a preliminary analysis, we compared various supervised learning algorithms, namely SVM, Naïve Bayes, Decision Trees, K-Nearest Neighbors (KNN), and Gradient Boosting Trees (GBT), using their default parameter settings. We select SVM with a linear kernel for the rest of our experiments. We perform 10 repetitions of 5-fold cross-validation for evaluating the models. Performance is reported primarily using F-max (the optimal F-1 value). Precision and recall at F-max are presented as well. Precision is the fraction of true positives over all predicted positives, whereas recall is the fraction of true positives over all actual positives. F-1 is the harmonic mean of precision and recall.
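As an illustration of the evaluation measures above, the following sketch computes L2 normalization and F-max in plain Python. It assumes that F-max is obtained by sweeping a decision threshold over the classifier's confidence scores and keeping the highest F-1, which is the usual reading of "the optimal F-1 value"; the function names and toy scores are hypothetical.

```python
import math

def l2_normalize(vector):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def f_max(y_true, scores):
    """Return (F-max, precision at F-max, recall at F-max).

    A co-mention is predicted positive when its score >= threshold;
    every distinct score is tried as a threshold.
    """
    best = (0.0, 0.0, 0.0)
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
        fn = sum((not p) and t == 1 for p, t in zip(predicted, y_true))
        if tp == 0:
            continue  # F-1 is zero or undefined without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, precision, recall)
    return best

# Toy labels and scores invented for illustration (1 = good, 0 = bad).
y_true = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
best_f1, best_precision, best_recall = f_max(y_true, scores)
unit = l2_normalize([3.0, 4.0])
```

In a cross-validation setting, this would be computed on each held-out fold and then averaged across folds and repetitions.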
We compare PPPred with three baselines: (1) a strict rule-based method (rule-based 1), (2) a lenient rule-based method (rule-based 2), and (3) GenePheno [14]. The rule-based 1 method was developed in-house by a biologist using broad domain knowledge of the language used when describing alterations in protein sequence, activity, and regulation, and the resulting phenotypic changes. Commonly used words for sequence-based alterations included "mutation", "deletion" and "insertion". For protein expression changes, the phrases "upregulation", "upregulates", "downregulation", "downregulates", "over-expression", "under-expression", "switches off", "switches on", "amplifies" and "enhances" were chosen. For direct protein-phenotype relationship descriptions, the phrases "associated with", "triggered by" and "caused by" were used. This method assigns a score of 1 to co-mentions satisfying at least one of the following rules (and 0 otherwise):
• PROT (upregulation / downregulation / over-expression / under-expression / mutation) causes / does not cause / is (not) associated with PHENO
• some other entity (upregulates / downregulates / silences / inhibits / switches off / switches on / triggers / activates / amplifies / over-expresses / under-expresses / enhances) PROT causing / which causes / which is associated with PHENO
• PHENO is (not) associated with / triggered / caused by (upregulation / downregulation / mutation / deletion / insertion) in PROT
• Mutation / deletion / insertion in PROT causes / is associated with PHENO
The rule-based 2 method is more lenient than the rule-based 1 method because it only checks whether any of the keywords used in the rule-based 1 method appears in the sentence. In other words, this method assigns a score to a co-mention based on the keyword(s) present in the sentence. The order and the position of the keywords (with respect to the PROT and PHENO entities) are not considered.
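A minimal sketch of the lenient keyword baseline (rule-based 2) is given below. The keyword list is taken from the phrases quoted above; the exact scoring scheme is not specified in the text, so counting the number of distinct keywords found is an assumption, as are the function name and the example sentences.

```python
# Keywords quoted in the description of the rule-based 1 method.
KEYWORDS = [
    "mutation", "deletion", "insertion",
    "upregulation", "upregulates", "downregulation", "downregulates",
    "over-expression", "under-expression",
    "switches off", "switches on", "amplifies", "enhances",
    "associated with", "triggered by", "caused by",
]

def lenient_rule_score(sentence):
    """Score a co-mention by the number of distinct keywords in the sentence.

    Assumption: the score is a simple keyword count. Keyword order and
    position relative to PROT and PHENO are ignored, as in rule-based 2.
    """
    text = sentence.lower()
    return sum(1 for keyword in KEYWORDS if keyword in text)

# Invented example sentences with masked entities.
score_hit = lenient_rule_score("A mutation in PROT is associated with PHENO.")
score_miss = lenient_rule_score("PROT was measured in patients with PHENO.")
```

Since the match is a plain substring test, inflected forms such as "mutations" also count as hits; a stricter variant could match on token boundaries instead.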
GenePheno [14] is an ontology-based text mining method for predicting gene-phenotype associations using literature. While acknowledging that this is not an apples-to-apples comparison, we adapt it as a baseline as follows. For each co-mention in our gold-standard, if the corresponding protein-phenotype pair exists in the pre-generated GenePheno output file, we consider it a positive prediction (otherwise negative). We use the NPMI (Normalized Pointwise Mutual Information) scores provided by GenePheno for each co-mention as the confidence scores for the predictions. Note that, because the GenePheno method may have had access to some or all of the co-mentions in our test set, the performance we report is likely an over-estimation.
Table 5 shows the F-max, precision at F-max, and recall at F-max values of various supervised learning algorithms. We observe that the Linear SVM and Gradient Boosting Trees algorithms achieve the best F-max value (0.8). The Decision Trees, Naïve Bayes, and K-Nearest Neighbors algorithms provide F-max values of 0.78, 0.79, and 0.78, respectively. Comparing the precision values, we found that Linear SVM and Gradient Boosting Trees also provide higher precision. Since Linear SVM is one of the top models among all the models we compared, we use it for the rest of our experiments.
Table 6 compares the results of PPPred against the two rule-based methods and GenePheno. We observe that rule-based 2 and GenePheno obtain similar values for precision, recall, and F-max, whereas Linear SVM produces a higher F-max value. Linear SVM also achieves a higher precision value than the rule-based 2 and GenePheno methods. Due to the lack of confidence scores for the rule-based 1 method, we report the F1-score instead of F-max. We performed a paired t-test to assess the significance of the difference between F-max values.
We observed that Linear SVM significantly outperforms the other methods, with a p-value of 4.3E-13.
Figure 5 compares the effectiveness of the various feature types on the sentences from abstracts, the sentences from full-text articles, and all sentences. The results suggest that we obtain better performance on co-mentions from sentences extracted from full-text articles than on those extracted from abstracts. In particular, the precision values for co-mentions extracted from full-text articles are higher than those obtained for abstracts. In other words, the co-mentions extracted from full-text articles could be a valuable source of information for relation extraction. The next observation is that BoW features often provide good performance in terms of precision, recall, and F-max, which indicates that they are essential for relation extraction. Engineered features provide higher precision than DS features, whereas the DS features achieve higher recall. This observation suggests that these two sets of features can be used as complementary features for relation extraction.

RESULTS AND DISCUSSION
We investigate whether the training set suitably represents the problem by employing the learning curve with training sizes 20%-90% of the data and predicting on the holdout 10% of the data. Figure 6 depicts the learning curve with the mentioned training sizes. The increasing value of F-max shows that the dataset is underrepresentative of the problem and we need more training data. Relatively low precision values observed using the Linear SVM algorithm suggest that we have many false positives. Therefore, to investigate the possible reasons for this observation, we picked the top five false positives (sentences which are predicted as "good" with the highest confidence scores by the model whereas their actual labels are "bad") of which the top two are shown in Table 7. We also picked the top five false negatives (co-mentions predicted as negatives with the lowest confidence scores, whereas their actual labels are positive) of which top two are shown in Table 8.

Table 7
Sentence | Protein | Phenotype
[...] [36]. | PSA | tumour
Our findings, which demonstrated prognostic value of p-eIF2 in PHENO, are partially consistent with this previous research, because PROT is also involved in the PERK-p-eIF2 signaling pathway and predicts better DFS in patients with breast cancer. | [...] | [...]

Table 8
Sentence | Protein | Phenotype
Pedigree analyses of five families in which a form of spinocerebellar PHENO (SCA1) is present have been used to obtain additional information on the location of PROT on chromosome 6. | SCA1 | ataxia
The ratio of free to total PROT may increase the specificity of single serum PSA evaluations without decreasing its sensitivity for the diagnosis of prostate PHENO. | PSA | cancer

By comparing the above sentences, we observe that the length of false negative and false positive sentences is similar and cannot be used as a criterion to differentiate between the co-mentions. Additionally, we observed that most of the phenotypes in the selected sentences are "cancer" or related to "cancer". Therefore, the type of entities does not fully distinguish between good and bad co-mentions and requires further investigation.

CONCLUSION AND FUTURE WORKS
In this study, we created a co-mention classifier/filter which is capable of distinguishing between good and bad co-mentions of proteins and phenotypes in sentences. We created a pipeline in which we preprocess manually annotated sentence-level co-mentions of proteins and phenotypes and train a model on a set of features extracted from the sentences, which allows us to classify sentences containing co-mentions of proteins and phenotypes. This classifier can be employed to perform relation extraction on protein and phenotype entities mentioned in biomedical literature. We observed that Linear SVM provides the best F-max score using five-fold cross-validation.
Nevertheless, there are still many avenues for further work in this area. We utilized syntactic features extracted from sentences; a potential direction for future work is to use more specific syntactic features, e.g. the shape of the dependency graph. We also plan to convert the problem into a multi-class classification over positive relations, negative relations, and no relations between entities, in order to extract more specific relations from biomedical literature. We also plan to apply deep learning and word embeddings to this dataset. We plan to incorporate section titles, e.g. Introduction, Conclusion, etc., in order to use only the more informative sentences. Finally, we plan to utilize features based on the soft similarity between sentences, and to expand the study to larger spans of text, i.e. paragraphs and documents.