Specific Peptides Predict Protein Classification

The methodology of Specific Peptides (SP) has been introduced within the context of enzymes. It is based on an unsupervised machine learning (ML) tool for motif extraction, followed by supervised annotation of the motifs. In the case of enzymes, the classifier is the Enzyme Classification (EC) number. Here we demonstrate that this method reaches a precision of 96.5% and a recall of 89.1% on presently available protein sequences. We also apply this method to two other protein families, GPCR and ZF, find their corresponding SPs, and provide the code for searching any protein sequence for its classification under any such family.

Using the Enzyme Classification (EC) nomenclature, enzymes are classified into seven classes, EC1 to EC7, and within each EC class they are grouped into a hierarchy of four levels. Some are classified just into the first level, numbered by the class, some at levels 2 or 3, but most at level 4, which is often associated with homologs of the same gene in different species. Proteins which have enzymatic regions belonging to two different EC classes were discarded from the training set.
Following [4] we restricted our MEX search to motifs of length ≥7 amino acids. Details of our analysis procedure are explained in the Methods section. Our procedure leads to a set of 286,755 specific peptides which we label as ESPs. They are provided as a JSON list in our GitHub entry [9], which also includes the code for searching a protein for the occurrence of such ESPs.
In order to test the usefulness of ESPs in predicting the EC labelling of a protein, we ran it on the test sets Ptest and Ntest. An SP hit on Ptest is regarded as true positive (TP) if the Swissprot EC assignment of the enzyme appears on the EC tree of the SP. If no SP hits an enzyme, it is labelled as false negative (FN). If an SP hits a protein in Ntest, the latter is declared as false positive (FP). If no SP hits a protein in Ntest, it is regarded as true negative (TN).
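The counting rules above can be illustrated with a short Python sketch. It is not the code of [9]: the function names and toy sequences are hypothetical, and "appears on the EC tree of the SP" is read here as "one EC number is a prefix of the other", which is an assumption. The text does not say how to score an enzyme that is hit only by SPs of a different EC branch, so the sketch counts such cases separately as "mismatch". Precision and recall use the standard definitions TP/(TP+FP) and TP/(TP+FN).

```python
def ec_consistent(sp_ec, protein_ec):
    """Assumed reading of 'on the EC tree': one EC number is a prefix of the other."""
    a, b = sp_ec.split("."), protein_ec.split(".")
    n = min(len(a), len(b))
    return a[:n] == b[:n]


def evaluate(ptest, ntest, sp_to_ec):
    """Count TP/FN on enzymes (ptest) and FP/TN on non-enzymes (ntest)."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0, "mismatch": 0}
    for seq, ec in ptest:
        # An SP hits a protein if it occurs in full on its sequence.
        hits = [sp_ec for sp, sp_ec in sp_to_ec.items() if sp in seq]
        if not hits:
            counts["FN"] += 1
        elif any(ec_consistent(h, ec) for h in hits):
            counts["TP"] += 1
        else:
            counts["mismatch"] += 1  # case not specified in the text
    for seq in ntest:
        hit = any(sp in seq for sp in sp_to_ec)
        counts["FP" if hit else "TN"] += 1
    return counts


def precision_recall(counts):
    precision = counts["TP"] / (counts["TP"] + counts["FP"])
    recall = counts["TP"] / (counts["TP"] + counts["FN"])
    return precision, recall


# Toy data (made up for illustration only).
sp_to_ec = {"ACDEFGH": "1.1.1", "KLMNPQR": "2.7"}
ptest = [
    ("MMACDEFGHMM", "1.1.1.1"),
    ("MMKLMNPQRMM", "2.7.11.1"),
    ("MMMMMMMMMM", "3.4.21.4"),
    ("WWKLMNPQRWW", "4.2.1.1"),
]
ntest = ["WWWWWWWWWW", "WWACDEFGHWW"]

counts = evaluate(ptest, ntest, sp_to_ec)
precision, recall = precision_recall(counts)
print(counts, round(precision, 3), round(recall, 3))
```

On real data the naive substring scan would be slow for 286,755 peptides; an indexed search (for example a hash of all length-7 windows) would be a natural replacement, but that choice is outside what the text specifies.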
The results are presented in Table 1.

G protein-coupled receptors (GPCRs) play dominant roles in olfaction, vision and many other cellular functions.
Olfactory Receptors (OR) were studied in [5] using motifs of length ≥5 derived by the MEX methodology. The authors of [5] demonstrated how the resulting motifs can be employed to sketch an evolutionary tree of species, and provided a web service for OR protein assignment on the basis of these motifs.
We extend our analysis to all Swissprot GPCRs. After motif extraction we start with human GPCRs, and exhibit the specificity of all motifs of length ≥ 7 to either OR proteins, or to non-OR (NOR) proteins within all GPCRs. There exist 156 OR motifs of length ≥ 7 with hits on the 469 human ORs, and 2896 NOR motifs hitting the 148 human NOR proteins. There is no overlap between these lists, i.e. they are specific to either OR or NOR proteins.
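The specificity check described here, that no motif hits both groups, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the toy sequences are hypothetical, and a "hit" is taken to mean that the motif occurs in full on the sequence.

```python
def split_specific_motifs(motifs, group_a, group_b):
    """Partition motifs into those hitting only group_a, only group_b, or both."""
    a_only, b_only, shared = [], [], []
    for motif in motifs:
        in_a = any(motif in seq for seq in group_a)
        in_b = any(motif in seq for seq in group_b)
        if in_a and in_b:
            shared.append(motif)   # not specific to either group
        elif in_a:
            a_only.append(motif)
        elif in_b:
            b_only.append(motif)
    return a_only, b_only, shared


# Toy sequences standing in for OR and NOR proteins (made up).
or_seqs = ["MMACDEFGHMM", "GGACDEFGHGG", "GGSTVWYACGG"]
nor_seqs = ["MMKLMNPQRMM", "STVWYACGG"]
motifs = ["ACDEFGH", "KLMNPQR", "STVWYAC"]

or_only, nor_only, shared = split_specific_motifs(motifs, or_seqs, nor_seqs)
print(or_only, nor_only, shared)
```

An empty `shared` list corresponds to the situation reported in the text, where the OR and NOR motif lists do not overlap.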
While the number of human OR proteins (469) is larger than that of the NOR proteins (148), the number of NOR motifs is much larger (2896 vs 156 of length ≥ 7) because of the large variety of modalities which are served by NOR proteins. Turning to all Swissprot GPCRs, for all organisms, we expect to find a clear separation between OR and NOR, as well as to discover a very large number of protein-SP biclusters in the NOR GPCRs. We find 562 OR proteins, leading to 367 OR motifs with length ≥ 7, and 2481 NOR proteins with 3710 corresponding motifs. Once again, the two sets of motifs are specific to the two sets of proteins. We then proceed to search for protein-SP biclustering of the NOR data. The large number of NOR proteins and motifs allows for their sorting into clusters, as listed in Table 2. Both OR and NOR SPs will be referred to as GSPs. Their lists are provided as JSON files in our GitHub entry [9]. There exist 20 other sporadic hits of NOR GSPs on EC proteins, which are consistent with expected noise.

Zinc Finger proteins
We have analyzed 2582 Swissprot ZF proteins and extracted 1487 motifs of length ≥7 which are declared to be ZSPs. Of these proteins, 786 are human ZF proteins, and they display hits by 1412 of the SPs.
Since ZF proteins may contain several ZF domains, we encounter reappearance of motifs at different locations within the same protein. This is different from our previous studies of EC and GPCR proteins, where inter-protein multiple appearances were responsible for the generation of MEX motifs. Here we find that intra-protein recurrences play an important role.
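Intra-protein recurrence can be quantified by counting every occurrence of a motif within one sequence, rather than only recording whether the motif is present. The following is a minimal sketch; the function name, the protein IDs and the toy sequences are hypothetical, and overlapping occurrences are counted, which is an assumption about how hits are tallied.

```python
def count_occurrences(sequence, motif):
    """Count all (possibly overlapping) occurrences of motif in sequence."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are also found.
        start = sequence.find(motif, start + 1)
    return count


motif = "HTGEKPY"  # toy 7-mer used only for illustration
proteins = {
    "ZF_A": "MS" + motif + "KCEECGKAFSRSSHLTRHQRI" + motif + "ECNQ",
    "ZF_B": "MA" + motif + "QCDV",
}

hits_per_protein = {pid: count_occurrences(seq, motif) for pid, seq in proteins.items()}
print(hits_per_protein)
```

Here `ZF_A` contributes an intra-protein recurrence (two hits on one chain), while the appearance of the same motif on both `ZF_A` and `ZF_B` is an inter-protein recurrence.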
To illustrate this fact, we display in Table 4 some of the "popular" ZSPs, which have 100 or more hits on all human ZF proteins. On the right of the table we provide the sum of all shown hits for each protein as displayed here, and compare it to the total number of ZSPs hitting each protein. On the bottom we provide analogous counts for each ZSP.
It should be realized that SPs of length n can be contained within SPs of length >n, as can be seen in this table, which serves as an example rather than a summary. A summary of all ZSPs and their hits on ZF proteins is provided in our GitHub entry [9].
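Containment of shorter SPs within longer ones can be detected directly from the SP list. The sketch below is illustrative only: the function name and the toy peptides are hypothetical, and the quadratic scan is chosen for clarity rather than speed.

```python
def nested_pairs(sps):
    """Return (shorter, longer) pairs where the shorter SP occurs inside the longer one."""
    # Sort by length, then alphabetically, so the output order is deterministic.
    ordered = sorted(set(sps), key=lambda s: (len(s), s))
    pairs = []
    for i, short in enumerate(ordered):
        for longer in ordered[i + 1:]:
            if len(longer) > len(short) and short in longer:
                pairs.append((short, longer))
    return pairs


sps = ["HTGEKPY", "IHTGEKPYK", "CGKAFSQ"]  # toy peptides, made up
pairs = nested_pairs(sps)
print(pairs)
```

When hits are summed per protein, such nested pairs mean that a single stretch of sequence can be counted more than once.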

Summary and Discussion
Our methodology is based on machine learning (ML) practices: MEX is an unsupervised tool for motif extraction; these motifs are then searched on protein sequences using supervised annotation to classify the results. In the case of enzymes, the classifier is the Enzyme Classification which is defined in terms of seven classes and four levels in each class.
ESPs are specific peptides whose presence on the amino acid sequence of the protein indicates its EC number, as well as the tree associated with it. This methodology was introduced in 2007 [1]. Other ML studies appeared in the meantime, trying to solve the same (or related) problems using various ML tools. Many neglected to notice that SPs can do the required EC prediction quite well, often even better than the new tools.
Some examples of recent ML methodologies are DeepEC [10] and MAHOMES [11]. DeepEC applies 3 deep convolutional neural networks and a homology analysis tool to the study of enzyme sequences. When applying it to a test set of 201 enzymes, its authors obtained precision = 0.92 and recall = 0.455 (quoted from Table 2 in [10]). This is considerably worse than our results in Table 1, which were based on a much larger (25K) test set. The five other ML methods to which they [10] compared themselves performed even worse.
MAHOMES [11] uses a decision-tree ML model, which is structure-based, employing physicochemical features specific to catalytic activity. Their main aim is to classify metals bound to proteins as enzymatic or non-enzymatic, and they succeed in doing so with precision = 0.922 and recall = 0.901. Comparing to sequence-based technologies, they find that DeepEC scores on their tasks are precision = 0.905 and recall = 0.596. They find that another homology method, EFICAz2.5 [12] (which lost to DeepEC according to [10]), had better statistics (precision = 0.922 and recall = 0.901) but still falls short of their own [11]. For an older review of ML studies of enzymes see [13].
Our precision/recall results attest to the usefulness of the MEX unsupervised methodology in discovering relevant and unique motifs, the specific peptides (SPs). Our approach is not limited to enzyme studies. We have demonstrated this flexibility by investigating GPCR and Zinc-finger proteins, leading to a wealth of novel SPs. We provide in [9] documented Python code which allows for SP searches of all the functionalities which we have studied. It contains the lists of 2,002 NOR GSPs, 351 OR GSPs and 1,482 ZSPs in addition to the 286,755 ESPs.

Building the list of ESPs
In order to run the Motif Extraction program (MEX) [6], we divided the enzymes training set into batches grouped by joint level 2 assignments, and batches of enzymes with single level 1 assignments. Following [4] we restricted our MEX search to motifs of length ≥7 amino acids. The analysis led to 307,989 motifs. All motifs were then annotated after collecting the IDs of the enzymes hit by a particular motif (i.e. the motif occurs in full on the amino acid chain of the enzyme) and the number of times a particular enzyme was hit by a particular motif.
The EC number description, indicating both class and level, can be viewed as an inverted tree with a maximum depth of 4. For every motif, we map the EC numbers of the enzymes it hits on the training set onto a single EC tree. Starting from level 4 and moving upwards, we search for the first level which is a unique descendant of a higher level. The EC number of this unique descendant is assigned to the motif.
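One way to read this assignment rule is as finding the deepest node of the EC tree under which all enzymes hit by the motif fall, that is, the longest common prefix of their EC numbers. The sketch below implements that reading. The function name is hypothetical, the interpretation is an assumption rather than a confirmed description of the code in [9], and partially specified EC numbers are assumed to be written without trailing dashes (e.g. "1.1" rather than "1.1.-.-").

```python
def assign_ec(ec_numbers):
    """Return the deepest EC prefix shared by all given EC numbers, or None."""
    split = [ec.split(".") for ec in ec_numbers]
    common = []
    # zip stops at the shallowest annotation, so the result never goes deeper
    # than the least specific EC number in the input.
    for level_values in zip(*split):
        if len(set(level_values)) != 1:
            break  # more than one node at this level: stop at the level above
        common.append(level_values[0])
    return ".".join(common) if common else None


print(assign_ec(["1.1.1.1", "1.1.1.2"]))   # shared down to level 3
print(assign_ec(["2.7.11.1"]))             # single enzyme: full level-4 number
print(assign_ec(["1.1.1.1", "2.7.11.1"]))  # different classes: no assignment
```

A motif for which this returns `None` hits enzymes of more than one EC class and would not qualify as specific.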
In order to remove motifs which may occur also on non-enzymatic proteins, we search for hits of all motifs on the non-enzymatic Ntrain set. Such motifs are removed from the list of specific peptides. Thus, to summarize, a motif of length ≥ 7 amino acids is labeled as an Enzyme Specific Peptide (ESP) if:
- it hits (i.e. appears in full on the amino acid chain of) enzymes belonging to only a single EC class of Ptrain, and
- it does not hit any protein in Ntrain.
This procedure leads to the reduction of the set of motifs to 286,755 specific peptides which we label as ESPs. They are provided as a JSON list in our GitHub entry [9], which also includes the code for searching the sequence of a protein for the occurrence of such ESPs on its amino acid chain.
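The two ESP conditions can be expressed as a simple filter. This is a minimal sketch with hypothetical names and made-up toy sequences, not the pipeline of [9]; it assumes Ptrain is available as (sequence, EC number) pairs and Ntrain as a list of sequences.

```python
def is_esp(motif, ptrain, ntrain, min_len=7):
    """Check the ESP conditions: length, single EC class in Ptrain, no hit in Ntrain."""
    if len(motif) < min_len:
        return False
    # EC classes (first field of the EC number) of all enzymes hit by the motif.
    classes = {ec.split(".")[0] for seq, ec in ptrain if motif in seq}
    if len(classes) != 1:
        return False  # no hit at all, or hits in more than one EC class
    # Reject motifs that also occur on non-enzymatic proteins.
    return not any(motif in seq for seq in ntrain)


# Toy training sets (made up for illustration).
ptrain = [
    ("MMACDEFGHMM", "1.1.1.1"),
    ("GGACDEFGHGG", "1.2.3.4"),
    ("MMKLMNPQRMM", "2.7.11.1"),
    ("GGKLMNPQRGG", "3.4.21.4"),
    ("MMSTVWYACMM", "4.2.1.1"),
]
ntrain = ["WWSTVWYACWW"]

candidates = ["ACDEFGH", "KLMNPQR", "STVWYAC", "ACDEFG"]
esps = [m for m in candidates if is_esp(m, ptrain, ntrain)]
print(esps)
```

In the toy data, "KLMNPQR" is rejected because it hits two EC classes, "STVWYAC" because it also hits Ntrain, and "ACDEFG" because it is shorter than 7 amino acids.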