Data-driven extraction of human kinase-substrate relationships from omics datasets

Phosphorylation forms an important part of the signalling system that cells use for decision making and regulation of processes such as celll division and differentiation. To date, a large portion of identified phosphosites are not known to be targeted by any kinase. At the same time around 30% of kinases have no known target. This knowledge gap stresses the need to make large scale, data-driven computational predictions. In this paper, we have created a machine learning-based model to derive a probabilistic kinase-substrate network from omics datasets. We show that our methodology displays improved performance compared to other state of the art kinase-substrate predictions, and provides predictions for more kinases than most of them. Importantly, it better captures new experimentally-identified kinase-substrate relationships. It can therefore allow the improved prioritisation of kinase-substrate pairs for illuminating the dark human cell signalling space.


Introduction
Cells relay information through intricate signalling networks 1 that are typically regulated by post translational modifications (PTM), with phosphorylation being the best studied one among these 2 .
Phosphorylation of proteins is catalysed by kinases that target a specific set of substrates, changing their state and/or function. The importance of understanding kinase regulatory networks is highlighted by the fact that a large portion of targeted therapies that are currently used and being developed target kinases 3 . Despite this, to date only a small portion (~5%) of the more than 100,000 known phosphosites have a known upstream kinase 4 . At the same time 150 kinases have no known substrate and 90% of the phosphosites are assigned to the 20% of the best studied kinases 4 . Given that several studies have demonstrated that the understudied kinases can be just as important for health as the well-studied ones 4 , this points to a bias in the literature, where researchers prioritise well-studied proteins. In addition, publicly available databases, such as KEGG 5 , Reactome 6 , Omnipath 7 and others, that describe our current knowledge of cell signalling pathways, present a static view that represents the 'average' cell and doesn't capture condition-specific signalling networks. As a result these literature-defined static signalling pathways have a limited explanatory value when it comes to the analysis of phosphoproteomics data.
Mass spectrometry-based phosphoproteomics data sets present relatively unbiased views of a cell's signalling state and could be used for extracting unbiased signalling networks from specific experimental conditions 8,9 . However, most well-performing methods that have been developed to this end, base their predictions on prior networks 8 in the form of known pathways mentioned above thus perpetuating existing literature biases. For example, PropheticGranger, one of the top scoring methods in the HPN-DREAM challenge 8 , applies heat diffusion coupled with L1 penalized regression 10 on a network based on the Pathway Commons database 11 . There is thus a need for a more unbiased prior network to be used in methods for network inference of signalling networks from phosphoproteomics data.
In previous work, we used a machine learning approach, combining predictors such as cophosphorylation, co-expression and kinase specificity models to derive such a data-driven kinasekinase regulatory network 12 . However the resulting network provided predictions only at the protein level, whereas it is known that proteins have multiple functional phosphosites, often with entirely different or even opposing functions. E.g. phosphorylation of Y530 found on the Src kinase leads to its inhibition while dephosphorylation of Y530 and phosphorylation of Y419 leads to its activation 13 . In addition, while kinase regulatory networks form the 'skeleton' of the phosphobased signalling networks, non kinase substrates are also critical e.g. in the case of adaptor proteins forming scaffolds to regulate signal propagation 14 , for phosphatases removing phosphosites to shut signals off, regulation of transcription factor translocation or activities 15 and others.
To address these points, in this work, we have extended this machine learning model introducing additional predictors to include non-kinase substrates and to provide predictions at the phosphosite level. We validate our resulting predictions with recent, independent, experimental kinase-substrate predictions and present a list of highly probable novel kinase-substrate relationships 16,17 . Our method, called SELPHI2.0, is able to make predictions for less studied kinases and substrates compared to those found in the literature and perform better than established kinase-substrate prediction methods [18][19][20][21][22] . Furthermore, we provide predictions of the sign of the kinase-substrate interactions. Our network's precision, finally, increases when fitted to experimental data, suggesting that it can be used as a prior in data-driven inference of cell signalling networks from phosphoproteomics data.

Generation of kinase-substrate probabilistic network
Information on known kinase-substrate relationships as well as a list of phosphosites and the amino acids sequences surrounding phosphosites was gathered from PhosphoSitePlus 23 Predictions were made between 368 kinases, for which we had previously generated Position weight matrices (PWMs) 12 , and 80,234 phosphosites found on 9,180 proteins that were listed in PhosphositesPlus 23 and had a functionality score 24 assigned to them.
As a positive training set we used 5,251 kinase-phosphosite relationships extracted from PhosphositePlus 23 (Downloaded 2. May 2021). As there are no databases with information on known negative kinase-substrate pairs and given the fact that biological networks tend to be sparse, a random sample of kinase-substrate relationships ten times as large as the positive set was used as a negative training set. For the prediction of the regulatory sign of kinase-substrate relationships we used 673 activating interactions and 497 inhibiting interactions extracted from the SIGNOR 38 database and focused only on phosphosites likely to lead to functional changes in the protein, i.e. those with a functional score >0.5 24 .
We created and acquired 49 predictors considering different features that could affect a kinasesubstrate relationship from co-regulation in high throughput datasets to match of a phosphosite to a kinase specificity model as described by position weight matrices 12 (Supplementary table 1).
In the case of kinase substrate prediction,to select the most useful predictors we evaluated the performance of different feature combinations using 100 training/testing datasets as described in For training the model we used the random forest algorithm 28 as implemented in the scikit-learn python library 29 was trained with the positive set and a random sample one hundred times.
Parameters were tuned in each run with grid parameter search. Different parameters were considered at each run which were listed in the section above and for each of the one hundred runs the best parameter sets were used (Supplementary Information). Each model was validated by using ten fold cross-validation. In order to balance the training and test set, a stratified K-fold split as implemented in the scikit-learn 29 python library was used to keep the portion between negatives and positives in each split the same. The average probability across the different outputs was then calculated to assign probabilities to kinase substrates.
To further evaluate the performance of our model on novel kinase-substrate relationships we used predictions from two recently published studies 16,17 and tested whether these relationships were assigned a higher probability by our method. To quantify the predictive power of our model in each case, the area under the ROC curve was calculated as implemented in the ROCR

Comparison with other methods
To compare SELPHI2.0 with other state-of-the-art peer reviewed methods we assessed how well our method predicted known annotated kinase-substrate relationships (see positive/negative training sets described above) and also novel kinase-substrate relationships supported by experimental procedures. These were derived from the recent work of Sugiyama and colleagues 23

Evaluation of model fit to phosphoproteomic data
In order to capture context specific signalling networks, we constructed a reference signalling network by linking the kinase-substrate predictions made in this work to a backbone of a probabilistic kinase kinase regulatory network that we had previously published 12 . A probability cutoff of 0.5 was applied on both networks to select high confidence edges.
High-throughput mass spectrometry-based phosphoproteomic data that had been compiled and analysed by Ochoa

Counting number of citations related to proteins
To count the number of citations related to kinases and their substrates we used the Entrez.elink() function from the Biopython module 32 . We searched for related articles in the Pubmed database.
Linked publications were retrieved from NCBI Entrez Gene 33 database and publications that mention more than ten kinases were filtered out.

Generation of a probabilistic human kinase-substrate regulatory network
To create a probabilistic network of kinase-substrate relationships we used the random forest algorithm to combine various predictive variables ranging from co-phosphorylation and coexpression in large datasets, to kinase specificities and features related to the functionality of the phosphosites as described in the Materials and methods section. By running the prediction algorithm 100 times we achieved an average AUC of 0.96 ( Figure 1A).  Figure 1B). These predictions include substrates that have not been mentioned in the literature before and have no known upstream kinase, establishing the value of this network as a prior to explore the less studied part of the phospho-signalling network.

SELPHI2.0 can be used to identify high confidence kinase-substrate relationships
To assess the ability of our method to predict novel kinase-substrates, we looked at how well we discerned between a set of experimental kinase-substrate predictions not hitherto found in the literature and the rest of the unsupported predictions. To this end, we used experimentally predicted kinase-substrate relationships from two recent papers 16,17 . In short, one publication introduces kinases to dephosphorylated peptides from cell lysis while the other data set predicts kinase-substrates based on changes in phosphorylation following kinase inhibition.
We found that the overlap between the three sets, the two experimental sets and the SELPHI2.0 predictions, was relatively low ( Figure 1C). Due to the difference between the two experimental methods we reasoned that kinase-substrate relationships identified by both studies should have higher levels of confidence. Our method assigned significantly higher probabilities to experimentally supported edges compared to the rest in the network ( Figure 1D). Furthermore, we found that kinase-substrate relationships supported by edges found in both data sets had an even higher probability assigned to them ( Figure 1D) compared to edges supported by either.
The complete list of kinase-substrate predictions with indicators can be found in Supplementary

Prediction of regulatory status of kinase-substrate relationships
To our knowledge, earlier kinase-substrate prediction methods have not predicted the sign of the relationships. In vivo, phosphorylation may lead to functional changes in the phosphorylated protein. We wanted to capture this behaviour of kinase-substrates by predicting the sign of their interactions (Materials and methods). For this prediction we selected highly functional (functional score > 0.5) phosphosites. We found that our classifier was able to discriminate between kinasesubstrate relationships that lead to activation or inhibition of the substrate with AUC of 0.83 from 10-fold cross-validation. Importantly, due to the fact that the SIGNOR training set had a large portion of the signed relationships between kinases we looked at how well the classifier discerned between kinase non-kinase-substrates and saw only a modest decrease with AUC of 0.80 from 10-fold cross-validation ( Figure 3). Extraction of dataset-specific networks from our prior network selects for known interactions We fitted our network to a set of mass spectrometry-based global phosphoproteomic data sets generated under different conditions. We used a compilation of mass spectrometry data sets compiled and reanalyzed downloaded from an earlier publication 34 . To fit our predictions to highthroughput data, the predicted kinase-substrate edges were combined with a kinase-kinase regulatory network that we had generated previously 12 , forming a network of kinase-kinase regulatory relationships with phosphosites as nodes without outgoing edges (Materials and Methods). In order to select high confidence edges, we used edge probability of 0.5 as a threshold for both the kinase-kinase regulatory network and the kinase-substrate predictions.
To fit the combined network to the high-throughput data sets, we used Prize collecting Steiner's forest as implemented in the R package PCSF 31 . We found that by optimizing the edge cost against the node prizes we were able to select for edges found in the literature. The  Our method, called SELPHI2.0, performs better than the five state-of-the-art methods tested [18][19][20][21][22] , not only on the benchmark set but importantly on entirely new experimentally supported data 17 . This is true despite including predictions for more kinases than most other methods except for GPSv5.0 20 (479), which however performs worse on the benchmark dataset and close to random in the prediction of new experimentally supported relationships (Fig. 2). This suggests that our network is a good starting point for prioritisation of new kinase-substrate relationships. When overlaying our predictions with the two experimentally supported datasets, the resulting 24 high confidence kinase-substrate relationships are supported by both in vitro and cell line-based experimental data, and our data-driven machine learning approach, giving them particularly high confidence. In addition, no other kinase-substrate prediction method, to our knowledge, provides signed relationships. We provide these for both other kinases and non kinase substrates, making the network suitable for use as a prior not only in standard network inference method, but also in studies of mathematical modelling of signal transduction.
Although we use kinase specificity in the form of PWMs as a predictor, which is a metric that inevitably is somewhat biassed by the literature, most predictors were based on unbiased, high throughput datasets including global phosphoproteomics datasets. This, in combination with the coverage of the dark space that our method affords, is a step towards reducing the bias in cell signalling studies. NetPhos 21 proved to be an exception, performing well both at discerning between known positives and negatives as well as experimentally validated edges. It should be kept in mind though that NetPhos only makes predictions for 17 kinases.
Our network provides much wider coverage of the kinase signalling space than current knowledge and most other available kinase-substrate prediction methods and is more accurate.
With the median citation number for both kinases and substrates in the network being more than 5 times less than the PhosphositePlus database, our network forms a springboard for the exploration of the dark human cell signalling space. Given the importance of a prior network in methods of signalling network inference, this network will significantly contribute to a better understanding of the role of the understudied space in the context of our current knowledge, and will allow methods to generate networks that more accurately reflect the data.
Accumulation of false positives is a persistent problem when kinase-substrates are predicted. This is partly due to the large fraction of the signalling network that is currently unknown or understudied but a fraction of these predictions can be assumed to be false positives, even though this can't possibly be confirmed. This is likely true for our method as well and users of the network should take this into consideration when interpreting the results of their analysis. Nevertheless, the vast understudied signalling space requires computational approaches for prioritisation of hypotheses and the performance of our network compared to the state of the art indicates that it provides a good starting point. As more high quality high throughput phosphoproteomic data sets become available our model can be further improved.
We expect that SELPHI2.0 will allow the improved prioritisation of kinase-substrate pairs for illuminating the dark human cell signalling space, both through smaller scale signalling studies and through acting as a relatively unbiased prior in signalling network inference approaches.