Systematic inference of regulation by protein kinases finds surprising level of transcription factor deactivation

Gene expression is controlled by pathways of regulatory factors often involving the activity of protein kinases on transcription factor proteins. Despite this well established mechanism, the number of well described pathways that include the regulatory role of protein kinases on transcription factors is surprisingly scarce in eukaryotes. To address this, PhosTF was developed to infer functional regulatory interactions and pathways in both simulated and real biological networks, based on linear cyclic causal models with latent variables. GeneNetWeaverPhos, an extension of GeneNetWeaver, was developed to allow the simulation of perturbations in known networks that included the activity of protein kinases and phosphatases on gene regulation. Over 2000 genome-wide gene expression profiles, where the loss or gain of regulatory genes could be observed to perturb gene regulation, were then used to infer the existence of regulatory interactions, and their mode of regulation in the budding yeast Saccharomyces cerevisiae. Despite the additional complexity, our inference performed comparably to the best methods that inferred transcription factor regulation assessed in the DREAM4 challenge on similar simulated networks. Inference on integrated genome-scale data sets for yeast identified ∼8800 protein kinase/phosphatase-transcription factor interactions and ∼6500 interactions among protein kinases and/or phosphatases. Both types of regulatory predictions captured statistically significant numbers of known interactions of their type. Surprisingly, kinases and phosphatases regulated transcription factors by a negative mode or regulation (deactivation) in over 70% of the predictions. Author summary In this work we addressed the challenging problem of inferring regulation by protein kinases and phosphatases via their activity on transcription factors. Although many protein kinase activity predictors have been developed for classes of protein kinases on specific amino acids within target sequences, our approach (PhosTF) provides predictions of regulatory activity for specific protein kinases and phosphatases on specific transcription factors. Our inference approach achieves this using the functional output observed in gene expression data of gene knock out stains, along with known transcription factor regulatory interactions. We formulated and tested a model for inference of regulation as well as a model for simulation of genes expression, transcription and translation. The simulation was used for in-silico validation of the inference method, which performed comparably to the best performers on simpler inference problem in the DREAM4 competition. The inference method was then applied to yeast expression data, with significant validation by known kinase/phosphatase interactions. Over 15300 novel regulatory interactions were predicted, suggesting that kinase activity provided a surprising level of repression of gene expression, either through the deactivation of activators or the activation of repressors.

regulatory interactions were predicted, suggesting that kinase activity provided a surprising level of repression of gene expression, either through the deactivation of activators or the activation of repressors.
level [4]. Inference of functional regulation can be achieved through the use of these diverse data 22 types as evidenced by a number of approaches used in the DREAM4 challenge [5]. 23 To this aim, methods for the inference of TF-based regulatory networks were evaluated in the 24 DREAM4 challenge [5]. To do this, mRNA concentrations were simulated with the software 25 GeneNetWeaver from a known network using differential equations describing mRNA and 26 protein concentration gradients [6]. In this way, all regulators can be deleted or overexpressed 27 (in silico) in turn and new steady-state mRNA output can be generated for each. However, 28 GeneNetWeaver does not take phosphorylation or other post-translational modification into 29 account. The focus of our approach was to extend the inference of TF-based regulatory networks 30 to include the activity of kinases and phosphatases, and to apply this method to the model 31 budding yeast Saccharomyces cerevisiae, which has been thoroughly mapped for protein 32 interactions. 33 The majority of efforts to infer regulatory networks have focused on TFs binding to promoter 34 regions of their target genes. These networks are often modeled as directed acyclic 35 graphs (DAG) of TF nodes interacting with nodes representing target genes. Applied in this 36 biological context, each node value represents a protein's concentration and each edge the direct 37 regulatory effect, or activity, from node to node. Modelling regulation as a DAG has limitations 38 on the extent to which graph edges represent causality and physical interaction, since gene 39 regulation is highly cyclic as target gene products are often regulators themselves. The linear 40 cyclic causal models with latent variables (LLC) approach was specifically designed to address 41 inference of causality in cyclic graphs and has been applied to infer TF regulatory networks [7]. 42 Methods have also been proposed that combine multiple likelihood functions for numerous 43 types of evidence [3]. In such cases, maximum likelihood ratios can be calculated for each 44 potential regulatory interaction (edge) by describing the likelihood ratios through factor graphs. 45 Inferring regulation from KPs is much more challenging since they do not regulate mRNA 46 production rates directly, but rather modulate protein activity of other potential regulators. 47 Although there have been many kinase prediction approaches (NetPhos,NetPhospan,48 PhosphoPredict) most have focused on the phosphorylation site prediction often without the 49 ability to identify which kinase was likely responsible for the phosphorylation. Recent studies 50 utilizing mass spectrometry based proteomics or phosphoproteomics have investigated the 51 activity of KPs in knockout studies [8] [9]. Though these studies endeavour to infer KP activity, 52 they do so relative to the effects on metabolic processes induced by environmental perturbations, 53 thus associating KPs to cellular responses rather than specific regulators. Approaches to infer 54 regulation between human protein kinase (PK-substrate regulatory interactions) has revealed 55 extensive circuits of kinases [10]. Their approach was based on an ensemble method combining 56 multiple kinase-substrate scores and included phosphosite data for specific kinases or kinase 57 families measured by phosphoproteomics, and gene expression data for quantifying 58 co-expression and co-regulation. The combination of multiple data sources was leveraged to 59 achieve high performance, but limits the method to organisms where extensive 60 kinase-phosphosite data is available.

61
In this paper we have developed PhosTF, which builds on the LLC method. PhosTF allows 62 for inference of indirect regulation caused by kinase and phosphatase activity on TFs. Although 63 KP regulation is difficult to infer from any single knockout, their activity can be inferred in

69
The inference method PhosTF was developed and tested on both simulated and experimental 70 data sets. Initially, small-scale simulated data sets were used to illustrate the inference 71 performance, where the regulatory interactions were known. Subsequently, PhosTF was tested 72 on a large compendium of experimental data collected for budding yeast, S. cerevisiae.

73
PhosTF was tested on defined regulatory networks and their simulated output in order to 74 allow for inference where the regulation was known and to allow for the assessment of 75 performance. Given the right parameters, and in particular the appropriate regularization 76 strength, PhosTF was able to correctly infer all edges on the vast majority of the small 5-10 node 77 networks tested. Initially, data was simulated by iterating the PhosTF model equations Eq (3), in 78 which case it was possible to also infer the exact edge weights (within rounding error), which 79 defined the regulatory effects of one gene on another. The way in which regulation is inferred is illustrated in the total effect graph shown in 91 Fig 2G, which contains the sum of all paths from one node to another [11]. If there is only a 92 single sample of each knockout experiment, then the total effect from node j to i is simply where superscript indicates the knockout. In this case the total effects from 94 node KP 1 and KP 2 have become indistinguishable (as seen in Fig 2D,  where LLC had an overall AUC=0.76 [11], and an AUC=0.83 was achieved for the best 120 performance on the original 100-node networks by team "ALF" [12].  Edge inference assessment on simulated networks (A) with complete knowledge of "true" interactions, and on yeast (B and C) where very limited examples were known. ROC curves illustrating edge inference performance pooled from all 25 simulated networks, each containing 100 nodes. Different types of regulation were assessed independently as indicated in the legend. Gene set 'V' refers to any gene. Area proportional diagrams for yeast inference (B). Performance on evaluation set of edges visualized for ∼80 substrates/KP. Fisher's exact test p-values represent the chance of observing the performance by random chance. Odds ratio for the number of shared GO Biological Process (BP) terms between interacting genes (C) based on the number of shared GO terms for inferred compared to uninferred edges. Dashed lines show the 95% confidence intervals of the odds ratios.
transcription factors is well studied in S. cerevisiae (>20,000 interactions), the number of known 126 protein kinase and phosphatase (KP) regulatory interactions on transcription factors, or between 127 KP proteins is very small (412 and 694 interactions respectively).

128
In an effort to focus on the inference of KP regulation, regulatory networks were inferred for 129 KP interactions only. For this reason, only weights for the known TF regulatory interactions  These results were compared to predictions made by the protein sequence based substrate 141 prediction method NetPhorest [13]. NetPhorest included 33 protein kinases which could be  The regulatory interactions inferred by PhosTF were also evaluated relative to shared Gene 148 Ontology terms of regulator-target pairs. The odds ratio of having one or more shared GO slim 149 Biological Process terms (GO(BP)) for the source and target genes of an inferred edge terms. For example, the odds ratio increased to 2.05 for d(KP, TF) and 1.67 for d(KP, KP) edges 153 when source and target genes shared 4 or more GO terms. All odds ratio estimates were outside 154 the standard 95% confidence interval (CI), which implied a biological significance to both types 155 of predictions with respect to capturing regulatory relationships between proteins functioning 156 together in known biological pathways.

185
A number of node (vertex) sets were defined that represent the potential regulatory role of a 186 gene, or its ability to be modeled by this regulation (Table 1) The following are the difference equations used to model the node attributes as a function of 189 discreet time steps.
x i (t) represent the relative mRNA concentrations, specifically log 2 fold-change mRNA 191 concentration for a mutant relative to wildtype. Equivalently, y i represents the relative regulatory 192 activity of node i, and represents an extension of the Linear, Latent, Causal (LLC) model [11]. 193 This latent variable represents the effects of the phosphorylation state, but could in principle 194 represent any post translational modification. Since phosphorylation can either activate or 195 deactivate a regulator, it can be influenced by either kinases or phosphatases.

196
The edge value, w i j , defines the influence from node j to node i in a directed graph that may 197 have cycles. As in the previous work, self-loops are avoided by enforcing w ii = 0. a j (t) is a 198 function of x j (t) and y j (t), which has to be defined in a meaningful way to combine the node 199 concentration and activity attributes. e (x) i and e (y) i captures any latent concentration and activity 200 contributions not explicitly mediated by the nodes in the network. In this study, 201 a j (t) = x j (t) + y j (t) (see Derivation of Inference Model in S1 Text), and the equilibrium 202 equations simplify to the formulation in LLC if a j (t) = x j (t).

Intervention experiments 211
Intervention experiments are gene perturbations, where nodes in the model (representing genes and their gene products) are knocked out or over-expressed by changing their concentration parameter. In an experiment k, one or more nodes in J k (though typically one) are intervened on by setting where c k is a constant intervention, and this particular subscript notation indicates subsetting of perturbed nodes (only). The combined expression for both intervened and passively observed nodes becomes: x k are node values for experiment k which is defined by a set J k containing indexes of 212 intervened nodes. Multiple samples can be collected of each experiment k, however data is often 213 limited to a single sample. A knockout is not affected by transcription regulation so edges in 214 W TF onto nodes in J k are removed by U k = (u ki j ), which is a diagonal matrix with ones 215 indicating passively observed nodes, and zeros indicating intervened nodes (u kii = 0 for i ∈ J k ). 216 The intervention term, c k , contains zeros except for (c k ) {J k } which are set to the log fold-change 217 values measured in perturbation data. Again, e k represents noise and other latent effects.

218
Loss functions 219 W can be inferred by minimizing e k for all experiments simultaneously with L 2 regularization. 220 where column k in X is x k , column k in U is the diagonal of U k , and the norm is entrywise.

221
Minimizing SSE without regularization will result in a noisy solution. It is often corrected with 222 an L 1 penalty on the trainable weights, which are the edges in W. Regularization was performed 223 on B instead of W, as regularization of W caused unbalanced inference. B was defined by the 224 following: where W TF and W KP hold absolute elements of W TF and W KP . The solution was then formulated 226 as: where the norm is entry-wise. All results were found using AdamW gradient descent [14] and 228 regularization hyperparameter λ = 0.1.

229
Restricting the solution space 230 The solution space can be reduced with information about the proteins. M TF was used to 231 disallow certain TF edges, those which lacked evidence, by setting the appropriate elements of 232 M TF to zero. The same was applied for M KP , where rows were set to zero for regulators which 233 did not have evidence for at least one phosphorylation site.

234
If the mode of regulation is known for TF-target gene pairs, then the degrees of freedom for the solution can be reduced using Data for benchmarking network inference is simulated with differential equations describing concentrations of mRNA r i , protein p i and activated protein ψ i in a cell. (1) in S1 Text is a nonlinear function modelling transcription regulation taking into

Modeling of transcription regulation 245
For the simulations performed by GeneNetWeaverPhos, the proportion of maximum transcription is used to model the regulation of a gene. For gene i, the function f i (ψ) uses the amount of activated regulators to estimate this proportion given the regulatory inputs to gene i defined in the network. The way in which information from multiple regulators was integrated is described in detail in Modeling of transcription regulation in S1 Text. Regulator concentrations for each regulatory module were combined using a generalization of the Hill equation. In the special case of a module with a single regulator, the expression for µ m from Eq (1) in S1 Text simplifies to a standard Hill equation for either an activator Eq (9a) or repressor Eq (9b).
Here, ψ is the concentration of the transcriptional regulator which is able to bind the DNA, k is a 246 dissociation constant, and ν is a parameter that shapes how binding sites respond to regulator 247 saturation.

248
Network construction for simulation 249 Five adjacency matrices from DREAM4, were each used 5 times to create 25 random adjacency 250 matrices each with 100 nodes (see Generation of random adjacency matrices given to 251 GeneNetWeaverPhos in S1 Text). Fully defined networks were then randomly generated with 252 GeneNetWeaverPhos, which could subsequently be used to generate (simulated) log fold-change 253 values. In the random networks, protein kinases and phosphatases were encoded as KPs which 254 can both regulate positively and negatively.

255
Yeast data 256 Many types of data were collected for inferring a regulatory network for yeast and for evaluating 257 the performance of said inference, in an effort to validate PhosTF.  (Table 2).

261
"Tech." refers to the technology or type of experiment performed to obtain the relative the data set. Merged edge data was filtered by matching source and target nodes against mutated 279 and measured genes from the perturbation data. "Entries" shows the number of measurements, 280 and "Edges" is the number of edges after filtering by the sets KP, TF, and V (see Node Sets 281 in Network Construction). Predictions scores were collected from NetPhorest using the provided 282 reference yeast genome. The edge value "binding" refers to published binding evidence and    were curated for evaluation purposes. All GO resources were from AmiGO2 version column describes the interpretation of each GO term, and "Class" shows the filtered "GO class 306 (direct)". In the case of Biological Process GO terms, 100 different terms were possible to test. 307 All AmiGO2 annotation queries were filtered by organism "Saccharomyces cerevisiae S288C". 308 "Entries" shows the number of entries for each query and "Proteins" shows the number of 309 proteins with at least one entry.

310
Protein modes of regulation were classified according to curated GO evidence. GO evidence 311 based on computational predictions alone was not considered in assigning regulatory modes.

312
Low-throughput and direct experimental evidence was trusted over high-throughput and indirect 313 evidence. This curation resulted in 191 TF activators, 65 TF repressors, 146 protein kinases, and 314 51 protein phosphatases.

315
Yeast regulatory network construction 316 The data was processed to create an initial genome-scale regulatory network that was used for 317 KP regulation inference. Subsections are ordered chronologically.  represented the mapping between manipulated and measured genes.

332
Perturbation effects were enhanced (see Enhancing relative expression of genetically 333 perturbed genes in S1 Text). KOs genes were adjusted by −4, which results in an average KO 334 gene level ∼100 times less than wildtype, and similar to values stable in simulation. OE genes 335 were adjusted by +1, which results in an average expression ∼4 times wildtype levels.  The mode of regulation for each edge was either assigned from the above listed sources, or 350 inferred from the mode of regulation assigned to the TF. Edge weights were initialized as −1 or 351 +1 depending on mode of regulation. All other weights were set to zero and treated as invariant 352 during training. The variable (trainable) weights w i j for v i ∈ KPTF and v j ∈ KP were initialized from a normal 355 distribution with a small variance σ 2 = 10 −4 . Initial w i j for v i ∈ KP and v j ∈ KP were sampled 356 randomly, however w i j for v i ∈ TF were informed by Wilcoxon rank tests on KP perturbation 357 data. Absolute log fold-change values for each KP with knockout data were compared for each 358 TF with a one-sided Wilcoxon rank test. The tests compares absolute measurements from two 359 groups of genes; the TF regulon and remaining genes. Significant p-values from these tests 360 indicate which KPs have influence on TF regulons. Instead of random sampling from a normal 361 distribution, values were selected from a normal distribution for the quantile of the p-value. As a 362 result, smaller p-values corresponds to larger |w i j |. 363 sgn(w i j ) for KP j on TF i were initialized from equivalent two-sided Wilcoxon tests. edges. Performance could furthermore be assessed using GO process terms since such data was 389 also not used in the inference process.

390
The evaluation data set was the union of d(KP, KPTF) edge data sets excluding NetPhorest, 391 and filtered for substrates with a recorded phosphorylation site (see Node Sets in Network 392 Construction). From STRING, validation interactions were only included where evidence that a 393 kinase phosphorylated a target protein with a protein modification (PTMod) score > 250 were 394 included. Other than PTMod, all other lines of evidence from STRING were ignored. YeastKID 395 was filtered with threshold score > 4.52, corresponding to p < 0.05, which added 412 396 interactions to the validation set.

397
Top scoring pathways were collected using thresholds of ≥ 1 to ≥ 6 shared GO terms 398 between KP and TF from GO slim biological process terms ("Pathway" in Table 4 The prediction performance of PhosTF on simulated 100-node networks was either comparable 413 to or better than the performance from simpler simulation models applied in the DREAM Counts of inferred edges for each combination of regulation mode for d(PK, TF) or d(PP, TF) to either a transcriptional activator or repressor. Counts statistically larger than expected are marked with asterisk. The top scoring d(KP, TF) edges with shared OG terms are shown in (B). The number of GO terms shared between KP and TF shown next to each edge. Candidate edges have large absolute weights selected using various thresholds for the number of shared GO terms. Dotted lines indicates edges present in the evaluation set, i.e. known regulatory interactions (true positives). Boxes represent regulons for the TFs and are labeled with representative significant biological process GO terms or the process the TF is known to regulate. The size of the box represents the number of genes in the regulon.
performance to the best of DREAM4 should be viewed as advancing the state-of-the-art for this 418 more challenging inference problem.  It is tempting to assume this bias was due to systematic aspects of the gene expression data 448 for the various knock-outs used for inference, i.e. where differentially expressed genes were Naturally, deactivation of a node with activity 0 will not be detectable, and vice-versa for 481 activation of a node with the maximum activity 1. Some silent regulation is expected to be 482 present in the biological data as well. For example a knockout can affect multiple regulatory 483 pathways, which competitively regulate a shared target. The overall effects can cancel each 484 other out, leaving both pathways unobservable. Some steps were taken to avoid silent regulation 485 in the simulations, e.g. setting the magnitudes lower for edges sharing the same 486 target (see Generation of random adjacency matrices given to GeneNetWeaverPhos in S1 Text). 487 Silent regulation is not always avoided, which means some regulation cannot be inferred.

488
In addition, some regulatory effects can be silent due to compensatory pathways.

489
Compensating signal transduction cascades are difficult to infer from perturbation data if only a 490 single gene has been deleted from either cascade. The other cascade can compensate for the 491 missing regulation, resulting in minor or indistinguishable changes in the expression levels of 492 regulated targets. This limitation can be overcome with multiple knockouts in the same 493 experiment, however most of the collected data was from single gene knockouts.

494
Information attainable from genetic perturbation 495 The experimental conditions used for assessing the effects of genetic perturbations also has an 496 impact on inference. This is because certain regulatory signal cascades will only be activated 497 under specific environmental conditions or stresses, and as such will be difficult to detect from 498 mutants measured under nominal conditions. Measurements from different conditions have been 499 averaged, but many other approaches exist, such as averaging separately for each condition, or 500 keeping all original experiments. Many genes are essential for survival, requiring knockdowns 501 or conditional knockouts, further altering the experimental conditions. However, the data used 502 here were from different labs, and in some cases included only smaller subsets of genes, as 503 opposed to genome-wide profiles, which would result in a high number of missing values if not 504 averaged with other replicates.
PhosTF inference relies on information from the genetic perturbation data, which is only 506 indirect evidence of the protein regulatory interactions that are inferred. Ambiguous solutions 507 can arise even in simple network models using simulated perturbations. Fig 2 illustrates a case 508 where two kinases, KP 1 and KP 2 , have identical influence (Fig 2B), which makes it impossible 509 to infer if one regulates the other or if both regulate TF 1 , only from observing perturbation data. 510 When multiple solutions exists the most simple is inferred, where simplicity is rewarded through 511 regularization. Usually the L 1 norm is applied to induce sparsity among trainable weights, 512 however this assumes each weight can be treated equally. In LLC, influence from a node 513 propagates through the network as a series of multiplications of edge values w i j , see Eq (1b). It 514 usually holds |w i j | < 1, which helps convergence but also causes the decay of a signal along a 515 cascade. As a result, sparsity may suffer when multiple shorter cascades are preferred over 516 fewer, but longer, cascades.

517
To induce sparsity, edges can instead be penalized based on their final influence on gene 518 expression rather than on their influence on the next node in a cascade, which is accomplished 519 by regularizing B * instead of W in Eq (6). From testing on small simulated networks, it was 520 found that PhosTF performs much better when the regularization was applied to B * instead of W, 521 where regularization is typically performed. This is likely because L 1 regularization of W Although edges were inferred based on the score |w i j |, this score does not readily describe a 528 probability for the presence of a protein-protein interaction. A weak interaction could be 529 accurately inferred but discarded for its low |w i j |. Alternative scoring methods can be considered 530 to estimate the probability of edge presence, for instance running PhosTF multiple times and 531 taking the variation of w i j into account.

532
PhosTF can be combined with other methods applying evolutionary information, where 533 knowledge of one species is transferred to another related, but less studied species [35]. A cost 534 function for evolutionary distance can be incorporated into an ensemble method. Likewise, other 535 data types could be considered for training, e.g. other functional knowledge or using the 536 evaluation data set, which could improve the inference performance instead of simply evaluating 537 it.

538
The simulation method GeneNetWeaverPhos has potential future use in inference 539 approaches. Given enough computational resources, GeneNetWeaverPhos could be iteratively 540 run with variations to the network structure, to minimize the difference between simulated and 541 experimental gene expression levels. This can be accomplished with a Markov chain Monte 542 Carlo approach such as the Metropolis-Hastings algorithm.

543
Supporting information 544 S1 Text. Method details. GeneNetWeaverPhos simulation details and derivation of inference 545 model.

546
Data and code availability 547 All source code, data and results are available online at github.com/degnbol/PhosTF.