Abstract
Signaling pathways represent parts of the global biological network, which connects them into a seamless whole through complex direct and indirect (hidden) crosstalk whose structure can change during normal development or in pathological conditions such as cancer. Advanced methods for characterizing the structure of the global directed causal network can shed light on the mechanisms of global cell reprogramming that change the distribution of possible signaling flows. We suggest a methodology, called Googlomics, for the analysis of the structure of directed biological networks using spectral analysis of their Google matrix. This approach uses parallels with quantum scattering theory, developed for processes in nuclear and mesoscopic physics and quantum chaos. We introduce the notion of reduced Google matrix in the context of regulatory biological networks and demonstrate how its computation allows inferring hidden causal relations between the members of a signaling pathway or a functionally related group of genes. We investigate how the structure of hidden causal relations can be reprogrammed as the result of changes in the transcriptional network layer during cancerogenesis. The suggested Googlomics approach can be useful in various contexts for characterizing non-intuitive changes in the wiring of complex and large causal biological networks.
1 Introduction
In network biology, a signaling pathway is viewed as part of a complex integrated molecular machinery, i.e. as a subnetwork embedded into a global molecular network. As a consequence, all properties of the pathway functioning depend on the network context to which it remains connected. Considering only the set of direct causal relations between pathway members (as is frequently the case) neglects the indirect effect of changes in the global biological network, which may significantly re-wire the pathway topology by introducing implicit (hidden) causal relations. Characterizing such influence of the global network structure on the local network properties and dynamics remains one of the major challenges of systems and network biology [3]. A number of empirical and pragmatic approaches have been suggested recently to address this question [16,26,33,12,17].
Reconstructions of the global directed causal signaling network structure have appeared only recently in the form of comprehensive molecular interaction databases such as SIGNOR [42] and SignaLink [22], where the pair-wise relations between molecules are oriented. Large-scale cell type-specific reconstructions of transcriptional causal networks have become possible thanks to the appearance of ChIP-Seq technology [23,25] and to advances in the computational methodology of transcription factor binding site prediction combined with data on chromatin accessibility [39].
In the previous works, quantification of indirect interactions (sometimes called influences) between pathway members mainly exploited the calculation of shortest or second shortest paths (the paths that become shortest after removal of an edge in a shortest path), following quantification of the balance between negative and positive path signs [17,7]. The limitation of such approaches is, however, in that they do not take into account the complex global structure of the network: multiple dense causal connections between two nodes might be more important than a single shortest path connecting them, representing a hypothetical sequence of intermediate regulations.
Changes in the structure of the global network might effectively re-wire a signaling pathway even if its direct interactions are weakly dependent on the biological context. For example, the functioning of a signaling pathway must be heavily affected by the structure of transcriptional feedbacks, which indirectly re-wire the pathway structure through the gross effect of implicit (hidden) causal relations. Changing the transcriptional layer in the global network thus effectively rewires many signaling pathways even without affecting the structure of direct connections between their members. Therefore, it would be advantageous to develop an efficient and rigorous mathematical formalism allowing quantifying such a phenomenon. In this paper, we suggest a candidate methodology for this purpose.
We assess the characteristics of signal propagation through a pathway by considering the stochastic Markov process of random walk with uniform non-zero restart (teleportation) probability along oriented edges of the graph representing the global biological network. This process is described by the Google matrix (see Materials and Methods section), and its stationary state defines the PageRank centrality measure of the graph nodes. A more complete and subtle description of the process can be obtained by looking at the complete (complex) spectrum of the Google matrix, which might reflect complex non-stationary properties of the random walk. For example, grouped eigenvalues of the Google matrix in the complex plane can define weak communities in the graph where the signaling flow (random walk) can be “trapped” for a finite time. In order to quantify indirect hidden causal relations between the members of a pathway, we introduce the original formalism of reduced Google matrix, based on decomposing the global Google matrix into the parts describing the pathway itself and the influence of the rest of the network.
To our knowledge, the Google matrix approach or related ideas have been applied before only to undirected networks, in order to find activated network modules or to smooth high-throughput data [43,29,32], establish connections of genes to diseases [45,31], improve interpretability of genome-wide analyses [40,36,34] and compute network-based cancer biomarkers [46]. The formalism of reduced Google matrix is applied to biological networks in this paper for the first time.
We describe the details of the Google matrix and reduced Google matrix methodology in the “Methods” section, and the “Results” section documents the application of the methodology to several large regulatory networks. Using the suggested approach, we quantify the effect of the changes in the structure of the transcriptional network, resulting from oncogenic events during chronic myelogenous leukemia, on the rewiring of connections between proteins in several cancer-related groups of genes. We conclude that the method is able to infer the missing indirect causal relations between the members of a pathway and can detect events of hidden re-wiring during cancerogenesis.
2 Results
2.1 Used networks and case study description
In order to illustrate the application of the Google matrix approach to studying oncogenic changes in the global and local network structures, we constructed two large directed networks describing global signaling in a leukemia cancer cell line K562 compared to a healthy cell line GM12878 derived from normal B-lymphocytes. The transcriptional networks of these two cell lines have been previously characterized [25] using systematic ChIP-Seq experiments on a number of transcription factors whose activity is detected in a given cell line. Transcriptional networks for GM12878 and K562 cells have been previously analyzed in order to estimate their structural properties which can lead to buffering and robustness [2]. It was demonstrated that the wiring of the transcriptional network in cancer leads to significant changes in the number of structural patterns, which violate the network robustness properties. In order to deal with combined signaling+transcriptional networks, we merged each of the transcriptional networks with the global reconstruction of signaling taken from the SIGNOR database [42] (version from February 2016). Therefore, as a modeling assumption, we assume that the structure of the global signaling network does not depend on the biological context while the transcriptional regulation layer undergoes significant changes leading to an indirect effect on the signaling.
2.2 Biological interpretation of PageRank and CheiRank centrality measures and their changes in cancer
2.2.1 Distribution of proteins on PageRank vs CheiRank plane
For the three directed biological networks described above (SIGNOR alone and two merged signaling+transcriptional regulatory networks), we applied the Google matrix methodology as described in Materials and Methods section, and determined the values of PageRank and CheiRank for all proteins. The distribution of proteins in the PageRank vs CheiRank plane is shown in Figure 1. This figure shows that in the case of these networks the PageRank and CheiRank measures are not correlated, which reflects the quite distinct biological roles of proteins with many incoming and with many outgoing directed interactions.
It can be easily demonstrated (data not shown) that most of the proteins simultaneously having high values of PageRank and CheiRank (such as AKT1, NOTCH1, CTNNB1, TP53, CDKN1A, ATM, MAPK3, CDK1, EGFR) play an important role in cancer biology. This is, however, a rather trivial observation having in mind that it is known that hubs of the protein interaction networks frequently correspond to cancer-related genes [30,3].
One can also observe that adding a transcriptional network to a signaling network (SIGNOR) significantly changes the top ranked proteins for CheiRank but not for PageRank. This is consistent with the fact that the transcriptional networks are characterized by fan-like structures in which a transcription factor can regulate many (hundreds of) proteins, while the cases when a protein is regulated by so many upstream regulators are relatively rare.
Overall, the general shape of the distribution of proteins in the “normal” GM (Figure 1, middle) and “cancer” K562 (Figure 1, right) networks is similar; nevertheless, there are differences. For example, it can be seen that the top Chei-ranked proteins are not the same in these two networks. There is a region in the PageRank vs CheiRank plane for the GM network (Figure 1, middle) occupied by proteins which are present in SIGNOR but not in the GM12878 transcriptional network (cluster of green color points for CheiRank and PageRank around 1000). Vice versa, there is a region in the K562 network (Figure 1, right) occupied by proteins which are present in SIGNOR but not in the K562 transcriptional network (cluster of dark blue points). This observation underlines the fact that both the composition and the wiring topology of “normal” and “cancer” networks have important differences.
2.2.2 Detecting “creative protein elements” by comparing PageRank and CheiRank to simple connectivity
A general statement on PageRank and CheiRank is that they are correlated to in-degree and out-degree of a node [21]. However, this correlation is not perfect, as can be seen in Figure 2. Some proteins significantly deviate from the general dependence trend, so it is interesting to consider what kind of non-local network topologies give unexpectedly high PageRank or CheiRank values despite relatively low connectivity degree. This deviation can be scored by a simple product of the rank and the corresponding degree, i.e., CE_in = K · (InDegree + 1), CE_out = K* · (OutDegree + 1).
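The scores defined above are simple enough to compute by hand, but a short sketch makes the convention explicit. The function name is hypothetical, and the reading that small values flag candidates (high rank, i.e. small index K, at low degree) is our interpretation rather than something stated in the text. The two input pairs are the PageRank index and in-degree quoted in this section for the SIGNOR network.

```python
def creative_element_score(rank_index, degree):
    """CE = K * (degree + 1), with K the PageRank (or CheiRank) index."""
    return rank_index * (degree + 1)

# (PageRank index K, in-degree) as quoted in the text for the SIGNOR network.
proteins = {"PIK3CB": (1, 10), "TEK": (4, 5)}

# CE_in per protein; under our reading, smaller means a stronger deviation.
ce_in = {name: creative_element_score(k, d) for name, (k, d) in proteins.items()}
```

The same function gives CE_out when called with the CheiRank index K* and the out-degree.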
The notion of “creative elements” in the context of biological networks is discussed in [16] as such proteins that are not hubs of the networks themselves but can provide important (and, frequently, transient) connections between network hubs. Also, sometimes, together with hubs, proteins playing the role of “connectors” are discussed as those proteins having high centrality but not connectivity measures [3].
We suggest that the proteins significantly deviating from the general trend between a Google matrix-based rank and the corresponding connectivity degree are potential candidates for the role of “creative elements” and connectors in the signaling and transcriptional networks. Several examples of such proteins and the local topologies explaining the deviation from the trend are provided in Figure 3.
For example, the protein top-ranked by PageRank in all three networks is PIK3CB (see Figure 1), which has a relatively low in-degree (10) and therefore deviates significantly from the general trend (Figure 3A). The PIK3CB gene encodes a catalytic subunit of the kinase PI3K, a key kinase involved in multiple cell signaling cascades. The PIK3CB gene is frequently mutated or amplified in several cancer types (such as lung squamous cell carcinoma, where its rate of mutations can be as high as 18%), which causes abnormalities in cell survival signaling. From the network point of view, its high value of PageRank is probably explained by the fact that, according to SIGNOR, PIK3CB is regulated by several highly connected proteins which have predominantly incoming edges (PTEN, ERBB3, ERBB4, HRAS, NRAS, KRAS, IRS1).
Another example of such a protein is TEK, which is ranked #4 by PageRank in the SIGNOR network while having only 5 incoming edges. From Figure 3B one can see that TEK is the sink of the cascade TAL1→ANGPT2→TEK, which progressively collects incoming regulations, starting from the top connected hubs such as AKT1, MAPK3, PRKACA. The biological function of the TEK protein is quite unique: this is a receptor tyrosine kinase which has several immunoglobulin-like domains, three epidermal growth factor domains and three fibronectin type III repeats in the extracellular part. This makes this protein a potential regulator of multiple cellular functions such as angiogenesis, endothelial cell survival, proliferation, migration, adhesion and cell spreading, reorganization of the actin cytoskeleton, and also maintenance of vascular quiescence. Such rich functional cross-talk allows suggesting TEK as a creative element in the global cell signaling network.
Our final example of deviation from the major CheiRank vs out-degree trend is the TLR4 protein, which is ranked #22 by CheiRank in SIGNOR while having only 2 out-going regulations and 1 self-interaction. From the Google matrix-based network analysis this can be explained by the fact that TLR4 triggers several cascades affecting downstream several major regulators having a large number of out-going edges, such as AKT1. Indeed, the biological function of TLR4 (toll-like receptor 4) is activating the innate immune system, which requires triggering many important cellular cascades (such as NFkB signaling) regulating a large number of cellular processes.
Other observations from Figure 2 underline some particular features and differences between “normal” and “cancer” networks. For example, one can notice in Figure 3A, right that in the cancer network there is a large number of proteins with a high number of incoming transcriptional regulations but not ranked well by PageRank. This feature is almost absent in the normal GM-SIGNOR network (Figure 3A, middle).
2.2.3 Biological meaning of PageRank and CheiRank changes in cancer
Having seen the differences in the distribution of PageRank and CheiRank in Figure 1 between the “normal” and “cancer” regulatory networks, we characterized the relative change of the ranks by computing their log ratio between the two networks. Overall, the relative changes in CheiRank had larger amplitude than in PageRank, which can be partially explained by the different number of transcriptional targets of the transcription factors described in the GM12878 and K562 transcriptional networks.
We characterized the biological functions represented by those proteins significantly deviating from zero in Figure 4 by applying the standard enrichment analysis based on the hypergeometric test, which calculates the probability (p-value) that a selected protein set intersects with some predefined protein set by random chance. For this purpose we used the toppgene bioinformatics package [11]. For the enrichment analysis, we took those proteins deviating from zero by two standard deviations of the distributions of PageRank or CheiRank log ratios, and analyzed the positive and negative sides of the distribution separately.
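The selection and enrichment steps described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the toppgene implementation: the function names are hypothetical, the log is taken as the natural log of the cancer rank index over the normal one, the standard deviation is computed about the mean, and the p-value is the one-sided upper tail of the hypergeometric distribution.

```python
import math
from statistics import pstdev

def log_rank_ratios(rank_normal, rank_cancer):
    """Assumed convention: ln(K_cancer / K_normal) for proteins in both networks."""
    common = rank_normal.keys() & rank_cancer.keys()
    return {p: math.log(rank_cancer[p] / rank_normal[p]) for p in common}

def split_outliers(ratios, n_sd=2.0):
    """Proteins beyond n_sd standard deviations, positive and negative side."""
    sd = pstdev(ratios.values())
    up = {p for p, r in ratios.items() if r > n_sd * sd}
    down = {p for p, r in ratios.items() if r < -n_sd * sd}
    return up, down

def hypergeom_pvalue(n_universe, n_annotated, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected proteins are drawn at random."""
    total = math.comb(n_universe, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        math.comb(n_annotated, k)
        * math.comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy inputs (hypothetical, not data from the paper).
toy_ratios = log_rank_ratios({"X": 10, "Y": 5}, {"X": 20, "Y": 5})
p_toy = hypergeom_pvalue(10, 5, 5, 5)  # all 5 selected fall in the annotated set
```

With these conventions a positive log ratio means the rank index grew in the cancer network, which is why the two tails are returned separately.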
The results of the analysis are presented in Table 1 and online at http://www.ihes.fr/~zinovyev/googlomics/grn2016/rankchanges/.
We’ve noticed, however, that the results of this analysis can be biased by the simple fact that a protein can be included in the “normal” transcriptional network GM12878 (thus having, for example, many transcriptional out-going interactions) and not included at all in the “cancer” transcriptional network K562 (thus having no transcriptional out-going interactions at all). This is the case, for example, for the ZEB1 transcription factor. Although such a difference can be “real”, i.e. the transcription factor might not be expressed in the case of cancer and hence does not regulate any genes, we’ve decided to perform an additional analysis focusing only on those proteins which are simultaneously present in both “normal” and “cancer” transcriptional networks. These proteins might be present or not in the SIGNOR network. This second analysis made several biological functions detected in the previous analysis insignificant (normal text lines in Table 1) but many remained significant (bold text lines in Table 1). For the second analysis, in the case of CheiRank, we took those proteins which deviated from zero by one standard deviation in the distribution of the log ratio of the CheiRanks, in order to collect a sufficient number of proteins.
Overall, the undertaken analysis shows a picture consistent with the nature of the studied cells. That is, we show that changes in the CheiRanks between normal and cancer cells highlight a number of proteins previously described as being implicated in leukemia (16 from 53 selected for the second analysis). Interestingly, this analysis highlights proteins implicated in the regulation of myeloid cell differentiation and hemopoiesis, which is also expected. 9 from 30 selected proteins were previously associated with the mouse phenotype characterized by the increased number of lymphocytes and significantly improved their CheiRank in the cancer network (they became more powerful regulators). Having many transcription factors in the transcriptional networks explains the significance of such Gene Ontologies as “core promoter binding” and “chromatin” in the analysis of CheiRank changes in both directions, and also of interactions with key transcription co-regulators such as EP300, HDAC1 and CREBBP.
At the same time, the analysis also gives some unexpected findings. For example, a number of proteins involved in translation (14 from 86 selected), or having the E2F transcription factor binding motif in the promoter sequence (6 from 86), or being located in a specific genomic locus 22q11 (7 from 86) improved their PageRank in the cancer network (meaning they became more regulated). The fact that 8 from 63 genes with a specific motif in their promoter sequences (TMTCGCGANR) showed a significant increase in the PageRank (meaning they became less regulated in cancer) is also an unexpected finding.
2.3 Inferring hidden causal relations between members of protein sets
Application of the Google matrix to the global network allows quantifying the global ranking of protein nodes and their changes, as illustrated in the previous section. The reduced Google matrix (see Materials and Methods for a formal description) allows focusing on a subset of nodes, quantifying the local importance of nodes in this subset and also detecting indirect (hidden) connections between the members of the subset.
In order to test this approach in the context of biological networks, we’ve defined several protein sets, each of which contains a functionally related group of proteins. However, the meaning of the functional proximity is different in all cases.
We start with a definition of a biological pathway which plays one of the most central roles in all cancer types: the AKT-mTOR pathway. It is a molecular cascade downstream of PI3K kinase which is important in regulating the cell cycle and cell survival, controlling a number of normal physiological processes, and being dysregulated in many diseases including cancer. We take the definition of the AKT-mTOR pathway from the external pathway database Atlas of Cancer Signaling Network (ACSN) [35], based on manual mining of molecular biology publications, so the definition of this subset of proteins can be called “knowledge-driven”.
The second analyzed subset is by contrast purely “data-driven” and corresponds to a particular gene expression signature (set of genes), shown to be connected to cell proliferation in multiple cancer studies through data analysis. We used a particular definition of this signature coming from a large multi-cancer study of tumoral transcriptomes, using the Independent Component Analysis (ICA) method [48,6]. Gene expression signatures obtained through statistical data analysis sometimes serve as a scaffold for reconstructing the topology of regulatory connections between the corresponding proteins. Therefore, we considered this set for determining direct and indirect connections between its members.
The third analyzed group of proteins is a set of known direct targets of the transcription factor E2F1, which is central to regulation of and progression through the cell cycle. The member names of this set were manually extracted from the molecular biology literature on the functioning of the cell cycle in order to reconstruct it as a biochemical reaction diagram [9]. In this case, the challenge is to understand what biological pathways can be directly regulated by E2F1, what are the possible direct and indirect feedbacks to the regulation of E2F1 itself and how they are changing in cancer progression.
All three subsets are central to the cancer progression studied in this paper and its influence on the structure of biological networks. In all three cases, we roughly equilibrated the sizes of the protein sets, limiting them to approximately 50 proteins.
It happened that the direct interactions between the members of all three sets of proteins are described in the SIGNOR pathway database, and not in the transcriptional networks. Therefore, one might consider that the wiring of direct connections between the set members is not affected by the changes in the transcriptional program. However, we further show that the structure of indirect connections might change according to the changes in the global context created by the transcriptional network.
2.3.1 AKT-MTOR pathway
From the SIGNOR database, we’ve retrieved 138 direct regulatory connections between 63 proteins of the AKT-mTOR pathway. These direct connections formed a large connected subnetwork containing the majority (43 proteins) of the pathway members, and the rest were orphan nodes not connected to any other.
No direct transcriptional regulatory connections were found between the members of the pathway: hence, the structure of direct connections did not change in the “cancer” network with respect to the “normal” network.
We’ve computed indirect regulatory relations using the reduced Google matrix approach as described in Materials and Methods section, separately for SIGNOR and for the “normal” and “cancer” global regulatory networks, each of which combines the common signaling SIGNOR part with the specific transcriptional network. The strength of the indirect regulation can be evaluated by looking at its Gqr value.
For the “normal” network we found that the distribution of the corresponding values contains mostly values close to zero, with only some pointing to the existence of indirect regulation. Thus, for an arbitrarily chosen threshold Gqr > 0.01 one detects 50 indirect interactions, the top ten of which are shown in Figure 5,A (in magenta color). These 50 indirect regulations connect 8 more proteins into the large connected component.
It can be noticed that the pattern of the top indirect connections is highly non-random and forms two “hidden pathways”, one pointing to the CASP3 protein through BCL proteins, and one connecting PRKA proteins to AKT1. Both hidden pathways have a rather clear biological interpretation. The hidden regulations also point to the important crosstalk between BCL2 and MAPK1, MAPK3 proteins, not represented by direct interactions inside the AKT-mTOR pathway.
The first hidden pathway can be related to the existence of the apoptotic pathway in the global regulatory network, where BCL2 and BCL2L1 proteins play an important role. CASP3 serves as the final point of the apoptotic pathway, being the main executor protein of the apoptotic process (the executor caspases take care of destroying the proteins of the suicided cell and the cell itself). In order to illustrate how BCL proteins and CASP3 are connected through the global network, we’ve computed the shortest and the second shortest oriented paths between the BCL2 and BCL2L1 proteins and the CASP3 protein (Figure 5,B). It can be seen that these paths include the main players of the apoptotic machinery (XIAP, CASP9, DIABLO, CYCS). The second hidden pathway connects the subunits of AMP-activated protein kinase (AMPK), an important energy sensor protein, to AKT1. As a conclusion, one can state that the reduced Google matrix approach was able to point to biologically important and meaningful indirect connections between several AKT-mTOR pathway members.
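The path computation used for this illustration can be sketched with a plain breadth-first search. The sketch follows the definition given in the Introduction (second shortest paths are those that become shortest after removing one edge of a shortest path). The function names and the toy graph with nodes "a" to "e" are hypothetical and do not represent real regulations.

```python
from collections import deque

def shortest_path(edges, source, target):
    """Breadth-first search on a directed graph; returns a node list or None."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:  # walk back to the source
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in succ.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

def second_shortest_paths(edges, source, target):
    """Paths that become shortest after deleting one edge of the shortest path."""
    best = shortest_path(edges, source, target)
    if best is None:
        return []
    found = []
    for removed in zip(best, best[1:]):
        reduced = [e for e in edges if e != removed]
        alt = shortest_path(reduced, source, target)
        if alt is not None and alt not in found:
            found.append(alt)
    return found

# Toy oriented graph given as (source, target) tuples.
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "e"), ("e", "d")]
best = shortest_path(edges, "a", "d")
alternatives = second_shortest_paths(edges, "a", "d")
```

Only one shortest path is examined here; when several shortest paths of equal length exist, a fuller implementation would repeat the edge removal for each of them.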
In addition, we compared the inferred indirect interactions in the “normal” and “cancer” global networks. We found that there is a strong correlation between all three sets of values (correlation coefficients are close to 0.998). All strong indirect interactions inferred using the “normal” GM network were also found in the “cancer” K562 network. However, in the “cancer” network we found additional candidates for indirect interactions, the top ten of which are shown in Figure 5,A and Figure 5,C. It can be seen that such “emergent oncogenic” indirect interactions also underline the existence of a “hidden” causal relation between the RBX1 and MAPK1 proteins.
2.3.2 Data-driven signature of proliferation-related proteins
We analyzed a set of 49 proteins found in the SIGNOR database whose expression was shown to significantly change between fast proliferative and slow proliferative tumors in 9 cancer types [6]. We found 47 direct interactions connecting them into one large connected component consisting of 31 proteins which were predominantly the phosphorylation targets of the cyclin-dependent kinase CDK1, so the network of direct interactions has a star-like structure organized around one large hub protein. As in the previous example, no direct transcriptional connections were found between the members of this protein set.
We’ve computed indirect regulatory relations using the reduced Google matrix approach as described in Materials and Methods section, separately for SIGNOR and for the “normal” and “cancer” global regulatory networks, each of which combines the common signaling SIGNOR part with the specific transcriptional network. As before, we’ve found only a minor fraction of all pair-wise protein relations as candidates for indirect interactions (only 32 from 2305 passed the threshold Gqr > 0.01). In Figure 6 we show all indirect regulations inferred in the “normal” GM network. As before, with indirect connections it was possible to connect more proteins (43 out of 49). Thus, the most important indirect connection connects the PCNA protein to the largest connected component of the network. PCNA (proliferating cell nuclear antigen) is a key cell cycle protein important both for DNA replication and DNA repair.
While comparing “cancer” and “normal” networks, unlike the previous example, we do not find new “emergent oncogenic” indirect interactions. Instead, we observed that several indirect interactions disappear in the “cancer” network, namely three indirect regulations connecting the STIL protein to CCNA2, CCNE1 and CDK1. STIL is a cytoplasmic protein implicated in regulation of the mitotic spindle checkpoint, a regulatory pathway that monitors chromosome segregation during cell division to ensure the proper distribution of chromosomes to daughter cells. Interestingly, the STIL protein was shown to be heavily deregulated in T-cell leukemias through genome modifications leading to gene fusions. Disappearance of indirect connections between STIL and CDK1 can be interpreted as loosening the control over several important cyclins (CCNA2 and CCNE1) and the key cell cycle protein CDK1 in cancer. Consistently, we find that the PageRank of the aforementioned cyclins increases (e.g. for the local subnetwork PageRanks), which means that they are less regulated/controlled in cancer. Several other proteins such as AURKA, AURKB, CHEK1, BIRC5 and CDC25B decrease their local PageRanks in cancer, which means they become more controlled. Interestingly, one of the direct targets of CDK1 kinase, the protein BUB1, becomes a new hub of the indirect interactions. Indeed, this protein plays a central role in mitosis by phosphorylating members of the mitotic checkpoint complex and activating the spindle checkpoint.
2.3.3 Set of transcriptional targets of E2F1 transcription factor
In our last example, we use the reduced Google matrix method in order to better understand the structure of regulations among the previously known direct targets of a selected transcription factor, E2F1, a key transcription factor regulating cell cycle progression. For 76 such proteins found in the SIGNOR pathway database, we find 103 direct interactions connecting these proteins into a connected component comprising 49 proteins. One additional direct transcriptional regulation was found between MYC and CBX5 proteins, but only in the “cancer” K562 network.
The reduced Google matrix analysis revealed 84 indirect regulations for Gqr > 0.01 in the case of the “normal” network (Figure 7,A), all of which were also present in the “cancer” network. We did not find any additional indirect interactions from the analysis of the “cancer” network.
The majority of strong indirect interactions pointed to the 3 key apoptosis proteins CASP9, CASP3 and APAF1 (apoptotic protease activating factor), whose local PageRanks decreased in the “cancer” network (which means they become more regulated). As in the AKT-mTOR example, we find indirect regulations between BCL2, BCL2L1 and CASP3. However, unlike the AKT-mTOR example, we did not observe significant changes of local PageRanks of BCL2 and BCL2L1. Overall, the reduced Google matrix analysis underlines the existence of a hidden indirect apoptotic program regulated by E2F1 (which is a known fact [5]).
We also found that many weaker indirect interactions between the targets of E2F1 end up on E2F1 itself (Figure 7,B), also through a key G1/S cell cycle checkpoint protein CDKN1A (cyclin dependent kinase inhibitor 1A). Therefore, E2F1 itself can be regulated through a number of direct (3 in Figure 7,B) and even more indirect (4 in Figure 7,B) feedback regulations. This observation can provide hints on the principles of organization of the cell cycle transcriptional program.
3 Materials and Methods
3.1 Google matrix construction and properties
The Google matrix G of a directed network of N nodes is constructed from the adjacency matrix A_ij, which has elements 1 if a protein (node) j points to a protein (node) i and zero otherwise. Then the matrix elements of G take the standard form [8,37]

$$G_{ij} = \alpha S_{ij} + (1-\alpha)/N, \qquad (1)$$

where S is the matrix of Markov transitions with elements S_ij = A_ij/k_out(j), k_out(j) being the out-degree of node j (number of outgoing links), and with S_ij = 1/N if j has no outgoing links (dangling node). Here 0 < α < 1 is the damping factor, which for a random surfer determines the probability (1 − α) to jump to any node. The properties of the spectrum and eigenstates of G have been discussed in detail for Wikipedia and other directed networks (see e.g. [18]).
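As an illustration of this construction, the following minimal sketch builds a dense G for a toy three-node network in plain Python. The function name, the toy links and the value α = 0.85 are assumptions made for the example (0.85 is a conventional choice; the value is not fixed by the text above), and a dense matrix is only practical for small N.

```python
def google_matrix(n, links, alpha=0.85):
    """Return G as a list of rows; links are (source j, target i) pairs."""
    # Adjacency matrix: A[i][j] = 1 if node j points to node i.
    A = [[0.0] * n for _ in range(n)]
    for j, i in links:
        A[i][j] = 1.0
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        k_out = sum(A[i][j] for i in range(n))  # out-degree of node j
        for i in range(n):
            # Dangling node (no outgoing links): uniform column 1/N.
            s_ij = A[i][j] / k_out if k_out > 0 else 1.0 / n
            G[i][j] = alpha * s_ij + (1.0 - alpha) / n
    return G

# Toy network (hypothetical): 0 -> 1, 0 -> 2, 1 -> 2; node 2 is dangling.
G = google_matrix(3, [(0, 1), (0, 2), (1, 2)])
column_sums = [sum(G[i][j] for i in range(3)) for j in range(3)]
```

Each column of G sums to one, which is the column sum normalization that makes G the transition matrix of a Markov chain.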
The right eigenvectors ψ_i(j) of G are determined by the equation:

$$\sum_{j'} G_{jj'}\,\psi_i(j') = \lambda_i\,\psi_i(j). \qquad (2)$$
The PageRank eigenvector P(j) = ψ_{i=0}(j) corresponds to the largest eigenvalue λ_{i=0} = 1 [8,37]. It has positive elements which give the probability to find a random surfer on a given node in the stationary long time limit of the Markov process. All nodes can be ordered by monotonically decreasing probability P(K), with the highest probability at K = 1. The index K is the PageRank index. Left eigenvectors are biorthogonal to right eigenvectors of different eigenvalues. The left eigenvector for λ = 1 has identical (unit) entries due to the column sum normalization of G. One can show that the damping factor α in (1) only affects the PageRank vector (or other eigenvectors for λ = 1 of S in case of a degeneracy) while other eigenvectors are independent of α due to their orthogonality to the left unit eigenvector for λ = 1 [37]. Thus all eigenvalues, except λ = 1, are multiplied by a factor α when replacing S by G. In the following we use the notations ψ_L^T and ψ_R for left and right eigenvectors respectively (here T means vector or matrix transposition).
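The statement about the damping factor can be checked in a few steps. The following is a sketch, writing e for the column vector of ones (a notation introduced only for this derivation), so that the standard form of the Google matrix reads G = αS + (1 − α) e e^T / N, and taking ψ to be a right eigenvector of S with eigenvalue λ ≠ 1:

```latex
e^{T} S = e^{T}
\;\Rightarrow\;
\lambda\, e^{T}\psi = e^{T} S \psi = e^{T}\psi
\;\Rightarrow\;
e^{T}\psi = 0 \quad (\lambda \neq 1),
\qquad
G\psi = \alpha S\psi + \frac{1-\alpha}{N}\, e\,\bigl(e^{T}\psi\bigr) = \alpha\lambda\,\psi .
```

The first relation is the column sum normalization of S. The last one shows that ψ remains an eigenvector of G, unchanged by α, with eigenvalue αλ.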
In many real networks the number of nonzero elements in a column of S is significantly smaller than the whole matrix size N, which allows the PageRank vector to be found efficiently by the PageRank algorithm of power iterations [37]. A certain number of the largest eigenvalues (in modulus) and the related eigenvectors can also be computed efficiently by the Arnoldi algorithm (see [18] and references therein).
In addition to the matrix G it is useful to introduce a Google matrix G* constructed from the adjacency matrix of the same network but with inverted direction of all links. The statistical properties of the eigenvector P* of G* with the largest eigenvalue λ = 1 were first studied for the Linux Kernel network [13], showing that there are nontrivial correlations between the P and P* vectors of the network. More detailed studies have been done for Wikipedia and other networks [18]. The vector P*(K*) is called the CheiRank vector, and the index numbering nodes in order of monotonic decrease of probability P* is denoted as the CheiRank index K*. Thus, nodes with many ingoing links have small values of K = 1, 2, 3, … and nodes with many outgoing links have small values of K* = 1, 2, 3, … [37,18]. Examples of density distributions for Wikipedia editions and other directed networks are given in [18]. It is also useful to use the 2DRank index K2, which represents a certain combination of the K, K* indexes (K2 is the sequence of K, K* values appearing first on a sequence of squares which have K = K* = 1, with size increasing one by one up to the maximal value N; see details in [18]).
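The PageRank and CheiRank orderings described above can be illustrated with a short power-iteration sketch that works directly on adjacency lists, so that the dense matrix G is never formed. The helper names (`pagerank`, `invert`, `rank_order`), the stopping tolerance and the toy network are assumptions introduced for this example, and duplicated links are assumed to be absent.

```python
def pagerank(out_links, alpha=0.85, tol=1e-12, max_iter=10000):
    """Power iteration for the stationary vector of G using adjacency lists."""
    n = len(out_links)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        # Probability sitting on dangling nodes is redistributed uniformly.
        dangling = sum(p[j] for j in range(n) if not out_links[j])
        new_p = [(1.0 - alpha) / n + alpha * dangling / n] * n
        for j, targets in enumerate(out_links):
            for i in targets:
                new_p[i] += alpha * p[j] / len(targets)
        change = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if change < tol:
            break
    return p


def invert(out_links):
    """Adjacency lists of the same network with every link reversed."""
    inverted = [[] for _ in out_links]
    for j, targets in enumerate(out_links):
        for i in targets:
            inverted[i].append(j)
    return inverted


def rank_order(prob):
    """Nodes sorted by decreasing probability; position 0 holds rank 1."""
    return sorted(range(len(prob)), key=lambda node: -prob[node])


# Hypothetical toy network: 0->1, 0->2, 1->2, 2->0; node 3 is isolated.
out_links = [[1, 2], [2], [0], []]
P = pagerank(out_links)
P_star = pagerank(invert(out_links))
K_order = rank_order(P)            # PageRank ordering of nodes
K_star_order = rank_order(P_star)  # CheiRank ordering of nodes
```

In this toy example node 2, which receives two links, comes first in the PageRank ordering, while node 0, which emits two links, comes first in the CheiRank ordering.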
At α < 1 only the PageRank vector has λ = 1 while all other eigenvectors of G have |λ| ≤ α [37,18]. For Wikipedia it was shown that the eigenvectors with a large λ select some specific communities of the Wikipedia network [18]. However, a priori it is not possible to know the meaning of these communities. Thus other methods are required to determine the effective interactions between the Nr nodes of a specific subset (group) of a global network of large size N ≫ Nr.
In this work we apply the Google matrix analysis to the directed network of protein interactions from the cancer database SIGNOR [42] and to two hybrid networks, constructed by merging SIGNOR with two transcriptional networks measured in normal blood cells and in cancer (leukemia). The SIGNOR directed network contains N = 2432 proteins (nodes) with a total number of links Nl = 6569. In all our analysis we use the typical damping factor value α = 0.85 [37].
For the studied protein networks the dependencies of the PageRank and CheiRank probabilities on the rank indexes are shown in Figure 8. The decay of probabilities is approximately described by a power law P ∝ 1/K^β, P* ∝ 1/K*^β with the decay exponent β in the range 0.5 to 1. However, this is only an approximation for the whole curve. The distribution of nodes on the PageRank-CheiRank plane is shown in Figure 1. The spectra of G and G* are shown in Figure 9.
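One simple way to estimate such a decay exponent is a least-squares fit of log P against log K. The source does not state how β was obtained, so the sketch below is only an assumed procedure; the function name `decay_exponent` and the synthetic test data are introduced here for illustration.

```python
import math


def decay_exponent(prob):
    """Least-squares slope of log P versus log K, returned as beta in P ~ 1/K**beta.

    prob: strictly positive probabilities in any order.
    """
    ranked = sorted(prob, reverse=True)           # P(K) for K = 1, 2, ...
    xs = [math.log(k) for k in range(1, len(ranked) + 1)]
    ys = [math.log(p) for p in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Synthetic check: probabilities that follow an exact power law with beta = 0.7.
raw = [1.0 / k ** 0.7 for k in range(1, 1001)]
norm = sum(raw)
synthetic = [value / norm for value in raw]
beta = decay_exponent(synthetic)
```

Since the power law is only an approximation for the whole curve, a fit on real data would depend on the range of K used.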
3.2 Reduced Google matrix
Recently, the method of the reduced Google matrix has been proposed for the analysis of effective interactions between nodes of a selected subset embedded into a large size network [19]. This approach uses parallels with quantum scattering theory, developed for processes in nuclear and mesoscopic physics and quantum chaos.
It turns out that the reduced Google matrix GR describing the interactions inside a group of proteins is composed of three matrix components: a component Grr which describes the direct interactions between group members, a projector part Gpr which is mainly imposed by the PageRank of the selected proteins given by the global G matrix, and a component Gqr from hidden interactions between proteins which appear due to indirect links via the global network. Thus the reduced matrix GR = Grr + Gpr + Gqr provides precise information about the group of proteins, taking into account their environment given by the global network.
The concept of the reduced Google matrix GR was introduced in [19] on the basis of the following observation. At present directed networks of real systems can be very large (about 4.2 million nodes for the English Wikipedia edition in 2013 [18] or 3.5 billion web pages for a publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012 [38]). In certain cases one may be interested in the particular interactions among a small reduced subset of Nr nodes with Nr ≪ N instead of the interactions in the entire network. However, the interactions between these Nr nodes should be correctly determined, taking into account that there are many indirect links between the Nr nodes via all other Ns = N − Nr nodes of the network. This leads to the problem of the reduced Google matrix GR with Nr nodes, which describes the interactions of a subset of Nr nodes.
In a certain sense we can trace parallels with the problem of quantum scattering appearing in nuclear and mesoscopic physics [44,4,28] and quantum chaotic scattering [24]. Indeed, in the scattering problem there are effective interactions between open channels to localized basis states in a well confined scattering domain where a particle can spend a certain time before its escape to open channels. Having this analogy in mind, we construct the reduced Google matrix GR which describes the interactions between the selected Nr nodes and satisfies the standard requirements of the Google matrix.
Let G be a typical Google matrix of Perron-Frobenius type for a network with N nodes such that Gij ≥ 0 and the column sum normalization Σi Gij = 1 are verified. We consider a sub-network with Nr < N nodes, called the “reduced network”. In this case we can write G in a block form:
$$ G = \begin{pmatrix} G_{rr} & G_{rs} \\ G_{sr} & G_{ss} \end{pmatrix} \qquad (3) $$
where the index “r” refers to the nodes of the reduced network and “s” to the other Ns = N − Nr nodes, which form a complementary network that we will call the “scattering network”.
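We denote the PageRank vector of the full network as
$$ P = \begin{pmatrix} P_r \\ P_s \end{pmatrix} \qquad (4) $$
which satisfies the equation GP = P, or in other words P is the right eigenvector of G for the unit eigenvalue. This eigenvalue equation reads in block notations:
$$ (1 - G_{rr})\,P_r - G_{rs}\,P_s = 0 \qquad (5) $$
$$ -G_{sr}\,P_r + (1 - G_{ss})\,P_s = 0 \qquad (6) $$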
Here 1 is a unit diagonal matrix of corresponding size Nr or Ns. Assuming that the matrix 1 − Gss is not singular, i.e. all eigenvalues of Gss are strictly smaller than unity (in modulus), we obtain from (6) that
$$ P_s = (1 - G_{ss})^{-1} G_{sr}\,P_r \qquad (7) $$
which gives together with (5):
$$ G_R P_r = P_r, \qquad G_R = G_{rr} + G_{rs}(1 - G_{ss})^{-1} G_{sr} \qquad (8) $$
where the matrix GR of size Nr × Nr, defined for the reduced network, can be viewed as an effective reduced Google matrix. Here the contribution of Grr accounts for direct links in the reduced network and the second term with the matrix inverse corresponds to all contributions of indirect links of arbitrary order. We note that in mesoscopic scattering problems one typically uses an expression of the scattering matrix which has a similar structure, where the scattering channels correspond to the reduced network and the states inside the scattering domain to the scattering network [4].
The matrix elements of GR are non-negative since the matrix inverse in (8) can be expanded as:
$$ (1 - G_{ss})^{-1} = \sum_{l=0}^{\infty} G_{ss}^{\,l} \qquad (9) $$
In (9) the integer l represents the order of indirect links, i.e. the number of indirect links which are used to connect indirectly two nodes of the reduced network. The matrix inverse corresponds to an exact resummation of all orders of indirect links. According to (9) the matrix (1 − Gss)^{-1}, and therefore also GR, has non-negative matrix elements. It can be shown that GR also fulfills the condition of column sum normalization being unity [19].
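The definition of GR through the sum over all orders of indirect links can be checked numerically on a small example. The sketch below sums the series term by term for a hypothetical 6-node network and then looks at the column sums of GR and at its action on the reduced part Pr of the global PageRank vector. All function names, the toy link list and the choice of reduced nodes are assumptions made for illustration; dense Python lists are used only because the example is tiny.

```python
def google_matrix(links, n, alpha=0.85):
    """Dense column-stochastic Google matrix as a list of rows."""
    G = [[(1.0 - alpha) / n] * n for _ in range(n)]
    for j in range(n):
        # Out-links of node j; a dangling node is linked to every node.
        out = sorted({d for s, d in links if s == j}) or list(range(n))
        for i in out:
            G[i][j] += alpha / len(out)
    return G


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def block(G, rows, cols):
    return [[G[i][j] for j in cols] for i in rows]


def reduced_google_matrix(G, r_nodes, tol=1e-15, max_terms=100000):
    """G_R = G_rr + G_rs * sum_{l>=0} G_ss^l * G_sr, summed term by term."""
    s_nodes = [k for k in range(len(G)) if k not in r_nodes]
    Grr, Grs = block(G, r_nodes, r_nodes), block(G, r_nodes, s_nodes)
    Gsr, Gss = block(G, s_nodes, r_nodes), block(G, s_nodes, s_nodes)
    term = Gsr                        # l = 0 term: G_ss^0 G_sr
    total = [row[:] for row in Gsr]
    for _ in range(max_terms):
        term = matmul(Gss, term)      # one more step inside the "s" block
        total = [[t + x for t, x in zip(tr, xr)] for tr, xr in zip(total, term)]
        if max(max(row) for row in term) < tol:
            break
    indirect = matmul(Grs, total)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(Grr, indirect)]


def pagerank_dense(G, iters=2000):
    """Plain power iteration with the dense matrix G."""
    p = [1.0 / len(G)] * len(G)
    for _ in range(iters):
        p = [sum(g * x for g, x in zip(row, p)) for row in G]
    return p


# Hypothetical toy network; nodes 0, 1, 2 form the reduced set.
links = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 1), (4, 5), (5, 2), (2, 4)]
r_nodes = [0, 1, 2]
G = google_matrix(links, 6)
GR = reduced_google_matrix(G, r_nodes)
P = pagerank_dense(G)
Pr = [P[k] for k in r_nodes]
GR_Pr = [sum(g * x for g, x in zip(row, Pr)) for row in GR]
col_sums = [sum(GR[i][j] for i in range(3)) for j in range(3)]
```

The plain series converges at a rate set by the leading eigenvalue of Gss. For this toy example that eigenvalue is far from unity, so a few dozen terms are enough; when it is close to unity, as the text notes can happen, the convergence is slow.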
The results obtained in [19,20] show that the reduced Google matrix can be presented as a sum of three components
$$ G_R = G_{rr} + G_{pr} + G_{qr} \qquad (10) $$
with the first component Grr given by the direct matrix elements of G among the selected Nr nodes. The second, projector component Gpr is given by
$$ G_{pr} = G_{rs} P_c G_{sr}/(1-\lambda_c), \qquad P_c = \psi_R \psi_L^T \qquad (11) $$
Here λc is the leading eigenvalue of Gss and ψR (ψL) is the corresponding right (left) eigenvector such that Gss ψR = λc ψR (ψL^T Gss = λc ψL^T). Both left and right eigenvectors as well as λc can be efficiently computed by the power iteration method in a similar way as the standard PageRank method. We note that one can easily show that λc must be real and that both left/right eigenvectors can be chosen with positive elements. Concerning the normalization, for ψR we choose Es^T ψR = 1 and for ψL we choose ψL^T ψR = 1 (the vector Es has all elements equal to unity). It is well known (and easy to show) that ψL^T is orthogonal to all other right eigenvectors (and ψR is orthogonal to all other left eigenvectors) of Gss with eigenvalues different from λc. Here we introduce the operator Pc = ψR ψL^T, which is the projector onto the eigenspace of λc, and we denote by Qc = 1 − Pc the complementary projector. One verifies directly that both projectors commute with the matrix Gss and in particular Pc Gss = Gss Pc = λc Pc.
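We mention that this contribution is of the form $G_{pr} = \tilde\psi_R \tilde\psi_L^T/(1-\lambda_c)$, with $\tilde\psi_R = G_{rs}\psi_R$ and $\tilde\psi_L^T = \psi_L^T G_{sr}$ being two small vectors defined on the reduced space of dimension Nr. Therefore Gpr is indeed a (small) matrix of rank one, which is also confirmed by a numerical diagonalization of this matrix. The third component Gqr of indirect or hidden links is given by
$$ G_{qr} = G_{rs}\Big[ Q_c \sum_{l=0}^{\infty} \bar G_{ss}^{\,l} \Big] G_{sr}, \qquad \bar G_{ss} = Q_c G_{ss} Q_c \qquad (12) $$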
Even though the decomposition (10) is at first motivated by the numerical efficiency of evaluating the matrix inverse, it is equally important for the interpretation of the different terms. In particular the last contribution (12), which is typically rather small compared to (11), plays an important role, as we will see below.
Concerning the numerical algorithm to evaluate all contributions in (10), we mention that we first determine by the power iteration method the leading left eigenvector ψL and right eigenvector ψR of the matrix Gss. This also provides an accurate value of the corresponding eigenvalue λc, or better of 1 − λc (by taking the norm of the projection of GψR on the reduced space, which is highly accurate even for λc close to 1). These two vectors provide directly Gpr by (11) and allow the projector Qc to be applied numerically to an arbitrary vector (with ∼ N operations). The most expensive part is the evaluation of the last contribution according to (12). For this we apply successively Gss and Qc to an arbitrary column of Gsr, which can be done by a sparse matrix vector multiplication and the efficient application of the projector. We compute simultaneously the series in (12), which converges rather quickly after about 200 terms since the contribution of the leading eigenvalue (of Gss) has been taken out and the eigenvalues of the projected matrix Qc Gss Qc are roughly below the damping factor α = 0.85. In the end the resulting vector is multiplied by the matrix Grs, which provides one column of Gqr. This procedure has to be repeated for each of the Nr columns, but the number Nr is typically rather modest. We also note that the results obtained in [20] show that an approximate relation holds: where Σp is the PageRank probability of the global network concentrated on the subset of Nr selected nodes.
The results obtained here and in [20] for the Wikipedia network show that the contribution of Gpr is dominant in GR, but it is also rather trivial, with nearly identical columns. Therefore the two small contributions of Grr and Gqr are indeed very important for the interpretation even though they only contribute weakly to the overall column sum normalization.
The meaning of Grr is rather clear since it gives direct links between the selected nodes. In contrast, the meaning of Gqr is significantly more interesting since it generates indirect links between the Nr nodes due to their interactions with the global network environment. We note that Gqr is composed of two parts, Gqr = Gqrd + Gqrnd, where the first, diagonal term Gqrd represents a probability to stay on the same node during multiple iterations of the projected matrix Qc Gss Qc in (12), while the second, nondiagonal term Gqrnd represents indirect (hidden) links between the Nr nodes appearing via the global network. We note that in principle certain matrix elements of Gqr can be negative, which is possible due to negative terms in Qc = 1 − Pc appearing in (12). However, for all subsets considered in this work the total weight of negative elements was negligibly small (at most about 10^-3 of the total weight 1 for GR).
It is convenient to characterize the strength of the three components in (10) by their respective weights Wrr, Wpr, Wqr, given respectively by the sum of all matrix elements of Grr, Gpr, Gqr divided by Nr. By definition we have Wrr + Wpr + Wqr = 1. All numerical data of the reduced Google matrix of groups of proteins considered here are publicly available at the web site [49].
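The steps of this subsection (leading eigenvectors of Gss by power iteration, the rank-one projector part, the projected series for the hidden part, and the three weights) can be put together in a compact sketch. It is written under stated assumptions and is not the authors' implementation: all function names, the dense toy matrices, the fixed number of power iterations and the stopping tolerance are introduced here for illustration.

```python
def google_matrix(links, n, alpha=0.85):
    """Dense column-stochastic Google matrix as a list of rows."""
    G = [[(1.0 - alpha) / n] * n for _ in range(n)]
    for j in range(n):
        out = sorted({d for s, d in links if s == j}) or list(range(n))
        for i in out:
            G[i][j] += alpha / len(out)
    return G


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def leading_eigenpair(M, iters=1000):
    """Power iteration; returns (eigenvalue, eigenvector with unit sum)."""
    v = [1.0 / len(M)] * len(M)
    lam = 1.0
    for _ in range(iters):
        w = matvec(M, v)
        lam = sum(w)              # valid because sum(v) == 1
        v = [x / lam for x in w]
    return lam, v


def decompose(G, r_nodes, tol=1e-15, max_terms=10000):
    """Return (Grr, Gpr, Gqr) for the reduced node set r_nodes."""
    s_nodes = [k for k in range(len(G)) if k not in r_nodes]
    sub = lambda rows, cols: [[G[i][j] for j in cols] for i in rows]
    Grr, Grs = sub(r_nodes, r_nodes), sub(r_nodes, s_nodes)
    Gsr, Gss = sub(s_nodes, r_nodes), sub(s_nodes, s_nodes)

    lam_c, psi_R = leading_eigenpair(Gss)                       # sum(psi_R) = 1
    _, psi_L = leading_eigenpair([list(c) for c in zip(*Gss)])  # left vector
    scale = dot(psi_L, psi_R)
    psi_L = [x / scale for x in psi_L]                          # psi_L . psi_R = 1

    def Qc(v):
        """Complementary projector Q_c = 1 - psi_R psi_L^T applied to v."""
        c = dot(psi_L, v)
        return [x - c * y for x, y in zip(v, psi_R)]

    nr = len(r_nodes)
    Gpr = [[0.0] * nr for _ in range(nr)]
    Gqr = [[0.0] * nr for _ in range(nr)]
    Grs_psi_R = matvec(Grs, psi_R)
    for j in range(nr):
        col = [row[j] for row in Gsr]
        # Projector part: Grs psi_R (psi_L . col) / (1 - lambda_c).
        proj_coeff = dot(psi_L, col) / (1.0 - lam_c)
        # Hidden part: Grs applied to sum_l (Qc Gss Qc)^l Qc col.
        x = Qc(col)
        acc = x[:]
        for _ in range(max_terms):
            x = Qc(matvec(Gss, x))
            acc = [a + b for a, b in zip(acc, x)]
            if max(abs(b) for b in x) < tol:
                break
        hidden = matvec(Grs, acc)
        for i in range(nr):
            Gpr[i][j] = Grs_psi_R[i] * proj_coeff
            Gqr[i][j] = hidden[i]
    return Grr, Gpr, Gqr


def weight(M):
    """Sum of all matrix elements divided by the number of columns N_r."""
    return sum(map(sum, M)) / len(M)


# Hypothetical toy network; nodes 0, 1, 2 form the reduced set.
links = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 1), (4, 5), (5, 2), (2, 4)]
G = google_matrix(links, 6)
Grr, Gpr, Gqr = decompose(G, [0, 1, 2])
Wrr, Wpr, Wqr = weight(Grr), weight(Gpr), weight(Gqr)
```

For a real network the blocks would be stored as sparse matrices and each column of Gsr could be processed independently; the dense lists here only keep the toy example short.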
3.3 Global network reconstruction for GM12878 and K562 cell lines
The transcriptional networks for the normal GM12878 and cancer K562 cell lines were obtained from the web site http://encodenets.gersteinlab.org/ (files enets7.K562_proximal_filtered_network.txt, enets8.GM_proximal_filtered_network.txt) accompanying the original publication [25]. The SIGNOR network for H. sapiens was downloaded from the SIGNOR web site http://signor.uniroma2.it/downloads.php. Both the transcriptional and SIGNOR networks were represented as simple interaction format (SIF) files and merged by simple concatenation. They were further processed in Cytoscape [14] with the use of the BiNoM plugin [47,7] for finding shortest and second shortest paths, and copy-paste operations.
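As a rough illustration of merging two SIF files by concatenation, the sketch below reads SIF lines of the form "source interaction target [target ...]" from two in-memory streams and collects the edges. The function names, the placeholder node names and the removal of duplicated edges are assumptions made for this example; the source only states that the files were concatenated. Fields are split on any whitespace, which assumes that node names contain no spaces.

```python
import io


def read_sif(handle):
    """Yield (source, interaction, target) triples from a SIF stream."""
    for line in handle:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank line or isolated node: no edge to record
        source, interaction = fields[0], fields[1]
        for target in fields[2:]:  # SIF allows several targets per line
            yield source, interaction, target


def merge_sif(*handles):
    """Concatenate the edges of several SIF streams, dropping exact duplicates."""
    merged, seen = [], set()
    for handle in handles:
        for edge in read_sif(handle):
            if edge not in seen:
                seen.add(edge)
                merged.append(edge)
    return merged


# Hypothetical in-memory files with placeholder node names.
signaling = io.StringIO("A\tactivates\tB\nB\tinhibits\tC\n")
transcriptional = io.StringIO("C\tregulates\tA\tD\nA\tactivates\tB\n")
edges = merge_sif(signaling, transcriptional)
nodes = sorted({name for s, _, t in edges for name in (s, t)})
```

The resulting edge list gives the directed links from which an adjacency matrix of the merged network can be built.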
3.4 Definitions of functionally related groups of proteins
The composition of the AKT-mTOR pathway and the set of direct transcriptional targets of the E2F1 protein were downloaded from the Atlas of Cancer Signaling Network (ACSN) database [?], using GMT files of version 1.1 available from the ACSN web site http://acsn.curie.fr (E2F1 TARGETS and AKT-mTOR gene sets).
The group of proteins related to proliferation was determined as a set of 50 gene names top-contributing to the transcriptomic signature associated with the cell cycle through a large-scale pan-cancer analysis of transcriptomic data [6], using the lists provided in the Supplementary information.
4 Discussion
The results of the application of high-throughput technologies in modern molecular biology are more and more frequently presented in the form of complex networks, representing measured causal relations between biological molecules. For example, systematic application of ChIP-Seq technology for a significant number of transcription factors can result in the global cell line-specific reconstruction of the transcriptional network [25]. Despite many methods aimed at the analysis of complex networks, there is still a need for mathematically rigorous and computationally efficient methods able to quantify complex non-local network topologies, especially in the case of directed networks.
In this work we show that the global Google matrix and the reduced Google matrix approaches represent useful tools for the analysis of directed interaction networks in biology.
We show that the global analysis of a directed biological network using the Google matrix, by computing node PageRanks and CheiRanks and their relative changes in cancer, provides insight into how the biological network topology evolves in different biological contexts.
The reduced Google matrix approach is a novel method for quantifying indirect (hidden) connections between members of a specified subset of network nodes. These connections represent paths of oriented graph edges through the global network, involving nodes outside the specified set. This approach is applied to the global network of directed protein-protein interactions, with a focus on some groups of proteins corresponding to a well-defined biological function (cell survival signaling, cell proliferation), obtained by different methods (prior knowledge-based or data-driven approaches). We show that application of the reduced Google matrix approach leads to inferring a meaningful set of indirect interactions, highlighting the existence of specific biological programs not reflected in the structure of direct relations between the members of a protein set. We also show that the structure of such hidden relations can be modified from one condition to another, reflecting some global changes in the wiring of, for example, global transcriptional networks during cancer or differentiation.
There are multiple possible ways to exploit the methods suggested in this study. One promising application is the mathematical modeling of biological processes, e.g. of molecular pathways [10,27,15,41]. Construction of a mathematical model of a pathway usually starts with defining a restricted set of biological molecules or processes most closely related to the studied phenomenon (e.g., regulation of programmed cell death). In the current methodologies, the number of such model elements (proteins) cannot be very large. Therefore, there is always a danger of neglecting important indirect causal relations between the elements via regulations passing through the global network in which a given pathway is embedded. The reduced Google matrix method allows indirect regulations to be inferred systematically and in a context-specific manner, so that the results of high-throughput biotechnologies can be used in the analysis, as demonstrated in the current study.
Note that indirect regulations can involve too many proteins to be characterized by ad hoc methods such as counting the number of paths connecting two proteins. The suggested method takes into account the directed network in its whole complexity, without naive simplifying assumptions. Moreover, the method is computationally efficient for the typical sizes of biological networks, involving tens of thousands of nodes and hundreds of thousands of interactions.
The Google matrix, or Googlomics, methodology can also be used for other types of directed networks appearing in biology, such as state transition graphs resulting from the analysis of Boolean models of pathways [1].
Overall, the developed methodology allows combining global structural analysis of large biological networks characterized by context-specific and dynamical re-wiring together with the focused analysis of specified biological processes, without neglecting the role of the global context.
5 Acknowledgements
This research is supported in part by the MASTODONS-2016 CNRS project APLIGOOGLE (see http://www.quantware.ups-tlse.fr/APLIGOOGLE/).