bioRxiv
New Results

Active modules for multilayer weighted gene co-expression networks: a continuous optimization approach

Dong Li, Shan He
doi: https://doi.org/10.1101/056952
Dong Li
School of Computer Science, The University of Birmingham, UK Zhisong Pan, Guyu Hu
Shan He
School of Computer Science, The University of Birmingham, UK Zhisong Pan, Guyu Hu

Abstract

Motivation Searching for active connected subgraphs in biological networks has proven important for identifying functional modules. Most existing active module identification methods need both network structural information and gene activity measures, and therefore typically require a prior knowledge database as well as high-throughput data. As a purely data-driven gene network, a weighted gene co-expression network (WGCN) can be constructed from expression profiles alone, so searching for modules on a WGCN is potentially valuable. However, traditional clustering-based module detection on a WGCN covers all genes, which unavoidably introduces many uninformative ones when the modules are annotated. A more precise subset of genes is needed.

Results We propose a fine-grained method to identify active modules on multi-layer weighted gene co-expression networks, based on a continuous optimization approach (AMOUNTAIN). Multilayer networks are handled within the same unified framework, as a natural extension of the single-layer case. The effectiveness of the method is validated on both synthetic and real-world data, and the software is provided as a user-friendly R package.

Availability Available at https://github.com/fairmiracle/AMOUNTAIN

Contact s.he{at}cs.bham.ac.uk

Supplementary information Supplementary data are available at Bioinformatics online.

1 Introduction

It is well known that genes tend to take part in a biological process as a group rather than acting alone [3], so identifying a group of genes and associating it with certain biological functions is important. In this paper, we define a functional module in a biological network as a subnetwork whose members may share a common function in biological processes. A different but equally important concept is the topological module, also referred to as a community, within which interactions are much denser than those with the outside [15]. Topological modules have been studied intensively and modular structure is generally easy to detect, but it is functional modules that are of real interest [3].

Although a functional module may overlap with a topological module, network structural information alone is not enough to find functional modules. The topology of a biological network does not always precisely reflect the functional or even disease-determined regions [3], which are the real concern in biology. To bridge the gap between topological and functional modules, searching for active modules, i.e. connected regions of the molecular interaction network that show striking changes in molecular activity or phenotypic signatures associated with a given cellular response [28], has become a central challenge in systems biology. Active modules have been shown to reveal regulatory mechanisms [19] and are closely related to functional modules; for instance, an active module might connect multiple functional modules. The activities of network nodes are usually measured by high-throughput omics data. In recent years, many active module identification algorithms have been developed, and most of them follow a "skeleton network plus muscle" paradigm. Skeleton networks such as protein-protein interaction (PPI) networks or metabolic networks are constructed from prior knowledge databases [19, 27, 10]. However, compared with the rapidly growing amount of high-throughput omics data, the construction of reliable and complete skeleton networks, which relies heavily on experiments and human validation, is quite slow. For some non-model species, or even some new model species such as Daphnia, a PPI network does not exist at all. The lack of reliable skeleton networks poses a challenge to detecting active modules that reveal key mechanisms in biological systems.

In contrast, a gene co-expression network is a purely data-driven gene network that can be constructed from expression profiles alone. In such a network a node is a single gene and an edge is the correlation between a pair of genes, so a weighted gene co-expression network is a fully connected graph. Modules in such networks are also considered to participate in biological processes [44], and modules with significant biological meaning are essentially functional modules.

Module identification on gene networks is the first and crucial step of functional annotation analysis, yet it is an important but less studied topic. Traditional module detection on gene co-expression networks is based on gene clustering, i.e. grouping genes that are similar in terms of their correlations or edge weights into clusters that serve as modules [44]. This coarse-grained clustering technique covers essentially all genes in the network. As a result, the identified functional modules include genes that show very little activity, which may not be very informative for revealing biological mechanisms. We hypothesize that identifying active modules, which take gene activities in the co-expression network into account, yields more precise biological mechanisms. How to rigorously define active modules in such a weighted network is still an open problem, but the module itself should be more compact and informative than a random subnetwork or a cluster, from two perspectives: 1) from the topological view, active modules are supposed to have high module scores as measured by nodes and edges; 2) from the functional view, active modules are supposed to be more significantly associated with some biological process. Active module identification on gene co-expression networks is also a new problem, especially for multilayer gene networks. A better understanding of modules in such networks is expected to establish a pipeline from pervasive gene expression data to reasonable biological interpretation at the system level.

As a generalization of the single network, there are various reasons to model the interactions in living organisms as multilayer networks. Different layers may represent different time points, multiple conditions or various species. Modules across multiple layers may reveal properties of weighted gene co-expression networks such as time-invariant component genes, generally responsive functional modules, and biological processes conserved across species. Similar but more general structures are also called multiplex networks [29, 22]. An existing work [24] mined recurrent heavy subgraphs in multi-slice networks, where all networks share the same set of genes and have no interactions between them. Examples of conserved module identification include [48, 13], which are discussed later. Inspired by [24], we develop a unified optimization framework to identify active modules on multi-layer weighted gene co-expression networks (AMOUNTAIN), in which the layers can cover all three cases mentioned above.

2 Methods

In its earliest setting, active module identification was proposed to find significantly changed subnetworks in molecular interaction networks [19]. Most subsequent work developed methods based on the "skeleton+muscle" paradigm, where the "skeleton" is a basic molecular interaction network constructed from a prior knowledge database, and the "muscle" comes from widely available high-throughput data that measure gene activities. In general the skeleton biological network is represented as an undirected graph G = (V, E), where nodes in V represent components such as genes (or gene products), proteins or metabolites, and edges in E represent interactions between two nodes. Each node i is assigned a single score that denotes the activity of the corresponding component under a certain condition, such as the fold-change or p-value of the gene expression level. The simplified problem of finding the highest-scoring module in an unweighted network, in which the subnetwork score is the sum of the scores of its nodes, is formally defined as follows:

Problem 1

Given a graph $G = (V, E)$ with vertex weights $z \in \mathbb{R}^{n}$, one entry $z_v$ for each $v \in V$, find a connected subnetwork $S = (V_S, E_S)$ of $G$ with maximal weight $\sum_{v \in V_S} z_v$.

This combinatorial optimization problem is also called the Maximum-Weight Connected Subgraph Problem (MWCSP), which is equivalent to finding a maximum-weight clique in a weighted graph, a well-known NP-complete problem [20]. The proof is provided in the supplementary materials of Ideker et al. [19]. Metaheuristic algorithms are effective tools for combinatorial problems and have been widely applied to search for satisfactory solutions. The original paper [19] proposed simulated annealing, a generic probabilistic metaheuristic, to solve this problem. Other methods include extended simulated annealing [17], greedy algorithms [40, 41], a graph-based heuristic algorithm [34], a genetic algorithm [27] and exact approaches based on integer linear programming [33, 10, 46, 2].

2.1 Single-layer network

Compared with the rapidly growing amount of high-throughput data, the construction or confirmation of precise molecular interactions, which relies heavily on experiments and human validation, is quite slow. With gene expression data alone, we can build a gene co-expression network (GCN). We generalize the idea of active modules in Problem 1 to the GCN case and treat the node score as a gene importance criterion in the co-expression network, which can be evaluated by the change in expression level (such as fold change or other more comprehensive statistics) under certain conditions. The network structure is also determined by the gene expression data and is expressed as a weighted network. In practice this leads to a slightly different problem: find a subgraph of size k (otherwise the solution is the trivial one containing all nodes) that has both large node weights and large edge weights, i.e. closely connected members. It is formally defined as:

Problem 2

Given a complete graph $G = (V, E)$ with vertex weight $z_v \in \mathbb{R}$ for each $v \in V$ and edge weights $W = [w_{ij}]$ for each edge $(i, j)$, find a subgraph $T$ of size $k$ with large vertex weight $\sum_{i \in T} z_i$ and large edge weight $\sum_{i, j \in T} w_{ij}$.

Problem 2 is a simplified version of the (K1, K2)-Recurrent Heavy Subgraph (RHS) problem [24], with additional node scores. (K1, K2)-RHS considers multiple co-expression networks, which are discussed in detail in the next section. The module can be represented by a membership vector $x \in \{0,1\}^{n}$, where $x_i = 1$ means that gene $i$ belongs to the module. The optimization is then naturally expressed as

$$\max_{x} \; x^{T} W x + \lambda z^{T} x \quad \text{s.t.} \quad x \in \{0,1\}^{n}, \; \sum_{i} x_i = k \tag{1}$$

where $\lambda$ is the trade-off between edge weights and node weights.
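To make the membership-vector formulation concrete, the following sketch scores two candidate modules on a toy four-gene network. It assumes the objective has the form x^T W x + λ z^T x described above; the matrix `W`, the scores `z`, and the function name `module_score` are illustrative and are not taken from the AMOUNTAIN package.

```python
def module_score(W, z, x, lam=1.0):
    """Score of a candidate module: x^T W x + lam * z^T x.

    W is a symmetric edge-weight matrix (list of lists), z the node scores,
    x a 0/1 membership vector. With a symmetric W each edge is counted twice.
    """
    n = len(x)
    edge_term = sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    node_term = sum(z[i] * x[i] for i in range(n))
    return edge_term + lam * node_term


# Toy network (made-up numbers): genes 0 and 1 are strongly co-expressed
# and both have high activity scores.
W = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.3],
    [0.2, 0.1, 0.3, 0.0],
]
z = [0.8, 0.7, 0.1, 0.2]

score_active = module_score(W, z, [1, 1, 0, 0])  # module {0, 1}
score_other = module_score(W, z, [0, 0, 1, 1])   # module {2, 3}
print(score_active, score_other)
```

On this toy input the module {0, 1} scores higher than {2, 3} because it combines a heavy edge with active nodes, which is the behaviour the objective is meant to reward.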

NP-hardness can be proved by reducing the well-known NP-complete k-clique problem to this problem; see the supplementary materials for the details of the proof. Although integer programming methods [33, 10, 46] can be applied, they may incur high computational cost and lack theoretical guarantees with respect to running time and accuracy. Alternatively, if we relax the integer constraints on x to continuous constraints [42, 24] and control the module size through a vector norm of x, the problem becomes a nonnegative, equality-constrained quadratic programming (QP) problem (2), which can be solved by various existing continuous optimization techniques in polynomial time:

$$\min_{x \in \mathbb{R}^{n}_{+}} \; F(x) = -x^{T} W x - \lambda z^{T} x \quad \text{s.t.} \quad f(x) = 1 \tag{2}$$

where $f(x)$ is the vector norm. The $\ell_p$-norm ($p > 0$) of $x$ is defined as $\left(\sum_i |x_i|^{p}\right)^{1/p}$.

The choice of vector norm affects how problem (2) is solved. For example, the $\ell_0$-norm produces a sparse solution, which is consistent with the fact that only a few genes belong to the module. The $\ell_1$-norm is widely used as an alternative to $\ell_0$, since optimizing the latter is a combinatorial problem [11]. The $\ell_2$-norm is also widely used because it admits a closed-form solution, but it cannot lead to a sparse solution. A linear combination of $\ell_1$ and $\ell_2$, i.e. $(1 - \alpha)\|x\|_1 + \alpha\|x\|_2$, is called the elastic net penalty [49]; when the objective is least squares, $\alpha = 0$ corresponds to the lasso [38] and $\alpha = 1$ to ridge regression. The elastic net is considered to combine the characteristics of the lasso and ridge regression. In addition, the $\ell_\infty$-norm is $\max\{|x_1|, |x_2|, \dots, |x_n|\}$, which smooths the values in the vector so that all entries are roughly identical. All of these vector norms, and the corresponding existing optimization techniques, may be applied to problem (2) as long as the norm constraint captures some essential property of module membership in a weighted co-expression network. The fundamental one is the $\ell_1$-norm, since the module size must be constrained; otherwise the problem becomes trivial, with all nodes included in the target module.

The $\ell_1$-norm constraint of optimization problem (2) can be converted to $\sum_i x_i = 1$ since $x_i \ge 0$. This problem has been studied intensively in mathematical programming [7, 4] and can be solved by many existing methods. However, using only the $\ell_1$-norm constraint tends to produce a very sparse solution for problem (2); even a vector x with a single non-zero element may be optimal. [24] used the mixed norm $\ell_{0,\infty}(x) = \alpha\|x\|_0 + (1 - \alpha)\|x\|_\infty$ ($0 < \alpha < 1$) to encode the characteristics of gene membership, where the optimal vector should contain equal non-zero values and zeros elsewhere. In practice $\ell_{0,\infty}$ was approximated by $\ell_{p,2}(x) = \alpha\|x\|_p + (1 - \alpha)\|x\|_2$ ($0 < p < 1$), and the resulting non-convex problem (1), with edge weights only, was solved by MultiStage Convex Relaxation (MSCR) [45].

It is natural to use the elastic net penalty [49] to control the sparsity and achieve the desired membership, i.e. $f(x) = (1 - \alpha)\|x\|_1 + \alpha\|x\|_2$. A general strategy follows the gradient projection method [25] and generates a sequence that approximates the exact solution:

$$x^{(k+1)} = \Pi_{C}\left(x^{(k)} - \alpha^{(k)} \nabla F(x^{(k)})\right) \tag{3}$$

where $\Pi_C$ is the Euclidean projection of a vector onto the convex set $C$, defined as

$$\Pi_{C}(g) = \arg\min_{x \in C} \tfrac{1}{2}\|x - g\|_2^{2}, \qquad C = \{x \ge 0 : f(x) \le t\} \tag{4}$$

where $g$ is a constant vector and $t$ is the radius, which has little impact on the solution in practice. The step size $\alpha^{(k)}$ at step $k$ should satisfy the following condition in order to make the objective function decrease sufficiently:

$$F(x^{(k+1)}) - F(x^{(k)}) \le \sigma \, \nabla F(x^{(k)})^{T} \left(x^{(k+1)} - x^{(k)}\right) \tag{5}$$

where $\sigma$ is a small positive constant. Searching for the optimal $\alpha^{(k)}$ is time consuming. Here we adopt the same strategy as [25] and scale $\alpha^{(k)}$ by a fixed factor $\beta$ until $\alpha^{(k)}$ satisfies (5). The algorithm is thus guaranteed to converge.

Solving subproblem (4) involves a root-finding procedure [16] that can be done in linear time, and the overall iterative procedure can be sped up by Nesterov's method [30], which replaces the current iterate $x^{(k)}$ in (3) with a linear combination of the previous two iterates, $s^{(k)} = x^{(k)} + t_k (x^{(k)} - x^{(k-1)})$, where $t_k$ is another parameter chosen to ensure convergence. Nesterov's method has been shown to achieve the optimal convergence rate for first-order methods. See the supplementary material for details on how to solve the convex optimization problem (2) with the elastic net penalty.
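The effect of the different norms on module membership can be seen by comparing a maximally sparse vector with an evenly spread one. The sketch below is illustrative only; the helper names are hypothetical, and the elastic net is written in the unsquared form (1 − α)‖x‖1 + α‖x‖2 used in the text.

```python
def lp_norm(x, p):
    """l_p norm (p > 0): (sum_i |x_i|^p)^(1/p)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def linf_norm(x):
    """l_infinity norm: the largest absolute entry."""
    return max(abs(v) for v in x)


def l0_count(x):
    """l_0 'norm': the number of non-zero entries."""
    return sum(1 for v in x if v != 0)


def elastic_net(x, alpha):
    """Elastic net penalty as written in the text: (1 - alpha)*l1 + alpha*l2."""
    return (1 - alpha) * lp_norm(x, 1) + alpha * lp_norm(x, 2)


sparse = [1.0, 0.0, 0.0, 0.0]      # all membership on one gene
spread = [0.25, 0.25, 0.25, 0.25]  # membership spread over four genes

for name, vec in (("sparse", sparse), ("spread", spread)):
    print(name, l0_count(vec), lp_norm(vec, 1), lp_norm(vec, 2),
          linf_norm(vec), elastic_net(vec, 0.5))
```

Both vectors have the same l1-norm, so an l1 constraint alone cannot distinguish them; the l2 and l∞ terms are what separate a spread-out membership from a single-gene solution.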
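The gradient projection loop with a backtracking step size can be sketched as follows. This is not the authors' Algorithm 1: for brevity it swaps the elastic net ball for the simpler l1 case (the probability simplex, Σ x_i = 1, x ≥ 0), assumes F(x) = −x^T W x − λ z^T x with a symmetric W, and omits Nesterov acceleration. The function names and the toy data are illustrative.

```python
def objective(W, z, x, lam):
    """F(x) = -x^T W x - lam * z^T x (assumed form of the relaxed objective)."""
    n = len(x)
    quad = sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    return -quad - lam * sum(z[i] * x[i] for i in range(n))


def gradient(W, z, x, lam):
    """Gradient of F for a symmetric W: -2 W x - lam * z."""
    n = len(x)
    return [-2.0 * sum(W[i][j] * x[j] for j in range(n)) - lam * z[i]
            for i in range(n)]


def project_simplex(v):
    """Euclidean projection onto {x >= 0, sum(x) = 1} via sorting."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        running += uj
        t = (running - 1.0) / j
        if uj - t > 0:
            theta = t  # keep the threshold of the last index that qualifies
    return [max(vi - theta, 0.0) for vi in v]


def projected_gradient(W, z, lam=1.0, sigma=0.01, beta=0.5, iters=200):
    """Gradient projection with backtracking on the sufficient-decrease test."""
    n = len(z)
    x = [1.0 / n] * n
    for _ in range(iters):
        g = gradient(W, z, x, lam)
        f_old = objective(W, z, x, lam)
        step = 1.0
        while True:
            x_new = project_simplex([x[i] - step * g[i] for i in range(n)])
            decrease = objective(W, z, x_new, lam) - f_old
            bound = sigma * sum(g[i] * (x_new[i] - x[i]) for i in range(n))
            if decrease <= bound or step < 1e-12:
                break
            step *= beta  # shrink the step by the fixed factor beta
        x = x_new
    return x


# Toy network (made-up numbers): genes 0 and 1 form the active pair.
W = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.3],
    [0.2, 0.1, 0.3, 0.0],
]
z = [0.8, 0.7, 0.1, 0.2]

x_start = [0.25] * 4
x_final = projected_gradient(W, z)
print([round(v, 3) for v in x_final])
```

Replacing `project_simplex` with a projection onto the elastic net ball would bring the sketch closer to the setting described above; that projection needs the root-finding step mentioned in the text and is left out here.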

Algorithm 1: Euclidean projections optimization

In general we may want to identify multiple modules from one network. Similar to [47, 26], we can find N modules by running algorithm 1 N times, each time removing the resulting module from the background network. The resulting sequence of modules may indicate importance in terms of node activities and correlation similarities, in descending order. The general procedure for identifying N modules from a given gene expression profile is summarized as algorithm 2.
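The repeated extraction can be sketched as a simple loop. The single-module solver below, `greedy_module`, is a hypothetical stand-in used only to keep the example runnable; it is not algorithm 1, and in practice the continuous optimization routine would take its place.

```python
def greedy_module(W, z, genes, k, lam=1.0):
    """Stand-in single-module solver: grow a module of size k greedily.

    Starts from the remaining gene with the highest node score, then keeps
    adding the gene with the largest marginal gain in x^T W x + lam * z^T x.
    """
    module = [max(genes, key=lambda g: z[g])]
    while len(module) < k and len(module) < len(genes):
        rest = [g for g in genes if g not in module]
        best = max(rest, key=lambda g: lam * z[g]
                   + 2 * sum(W[g][m] for m in module))
        module.append(best)
    return module


def extract_modules(W, z, n_modules, k, lam=1.0):
    """Find up to n_modules modules, removing each one before the next run."""
    remaining = list(range(len(z)))
    modules = []
    for _ in range(n_modules):
        if len(remaining) < k:
            break  # not enough genes left for another module
        module = greedy_module(W, z, remaining, k, lam)
        modules.append(sorted(module))
        remaining = [g for g in remaining if g not in module]
    return modules


# Toy network (made-up numbers).
W = [
    [0.0, 0.9, 0.1, 0.2],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.3],
    [0.2, 0.1, 0.3, 0.0],
]
z = [0.8, 0.7, 0.1, 0.2]

modules = extract_modules(W, z, n_modules=2, k=2)
print(modules)
```

Because each module is removed before the next search, the modules are disjoint and the first one returned is the highest-scoring under whichever solver is plugged in.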

2.2 Two-layer network

Biological systems have been modeled as multi-layer networks before [5], and more detailed analysis of multi-layer networks has emerged in recent years [21], following intensive research on single-layer networks. The multi-layer network provides a general framework for modeling temporal and spatial changes of interactions in cellular networks, and contributes to current computational biology research in three ways: 1) modeling the dynamic properties of biological processes as multiple snapshots, 2) modeling the different responses of the same species to multiple conditions, and 3) identifying genes conserved across species as well as species-specific genes.

Algorithm 2: Euclidean projections optimization

Although the behavior of living organisms is considered to be dynamic, traditional network-based methods have focused primarily on static networks, each of which is a snapshot of the real system. Multi-layer networks provide a powerful tool for modeling such time-series networks, with each layer standing for a time point. Differential analysis of these time-series networks may reveal important time-related features of cells from a given species or tissue, such as key components that are not affected or structures that vanish.

The integration of genomic techniques into environmental toxicology has shown potential for developing exposure biomarkers and investigating modes of toxicity [31]. Representing the responses of a set of genes under different conditions as a multi-layer network may shed light on the evolutionary conservation of living organisms.

Besides capturing dynamic changes of networks over the same set of components, a multi-layer network can also capture a core set across species, such as conserved active modules [37, 9]. By constructing a multi-layer network in which each layer represents one species, we may find similar patterns in a species for which more prior information is available, and thereby gain biological knowledge about the species of interest. Finding conserved modules may also improve our understanding of evolutionary processes by highlighting the similarities and differences in key patterns between species [48].

The method of [24] was proposed for multiple gene co-expression networks, which are similar to multi-layer networks, but the two can be distinguished in two respects: 1) multiple networks share exactly the same set of nodes whereas multi-layer networks need not, which allows the latter to be applied to orthology across multiple species; 2) multi-layer networks consider inter-layer interactions whereas multiple networks do not.

Another related work is xHeinz [13], which mines cross-species network modules using an integer linear programming approach. It extends the single-network module identification algorithm heinz [10] to the two-species case and uses the same optimization technique. As discussed before, this kind of algorithm requires a skeleton interaction network plus high-throughput data, which is the main difference from our work. The node scoring function in xHeinz is inherited from heinz and requires parameter estimation in a beta-uniform mixture (BUM) model, whereas our algorithm simply uses fold-change information, since the node score is only part of the whole objective. In theory we could also use the adjusted log-likelihood ratio score from heinz. The optimization methodology we adopt is also straightforward, while xHeinz relies on the external integer programming solver CPLEX.

As in the single-layer situation, an active module in a two-layer network can be represented as a connection of two modules in two different networks $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$. The inter-layer interactions are measured by $A = [a_{ij}] \in \mathbb{R}^{n_1 \times n_2}$, where $n_1$ and $n_2$ are the numbers of nodes in $G_1$ and $G_2$. The basic two-layer network module identification problem is formally defined as

Problem 3

Given two complete graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, with vertex weights $z^{1}_{v} \in \mathbb{R}$ for each $v \in V_1$ and $z^{2}_{v} \in \mathbb{R}$ for each $v \in V_2$, edge weights $W_1$ for edges in $G_1$ and $W_2$ for edges in $G_2$, and inter-layer interactions measured by $A = [a_{ij}] \in \mathbb{R}^{n_1 \times n_2}$, the goal is to find two subgraphs $T_1 \subseteq G_1$ and $T_2 \subseteq G_2$ that both have large vertex weights and edge weights and that interact intensively with each other.

We use two variables x and y to represent the memberships of the active modules in the two networks; $x_i = 1$ means that the i-th node in the first network is in the module. The optimization problem can then be expressed as an extension of (2),

$$\min_{x \in \mathbb{R}^{n_1}_{+},\, y \in \mathbb{R}^{n_2}_{+}} \; F(x, y) = -x^{T} W_1 x - \lambda_1 z_1^{T} x - y^{T} W_2 y - \lambda_2 z_2^{T} y - \lambda_3 x^{T} A y \quad \text{s.t.} \quad f_1(x) = 1, \; f_2(y) = 1 \tag{6}$$

where $f_1(x)$ and $f_2(y)$ are the vector norms on the two vectors respectively. For simplicity we use the same elastic net penalty $f(x) = (1 - \alpha)\|x\|_1 + \alpha\|x\|_2$ or mixed norm penalty $f(x) = \alpha\|x\|_p + (1 - \alpha)\|x\|_2$ for both x and y. The general idea for solving (6) is alternating optimization, i.e. iteratively optimizing one variable while fixing the other [25]. When one variable is optimized, the problem has the same form as (2). Each iteration of the procedure can be expressed simply as:

  • Find x(k+1) such that F(x(k+1), y(k)) ≤ F(x(k), y(k)) and,

  • Find y(k+1) such that F(x(k+1), y(k+1)) < F(x(k+1), y(k))

The complete algorithm for finding multiple modules in the two-layer network has the same structure as algorithm 2. The additional parameter λ3 in (6) controls how strongly the inter-layer links affect the resulting modules. Taking multiple species as an example, a large λ3 leads to conserved modules across species, which may reveal gene conservation in response to certain changes. Conversely, a small λ3, e.g. λ3 = 0, means that the inter-layer information plays no role, which leads to two independent module identification processes. With more than two layers, the rationale stays the same. As long as each layer has a different set of nodes, alternating optimization can be used in the same way as in two-layer networks. Otherwise, when inter-layer links are not considered, a more compact tensor-based computational paradigm [24] can be more efficient.
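The role of λ3 can be illustrated by evaluating a two-layer objective directly. The sketch assumes the objective has the form F(x, y) = −x^T W1 x − λ1 z1^T x − y^T W2 y − λ2 z2^T y − λ3 x^T A y, which is an assumption consistent with the description above rather than a quotation of equation (6); all data and names are illustrative.

```python
def quad(W, x):
    """x^T W x for a matrix given as a list of lists."""
    n = len(x)
    return sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def bilinear(x, A, y):
    """x^T A y for an n1-by-n2 inter-layer matrix A."""
    return sum(x[i] * A[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def two_layer_objective(W1, z1, x, W2, z2, y, A, lam1, lam2, lam3):
    """Assumed two-layer objective; smaller values are better."""
    return (-quad(W1, x) - lam1 * dot(z1, x)
            - quad(W2, y) - lam2 * dot(z2, y)
            - lam3 * bilinear(x, A, y))


# Two toy layers with two genes each; only gene 0 of layer 1 and
# gene 0 of layer 2 are linked (e.g. an orthologous pair).
W1 = [[0.0, 0.2], [0.2, 0.0]]
W2 = [[0.0, 0.2], [0.2, 0.0]]
z1 = [0.5, 0.5]
z2 = [0.5, 0.5]
A = [[1.0, 0.0], [0.0, 0.0]]

x = [1, 0]
y_linked = [1, 0]    # picks the gene linked to x's gene
y_unlinked = [0, 1]  # picks the other gene

f_linked = two_layer_objective(W1, z1, x, W2, z2, y_linked, A, 1, 1, 1.0)
f_unlinked = two_layer_objective(W1, z1, x, W2, z2, y_unlinked, A, 1, 1, 1.0)
f_linked_0 = two_layer_objective(W1, z1, x, W2, z2, y_linked, A, 1, 1, 0.0)
f_unlinked_0 = two_layer_objective(W1, z1, x, W2, z2, y_unlinked, A, 1, 1, 0.0)
print(f_linked, f_unlinked, f_linked_0, f_unlinked_0)
```

With λ3 = 1 the linked choice has the lower objective, while with λ3 = 0 the two choices are indistinguishable, matching the description of λ3 = 0 as two independent problems.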

3 Results

3.1 Synthetic data

Several related works have used artificially generated data [34, 40, 23, 33] to test their algorithms for single-network module identification. Unlike those networks, the simulated networks here should have a clear topological structure as well as node scores. We follow [24] to construct gene co-expression networks for the simulation study. Let n be the number of genes; edge weights and node scores follow the uniform distribution on [0, 1]. A module contains k genes, within which the edge weights and node scores follow the uniform distribution on [θ, 1], where θ = {0.5, 0.6, 0.7, 0.8, 0.9}. Figure 1 shows the weighted co-expression network for n = 100 and k = 20; red nodes indicate module members and wider edges indicate larger similarities. Visualization is based on qgraph [14].

With the ground truth in hand, we can define the following performance measure for problem (2). From the topological view, the accuracy of each single module identification is treated as a binary classification problem. The performance is defined by the 2 × 2 confusion matrix: the number of genes correctly detected (true positives, tp), the number of genes in the identified module but not in the real module (false positives, fp), the number of genes in the real module but not in the identified module (false negatives, fn), and the number of genes in neither the identified module nor the real module (true negatives, tn).
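A simulated network of this kind can be generated in a few lines. The sketch assumes that background edge weights and node scores are drawn uniformly from [0, 1] and that values inside the planted module are drawn uniformly from [θ, 1]; the function name and the seed handling are illustrative, not the package's own generator.

```python
import random


def simulate_network(n, k, theta, seed=0):
    """Return (W, z, members) for a network with one planted module.

    W is a symmetric n-by-n edge-weight matrix with a zero diagonal,
    z the node scores, members the set of k planted module genes.
    """
    rng = random.Random(seed)
    members = set(rng.sample(range(n), k))
    # Node scores: [theta, 1] inside the module, [0, 1] elsewhere (assumed).
    z = [rng.uniform(theta, 1.0) if i in members else rng.uniform(0.0, 1.0)
         for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            inside = i in members and j in members
            low = theta if inside else 0.0
            W[i][j] = W[j][i] = rng.uniform(low, 1.0)
    return W, z, members


W, z, members = simulate_network(n=100, k=20, theta=0.8)
print(len(members), min(z[i] for i in members))
```

Keeping the planted member set alongside the network is what makes the accuracy of an identified module measurable afterwards.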

Figure 1: Simulated weighted gene co-expression network.

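$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn} \tag{7}$$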
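As a worked example of the confusion-matrix counts, the sketch below computes the standard F-score from an identified gene set and a ground-truth gene set. The gene identifiers are made up, and the function name is illustrative.

```python
def f_score(identified, truth):
    """Standard F-score from set overlap: 2PR / (P + R)."""
    identified, truth = set(identified), set(truth)
    tp = len(identified & truth)   # genes correctly detected
    fp = len(identified - truth)   # identified but not in the real module
    fn = len(truth - identified)   # in the real module but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


half = f_score({1, 2, 3, 4}, {3, 4, 5, 6})     # 2 of 4 genes overlap
perfect = f_score({3, 4, 5, 6}, {3, 4, 5, 6})  # exact recovery
print(half, perfect)
```

True negatives do not enter the F-score, which is convenient here because the module is tiny compared with the whole network.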

We conduct the simulation study on a relatively large network with n = 10000, and consider a sparse case in which one module contains k = 100 genes and node scores follow the uniform distribution on [θ, 1] with θ = 0.5. Our algorithm 1 with the elastic net penalty outperforms the algorithm that uses the mixed norm in (2), in terms of both running time and predictive accuracy. By choosing suitable parameters from a grid search over $\alpha \in \{0.1, \dots, 0.9\}$ in the elastic net penalty and $\lambda \in \{2^{-5}, \dots, 2^{5}\}$ in (2), we can recover the target module nodes almost exactly. The optimal values for this network are α between 0.3 and 0.4 and λ between $2^{-5}$ and $2^{-1}$, which give F = 1. Figure 2 shows how these parameters affect the F-score (7) in this case. A similar result is found in the two-layer network simulation.

Figure 2: Parameters selection for algorithm 1 on large network.

3.2 Real-world data for single layer network

We mainly use gene expression datasets from the Gene Expression Omnibus (GEO) [12] as real-world examples. GSE3635 and GSE5283 are two array-based expression profiling datasets from Saccharomyces cerevisiae. The original goal of the datasets was to study the regulatory role of the transcription factors YOX1 and YHP1 during the cell cycle of Saccharomyces cerevisiae [32], using a mutant with YOX1 and YHP1 deleted. Wild-type and mutant cells were collected at 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, and 120 minutes after synchronization with alpha factor. We take the former as the control group and the latter as the experimental group, since they share the same set of genes, and we assume that the biological processes affected by deleting YOX1 and YHP1 are captured by the gene expression values. The goal is to find several sets of genes, i.e. modules identified using algorithm 1, that are closely related to these biological processes. The gene expression values have already been normalized as log ratios using Rosetta Resolver, so we only need to deal with missing or invalid values. The strategy is simple: discard probes with more than 20% missing values or NAs, and replace the missing or NA positions in a valid probe with the mean value of the remaining samples of that probe. We do not pre-select significantly expressed genes with a linear model as many other methods do, since the algorithm requires gene correlation information that is as complete as possible, and gene activities contribute only part of the whole objective.
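The missing-value strategy described above can be sketched as follows. The data layout (a dictionary mapping probe names to lists of sample values, with `None` or NaN marking missing entries), the probe names, and the function names are assumptions made for illustration.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_probes(probes, max_missing=0.2):
    """Drop probes with more than max_missing missing values and
    mean-impute the remaining gaps within each kept probe."""
    cleaned = {}
    for name, values in probes.items():
        n_missing = sum(1 for v in values if is_missing(v))
        if n_missing / len(values) > max_missing:
            continue  # too many gaps: discard the whole probe
        present = [v for v in values if not is_missing(v)]
        mean = sum(present) / len(present)
        cleaned[name] = [mean if is_missing(v) else v for v in values]
    return cleaned


# Hypothetical probes with five samples each.
probes = {
    "probe_A": [1.0, None, 3.0, 2.0, 2.0],          # 20% missing: kept
    "probe_B": [0.5, None, float("nan"), 0.7, 0.9],  # 40% missing: dropped
    "probe_C": [0.1, 0.2, 0.3, 0.4, 0.5],           # complete
}
cleaned = clean_probes(probes)
print(cleaned)
```

The threshold is applied strictly ("more than 20%"), so a probe with exactly one missing value out of five is kept and imputed.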

First, algorithm 1 requires a weighted gene co-expression network as input. We employ the same differential analysis of gene pairs as [18]. The difference in co-expression, which is also the edge score of genes i and j between the two conditions a (control) and b (experimental), is quantified by Embedded Image where Embedded Image is the correlation between genes i and j under condition b. The node score, which reflects the degree of gene expression activity, is measured by the ratio of the expression levels under the two conditions, Embedded Image where Embedded Image is the expression value of gene i in the j-th sample under condition b, and m and n are the numbers of samples in conditions b and a respectively. Here we let N = 10 in the algorithm. The parameter λ represents the trade-off between edge weights and node weights and appears to have only a slight effect on performance. Here we simply fix λ = 1 and use binary search to select α for the elastic net penalty, which controls the sparsity of the module. See section 4 of the supplementary text for how to select the parameters. Here we aim for module sizes of around 100 to 200 genes. These modules are provided as gene lists (S1.xlsx).
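The binary search for α can be sketched as below. It assumes that the module size does not decrease as α grows (more weight on the l2 term gives a less sparse solution); this monotonicity is an assumption of the sketch, and `toy_module_size` is a hypothetical stand-in for running the module finder at a given α and counting the non-zero memberships.

```python
def search_alpha(module_size, lo=0.0, hi=1.0, target=(100, 200), max_iter=30):
    """Bisect alpha until module_size(alpha) falls inside the target range.

    Assumes module_size is non-decreasing in alpha. Returns (alpha, size),
    or None if no suitable alpha is found within max_iter steps.
    """
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        size = module_size(mid)
        if target[0] <= size <= target[1]:
            return mid, size
        if size < target[0]:
            lo = mid  # module too small: allow a denser solution
        else:
            hi = mid  # module too large: enforce more sparsity
    return None


def toy_module_size(alpha):
    """Hypothetical stand-in: module size as a smooth function of alpha."""
    return int(1000 * alpha ** 2)


result = search_alpha(toy_module_size)
print(result)
```

For two layers the same routine would be run once per layer, which is why the parameter search over both α values behaves like a grid search.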

We performed functional enrichment analysis of these modules. The basis of functional enrichment of a module is to assign the biological process annotations in the Gene Ontology [1] to the genes (proteins) in that module. The probability that a module of size n has the same function as an existing functional module can be calculated from a hypergeometric distribution with the Gene Ontology Term Finder. The p-value is calculated by the following formula,

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$

where N is the total number of genes (proteins) within the genome, M is the total number of genes (proteins) within a category, and k is the number of genes in the module that belong to that category. A low p-value indicates that the genes have high overlap with the enriched functional categories and are thus biologically significant. The results show that module identification using algorithm 1 can find consistent biological processes from expression data alone, which helps to understand the underlying mechanisms related to the biological conditions. Since YOX1 and YHP1 are important transcription factors in the regulation of the cell cycle, the identified modules are enriched in corresponding biological processes such as single-organism cellular process (GO:0044763), cellular macromolecule metabolic process (GO:0044260) and nucleobase-containing compound metabolic process (GO:0006139). Furthermore, the enriched GO terms become less significant for modules identified later, which is consistent with the algorithmic setting.
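The hypergeometric enrichment test can be reproduced with exact integer arithmetic. The sketch computes the upper-tail probability of seeing at least k category genes in a module of size n, which is the usual form of this test; k (the number of module genes in the category) and the function name are introduced here for illustration, and the numbers are a made-up toy case.

```python
from math import comb


def enrichment_p_value(N, M, n, k):
    """P(X >= k) for X ~ Hypergeometric(N genes, M in category, n drawn).

    Equivalent to 1 - sum_{i=0}^{k-1} C(M, i) * C(N - M, n - i) / C(N, n).
    """
    total = comb(N, n)
    upper = min(n, M)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy case: genome of 10 genes, 4 in the category, module of 3 genes,
# 2 of which are in the category.
p = enrichment_p_value(N=10, M=4, n=3, k=2)
print(p)
```

Summing the upper tail directly avoids the loss of precision that `1 - cdf` can suffer when the p-value is very small, which is the interesting regime for enrichment analysis.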

3.3 Real-world data for two-layer network

Inspired by xHeinz [13], we chose two expression datasets for the two species Mus musculus and Homo sapiens, GSE43955 and GSE35103, for the multilayer case. The original studies [43] and [39] reported expression profiles associated with Th17 cell differentiation. Here we expect the proposed algorithm to find consistent results from a two-layer gene co-expression network. Each layer of the network is constructed from the gene expression data of one species, and the edge weights and node weights are again defined by (8) and (9). In each layer the two conditions are with and without Th17. Specifically, we use the expression values of the two conditions at different time points to check whether the effect varies over time. The inter-layer connections are defined by orthology information obtained from Ensembl 84. We use the associated gene name as the unique identifier for each gene (node) in both human and mouse, and the corresponding orthologous mapping table is embedded into the two-layer network. After gene expression data pre-processing and orthologue selection, we obtain 19332 genes in the human layer and 13656 genes in the mouse layer. There are 8066 links between the two layers, representing confident orthologous mapping pairs.

As in the single-layer case, we use binary search for the parameter α in the elastic net penalties to obtain the desired module size for both layers. This becomes a grid search, since two desired modules must be obtained at the same time. Here we consider only a large λ3 in (6), which aims to find a conserved module across the two species. We use samples from all conditions to construct the basic weighted co-expression networks for the two species, but expression values at different times to define node activities. This is because correlation-based network construction needs as many samples as possible, whereas gene activities are closely tied to specific conditions, including the exposure time.

Figure 3 shows the most active module (the first identified) for human and mouse at the 2-hour time point, visualized by muxViz [8]. The conserved module is obtained with λ3 = 1000 in (6), where inter-layer links represent orthologous gene pairs. The gene lists of the two modules are attached in a table file (S2.xlsx). Gene ontology enrichment analysis indicates that several GO terms such as response to virus (GO:0009615) and cellular response to type I interferon (GO:0071357) are significantly enriched for mouse, while GO terms such as response to endoplasmic reticulum stress (GO:0034976) and response to topologically incorrect protein (GO:0035966) are enriched for human. Both modules show some cellular response to topologically incorrect protein (GO:0035967) or virus (GO:0051607), which is closely related to the functional role that Th17 differentiation plays in the pathogenesis of autoimmune and inflammatory diseases [39].
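Building the inter-layer matrix from an orthology table can be sketched as follows. The gene names are placeholders rather than entries from Ensembl, and using a binary weight of 1 for each orthologous pair is an assumption of this sketch.

```python
def build_interlayer(layer1_genes, layer2_genes, ortholog_pairs):
    """Return an n1-by-n2 matrix A with A[i][j] = 1 for each orthologous pair.

    Pairs that mention a gene absent from either layer are skipped.
    """
    index1 = {gene: i for i, gene in enumerate(layer1_genes)}
    index2 = {gene: j for j, gene in enumerate(layer2_genes)}
    A = [[0.0] * len(layer2_genes) for _ in layer1_genes]
    for g1, g2 in ortholog_pairs:
        if g1 in index1 and g2 in index2:
            A[index1[g1]][index2[g2]] = 1.0
    return A


# Placeholder gene names for two species.
human_genes = ["GENEA", "GENEB", "GENEC"]
mouse_genes = ["Genea", "Geneb"]
ortholog_pairs = [
    ("GENEA", "Genea"),
    ("GENEB", "Geneb"),
    ("GENED", "Gened"),  # not present in either layer: ignored
]

A = build_interlayer(human_genes, mouse_genes, ortholog_pairs)
n_links = sum(int(v) for row in A for v in row)
print(A, n_links)
```

Because the two layers may have different gene sets, A is rectangular, which is what distinguishes this setting from multiple networks over a shared node set.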

Figure 3: Conserved modules for human and mouse. Visualized by muxViz [8]

Besides the early stage of Th17, we also explore the characteristics of the other time points (12h, 24h, 48h and 72h). Figure 4 shows the most active module (the first identified) for human across these four time points. We can see that some shared genes remain active over time and play important roles in these networks. All gene lists are attached in a table file (S3.xlsx). Gene ontology enrichment analysis shows that all modules are significantly related to several biological processes, which is consistent with previous studies [].

Figure 4:

Active modules for human across time points. Visualized by muxViz [8]

The first active modules identified at each time point show some differences, indicating that modification of Th17 would consistently have an impact on the cell over time. From the algorithmic point of view, when fixing λ3 in (2) we need to find the optimal parameters of the elastic net penalties for both species, i.e. f1(x) = α1‖x‖p + (1−α1)‖x‖2 and f2(y) = α2‖y‖p + (1−α2)‖y‖2. The grid search runs a binary search on each of α1 and α2 to find a combination that yields the desired modules. Table 1 shows the results.

Table 1:

Best parameters for equation (2)
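The penalty and the two-parameter search described above can be sketched as follows. This is a simplified illustration under stated assumptions: `p` defaults to 1 because its value is not given in this section, `solve` is a hypothetical callable returning the two module sizes for a pair (α1, α2), and an exhaustive grid replaces the per-layer binary search for brevity.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def elastic_penalty(x: Sequence[float], alpha: float, p: float = 1.0) -> float:
    """Compute alpha * ||x||_p + (1 - alpha) * ||x||_2."""
    norm_p = sum(abs(v) ** p for v in x) ** (1.0 / p)
    norm_2 = sum(v * v for v in x) ** 0.5
    return alpha * norm_p + (1.0 - alpha) * norm_2


def grid_search_alphas(solve: Callable[[float, float], Tuple[int, int]],
                       grid: Sequence[float],
                       target1: int,
                       target2: int) -> Tuple[float, float]:
    """Return the (alpha1, alpha2) pair on the grid whose module sizes
    are jointly closest to the two targets."""
    def gap(pair: Tuple[float, float]) -> int:
        size1, size2 = solve(*pair)
        return abs(size1 - target1) + abs(size2 - target2)

    return min(product(grid, grid), key=gap)


# Hypothetical stand-in for the two-layer solver.
def toy_solve(a1: float, a2: float) -> Tuple[int, int]:
    return round(100 * (1 - a1)), round(80 * (1 - a2))


best = grid_search_alphas(toy_solve, [0.0, 0.25, 0.5, 0.75], 50, 20)
```

With a real solver the two layers are coupled through λ3, so the size of one module may change when the other layer's α changes; that coupling is why a joint search is needed rather than two independent ones.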

We use Enrichr [6], an integrative gene list enrichment analysis tool, for human; it provides not only pathway and gene ontology enrichment analysis but also a visualization for each. Figure 5 shows the top GO terms enriched in the most active modules at different time points and their relationships; see part 3 of the supplementary file for the other time points (Figures S1-S3). Although these modules have many genes in common, the functions enriched in the first identified module change slightly over time. Structural constituent of ribosome (GO:0003735) appears frequently as a top enriched term at all time points, serving as a fundamental function.

Figure 5:

GO terms network of the identified module at 12h

4 Discussion and Conclusion

Many works have discussed key individual components, such as transcription factors, that play important roles in a biological process. Popular tools such as limma [35] use statistical models to find differentially expressed genes, which has almost become a standard preprocessing step for further study. Taking Th17 cell differentiation as an example, previous works [36, 43, 39] reveal the important transcription factors involved and the related mechanisms. Our model is designed not for individual gene detection tasks such as transcription factor identification or gene prioritization, but to provide a complementary, modular perspective. Modules identified from a genome-wide network may reveal system-level properties of the related biological mechanisms. The goal of Algorithm 1 is to establish a computational approach that uses only gene expression data from different conditions to find biological modules that show significant responses caused by expression changes.

Continuous optimization, and convex optimization in particular, also offers highly efficient computational tools as an alternative to the widely used heuristic algorithms [19, 48] or discrete optimization [10, 13] for active module identification. Furthermore, convex optimization methods typically come with guarantees on running time and accuracy. On the one hand, this guarantee makes the solution more reliable, even unique, given precise input. On the other hand, the so-called optimal solution relies entirely on the algorithmic input, which places higher demands on data preprocessing and model assumptions. Taking the real-world data studies in sections 3.2 and 3.3 as examples, slight differences in how node scores or edge weights are computed may lead to totally different results. Conversely, the uncertainty in heuristic algorithms may offer flexibility regarding model assumptions and algorithmic input. In other words, a gap between the computational model and the real case does exist. From the software design and implementation point of view, open source and user-friendly tools have more advantages. A large number of reliable open source libraries are readily available for mature convex optimization techniques, and it is not difficult to implement their core parts. In contrast, implementing special-purpose heuristic algorithms or integer programming is challenging.

This paper describes a general continuous-optimization-based method for identifying active modules in multilayer gene co-expression networks. With a suitable replacement of the node (gene) similarity matrix and node activities, the proposed method can be easily extended to other applications. The idea of formulating conserved module identification across multiple layers under a unified framework can also be enriched with more sophisticated considerations, such as fusing multiple data sources instead of using gene expression profiles alone. Future work may be integrated more closely with specific applications.
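To make the two inputs mentioned above concrete, the sketch below builds a node similarity matrix and a node activity score from expression values. It is one possible choice under explicit assumptions, not the definition used in this paper: similarity is the absolute Pearson correlation raised to a hypothetical soft-threshold power `beta`, and activity is the absolute difference of mean expression between two conditions.

```python
from math import sqrt
from typing import List, Sequence


def pearson(u: Sequence[float], v: Sequence[float]) -> float:
    """Pearson correlation of two equally long profiles (0.0 if constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v) if sd_u > 0 and sd_v > 0 else 0.0


def coexpression_matrix(expr: List[Sequence[float]],
                        beta: float = 1.0) -> List[List[float]]:
    """expr[i] is the profile of gene i across all samples.
    Returns a symmetric matrix of |correlation| ** beta with a zero diagonal."""
    n = len(expr)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = abs(pearson(expr[i], expr[j])) ** beta
            weights[i][j] = weights[j][i] = w
    return weights


def node_activity(case: Sequence[float], control: Sequence[float]) -> float:
    """Assumed activity score: absolute difference of condition means."""
    return abs(sum(case) / len(case) - sum(control) / len(control))


W = coexpression_matrix([[1, 2, 3], [2, 4, 6], [3, 2, 1]])
z = node_activity([2.0, 4.0], [1.0, 1.0])
```

Swapping in a different similarity measure or activity score only changes these two helper functions, which is the kind of replacement the paragraph above refers to.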

Funding

This work has been supported by the…

Footnotes

  • 1 http://www.yeastgenome.org/help/analyze/go-term-finder#pvalue

  • 2 http://www.ensembl.org/

References

  [1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, et al. Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25–29, 2000.
  [2] C. Backes, A. Rurainski, G. W. Klau, O. Müller, D. Stöckel, A. Gerasch, J. Küntzer, D. Maisel, N. Ludwig, M. Hein, et al. An integer linear programming approach for finding deregulated subgraphs in regulatory networks. Nucleic acids research, 40(6):e43–e43, 2012.
  [3] A.-L. Barabási, N. Gulbahce, and J. Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(1):56–68, 2011.
  [4] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
  [5] P. Brazhnik, A. de la Fuente, and P. Mendes. Gene networks: how to put the function in genomics. TRENDS in Biotechnology, 20(11):467–472, 2002.
  [6] E. Y. Chen, C. M. Tan, Y. Kou, Q. Duan, Z. Wang, G. V. Meirelles, N. R. Clark, and A. Ma'ayan. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC bioinformatics, 14(1):128, 2013.
  [7] T. F. Coleman and J. Liu. An interior Newton method for quadratic programming. Mathematical programming, 85(3):491–523, 1999.
  [8] M. De Domenico, M. A. Porter, and A. Arenas. Muxviz: a tool for multilayer analysis and visualization of networks. Journal of Complex Networks, page cnu038, 2014.
  [9] R. Deshpande, S. Sharma, C. M. Verfaillie, W.-S. Hu, and C. L. Myers. A scalable approach for discovering conserved active subnetworks across species. PLoS computational biology, 6(12):e1001028, 2010.
  [10] M. T. Dittrich, G. W. Klau, A. Rosenwald, T. Dandekar, and T. Müller. Identifying functional modules in protein-protein interaction networks: an integrated exact approach. Bioinformatics, 24(13):i223–i231, 2008.
  [11] D. L. Donoho. Compressed sensing. Information Theory, IEEE Transactions on, 52(4):1289–1306, 2006.
  [12] R. Edgar, M. Domrachev, and A. E. Lash. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research, 30(1):207–210, 2002.
  [13] M. El-Kebir, H. Soueidan, T. Hume, D. Beisser, M. Dittrich, T. Müller, G. Blin, J. Heringa, M. Nikolski, L. F. Wessels, et al. xheinz: An algorithm for mining cross-species network modules under a flexible conservation model. Bioinformatics, page btv316, 2015.
  [14] S. Epskamp, A. O. Cramer, L. J. Waldorp, V. D. Schmittmann, and D. Borsboom. qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48(4):1–18, 2012.
  [15] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the national academy of sciences, 99(12):7821–7826, 2002.
  [16] P. Gong, K. Gai, and C. Zhang. Efficient Euclidean projections via piecewise root finding and its application in gradient projection. Neurocomputing, 74(17):2754–2766, 2011.
  [17] Z. Guo, Y. Li, X. Gong, C. Yao, W. Ma, D. Wang, Y. Li, J. Zhu, M. Zhang, D. Yang, et al. Edge-based scoring and searching method for identifying condition-responsive protein-protein interaction sub-network. Bioinformatics, 23(16):2121–2128, 2007.
  [18] C.-L. Hsu, H.-F. Juan, and H.-C. Huang. Functional analysis and characterization of differential coexpression networks. Scientific reports, 5, 2015.
  [19] T. Ideker, O. Ozier, B. Schwikowski, and A. F. Siegel. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics, 18(suppl 1):S233–S240, 2002.
  [20] R. M. Karp. Reducibility among combinatorial problems. Springer, 1972.
  [21] M. Kivela, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter. Multilayer networks. Journal of Complex Networks, 2(3):203–271, 2014.
  [22] N. E. Kouvaris, S. Hata, and A. Diaz-Guilera. Pattern formation in multiplex networks. Scientific reports, 5, 2015.
  [23] P. Langfelder and S. Horvath. WGCNA: an R package for weighted correlation network analysis. BMC bioinformatics, 9(1):559, 2008.
  [24] W. Li, C.-C. Liu, T. Zhang, H. Li, M. S. Waterman, and X. J. Zhou. Integrative analysis of many weighted co-expression networks using tensor computation. PLoS Comput Biol, 7(6):e1001106, 2011.
  [25] C.-J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural computation, 19(10):2756–2779, 2007.
  [26] Y. Liu, D. A. Tennant, Z. Zhu, J. K. Heath, X. Yao, and S. He. Dime: a scalable disease module identification algorithm with application to glioma progression. PloS one, 9(2), 2014.
  [27] H. Ma, E. E. Schadt, L. M. Kaplan, and H. Zhao. Cosine: Condition-specific sub-network identification using a global optimization method. Bioinformatics, 27(9):1290–1298, 2011.
  [28] K. Mitra, A.-R. Carvunis, S. K. Ramesh, and T. Ideker. Integrative approaches for finding modular structure in biological networks. Nature Reviews Genetics, 14(10):719–732, 2013.
  [29] P. J. Mucha, T. Richardson, K. Macon, M. A. Porter, and J.-P. Onnela. Community structure in time-dependent, multiscale, and multiplex networks. science, 328(5980):876–878, 2010.
  [30] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
  [31] H. C. Poynton, J. M. Lazorchak, C. A. Impellitteri, B. J. Blalock, K. Rogers, H. J. Allen, A. Loguinov, J. L. Heckman, and S. Govindasamy. Toxicogenomic responses of nanotoxicity in Daphnia magna exposed to silver nitrate and coated silver nanoparticles. Environmental science & technology, 46(11):6288–6296, 2012.
  [32] T. Pramila, S. Miles, D. GuhaThakurta, D. Jemiolo, and L. L. Breeden. Conserved homeodomain proteins interact with MADS box protein Mcm1 to restrict ECB-dependent transcription to the M/G1 phase of the cell cycle. Genes & development, 16(23):3034–3045, 2002.
  [33] Y.-Q. Qiu, S. Zhang, X.-S. Zhang, and L. Chen. Detecting disease associated modules and prioritizing active genes based on high throughput data. BMC bioinformatics, 11(1):26, 2010.
  [34] D. Rajagopalan and P. Agarwal. Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics, 21(6):788–793, 2005.
  [35] G. K. Smyth. Limma: linear models for microarray data. In Bioinformatics and computational biology solutions using R and Bioconductor, pages 397–420. Springer, 2005.
  [36] B. Stockinger and M. Veldhoen. Differentiation and function of Th17 T cells. Current opinion in immunology, 19(3):281–286, 2007.
  [37] J. M. Stuart, E. Segal, D. Koller, and S. K. Kim. A gene-coexpression network for global discovery of conserved genetic modules. science, 302(5643):249–255, 2003.
  [38] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  [39] S. Tuomela, V. Salo, S. K. Tripathi, Z. Chen, K. Laurila, B. Gupta, T. Aijo, L. Oikari, B. Stockinger, H. Lahdesmaki, et al. Identification of early gene expression changes during human Th17 cell differentiation. Blood, 119(23):e151–e160, 2012.
  [40] I. Ulitsky and R. Shamir. Identification of functional modules using network topology and high-throughput data. BMC systems biology, 1(1):8, 2007.
  [41] I. Ulitsky and R. Shamir. Identifying functional modules using expression profiles and confidence-scored protein interactions. Bioinformatics, 25(9):1158–1164, 2009.
  [42] Y. Wang and Y. Xia. Condition specific subnetwork identification using an optimization model. Proc Optim Syst Biol, 9:333–340, 2008.
  [43] N. Yosef, A. K. Shalek, J. T. Gaublomme, H. Jin, Y. Lee, A. Awasthi, C. Wu, K. Karwacz, S. Xiao, M. Jorgolli, et al. Dynamic regulatory network controlling Th17 cell differentiation. Nature, 496(7446):461–468, 2013.
  [44] B. Zhang and S. Horvath. A general framework for weighted gene coexpression network analysis. Statistical applications in genetics and molecular biology, 4(1), 2005.
  [45] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research, 11:1081–1107, 2010.
  [46] X.-M. Zhao, R.-S. Wang, L. Chen, and K. Aihara. Uncovering signal transduction networks from high-throughput data by integer linear programming. Nucleic acids research, 36(9):e48–e48, 2008.
  [47] Y. Zhao, E. Levina, and J. Zhu. Community extraction for social networks. Proceedings of the National Academy of Sciences, 108(18):7321–7326, 2011.
  [48] G. E. Zinman, S. Naiman, D. M. O'Dee, N. Kumar, G. J. Nau, H. Y. Cohen, and Z. Bar-Joseph. Moduleblast: identifying activated sub-networks within and across species. Nucleic acids research, 43(3):e20–e20, 2015.
  [49] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Posted June 03, 2016.

Subject Area

  • Bioinformatics