Abstract
Motivation The evolutionary history of biological networks enables deep functional and evolutionary understanding of various bio-molecular processes. Network growth models, such as the Duplication-Mutation with Complementarity (DMC) model, provide a principled approach to characterizing the evolution of protein-protein interactions (PPI) based on duplication and divergence. Current methods for model-based ancestral network reconstruction primarily use greedy heuristics and yield sub-optimal solutions.
Results We present a new Integer Linear Programming (ILP) solution for maximum likelihood reconstruction of ancestral PPI networks using the DMC model. By construction, our model is designed to find the optimal solution. It can also use efficient heuristics from general-purpose ILP solvers to obtain multiple optimal and near-optimal solutions that may be useful in many applications. Experiments on synthetic and real data show that our ILP obtains solutions with higher likelihood than those from previous methods. We evaluate our algorithm on two real PPI networks, with proteins from the families of bZIP transcription factors and the Commander complex. On both networks, solutions from our ILP have higher likelihood and are in better agreement with independent biological evidence from other studies.
Availability A Python implementation is available at https://bitbucket.org/cdal/.
Contact vaibhav.rajan{at}nus.edu.sg
1 Introduction
The relationship between an organism’s genotype and phenotype is mediated by complex biological interactions. Snapshots of such interactions are graphically captured by networks, and spatio-temporal analysis of biological networks has led to deep functional and evolutionary understanding of molecular and cellular processes (Yamada and Bork, 2009). Knowledge of the evolution of networks such as Protein-Protein Interactions (PPI), metabolic and gene regulatory networks has been effectively used in the study of: molecular mechanisms in yeast (Wagner, 2001), cell signaling and adhesion genes (Nichols et al., 2006), modularity in metabolic networks of bacterial species (Kreimer et al., 2008), and of protein complexes (Pereira-Leal et al., 2006), functional modules from conserved ancestral protein-protein interactions (Dutkowski and Tiuryn, 2007), evolutionary trends of biosynthetic capacity loss in parasites (Borenstein and Feldman, 2009), regulatory network inference (Zhang and Moret, 2010) and essential and disease-related genes in humans (Vidal et al., 2011).
Generative models that describe the evolution of networks, called network growth models, have been used to explain properties of networks in other domains; examples include the Preferential Attachment Model (Barabási and Albert, 1999) for the World Wide Web and the Forest Fire Model (Leskovec et al., 2005) for social networks. These models encode assumptions of evolutionary processes in terms of graph operations. The key evolutionary process characterizing biological networks is duplication and divergence (Wagner, 2001). Thus each evolutionary step is modeled by duplication of a network node (including its incident edges) and deletion of some of the incident edges. Such models have been elucidated and validated in several biological studies (Chung et al., 2003; Vázquez et al., 2003). In this work we use the Duplication-Mutation with Complementarity (DMC) model, which has been found to fit PPI networks better than other commonly used network growth models (Middendorf et al., 2005; Navlakha and Kingsford, 2011).
Similar to reconstruction algorithms to infer evolutionary history of sequences, we can use a network growth model to obtain principled model-based reconstruction of ancestral networks. Assuming such a generative model, ancestral reconstruction seeks to find the most likely sequence of networks that yields the extant network. This entails inferring the order in which nodes duplicate and edges are lost at each step during evolution. Several algorithms have been designed for ancestral network reconstruction. An algorithm for maximum likelihood ancestral reconstruction based on the DMC model, called ReverseDMC, was developed by Navlakha and Kingsford (2011). ReverseDMC greedily (by maximizing the likelihood of that single step) chooses an anchor node that is duplicated, at each step of evolution.
ReverseDMC uses only extant network topology to infer ancestral networks. Variants that can use additional biological information of the extant proteins, when available, for ancestral reconstruction have also been proposed. Such additional information includes protein duplication history (Li et al., 2013; Jasra et al., 2015) and evolutionary periods of proteins (Zhang et al., 2017). Other techniques for ancestral network reconstruction include the use of graphical models (Pinney et al., 2007), and parsimony-based approaches that find one or more ancestral reconstructions with the minimum number of interaction gain/loss events (Patro et al., 2012; Patro and Kingsford, 2013). These methods also use the gene duplication history and extant networks of multiple species during ancestral network reconstruction. Most of these methods, including ReverseDMC, yield only one evolutionary history, which is obtained by optimizing a mathematical criterion (such as likelihood). In many applications it is useful to obtain multiple optimal and near-optimal histories so that their biological relevance can be explored through alternative criteria.
In this paper, we develop an Integer Linear Programming (ILP) solution for maximum likelihood reconstruction of ancestral PPI networks, using only extant network information. We use indicator variables to determine anchor and duplicated nodes at each step of evolution. Conditions imposed by the DMC model are formulated as linear constraints on each consecutive pair of networks during evolution. By construction, our algorithm can find the optimal solution, i.e., a solution that maximizes DMC-model based likelihood.
It is not known whether this problem is polynomial-time solvable. However, this appears unlikely, since the number of possible histories grows exponentially with each step. The advantage of an ILP framework is that it can leverage accurate and efficient heuristics that are being steadily improved by the optimization community, with readily available implementations in state-of-the-art general-purpose solvers (Gurobi, 2015). These improvements can automatically enhance the solution quality for the ancestral reconstruction problem. Another advantage of using ILP heuristics is that they can find multiple optimal and near-optimal solutions during their search of the solution space. Thus, they yield multiple reconstructions that can be examined for their biological relevance.
In experiments with synthetic datasets, our ILP solution obtains reconstructions with higher likelihood than those from ReverseDMC, which also shows that the greedy heuristic for this problem is not optimal. We evaluate our algorithm on two real biological networks that contain protein-protein interactions from the families of bZIP transcription factors and the Commander complex. Our ILP obtains solutions with higher likelihood on both these networks. We also examine the biological relevance of the results by comparing the inferred node arrival times, as well as the chosen duplicated nodes at each evolutionary step, in reconstructions from ReverseDMC and ILP. By corroboration with independent biological evidence, we find that ILP produces better results.
2 Problem Statement
Given a network Gt at time t, and a model of evolution ℳ that specifies a series of operations that generates Gg+1 from Gg, we want to find the most probable sequence of networks GS = G1, G2, …, Gt−1:
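One way to write this objective is sketched below. The symbol $G_S^{*}$ for the optimal sequence is introduced here for illustration and is an assumption, not notation taken from the original typeset equation.

```latex
G_S^{*} \;=\; \operatorname*{arg\,max}_{G_S} \; P\left(G_S \mid G_t, \mathcal{M}\right),
\qquad G_S = G_1, G_2, \ldots, G_{t-1}
```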
We now describe the model that we use and how likelihood is computed for the model, as given in Navlakha and Kingsford (2011).
The Duplication-Mutation with Complementarity (DMC) model assumes that the first network is a simple, connected two-node graph. It has two parameters, qcon and qmod, and network evolution from any network Gg to Gg+1 proceeds as follows (see fig. 1):
DMC Model. Left: Yellow anchor node selected. Middle: Anchor node is duplicated, with edges to all neighbors. Right: Some edges to neighbors are deleted (with probability qmod/2); the edge between the duplicated nodes is retained with probability qcon.
1. An anchor node u in Gg is selected at random and duplicated to form node v. Initially v is connected to all neighbors of u and to no other nodes.
2. For each neighbor x of u (x is also a neighbor of v), the connecting edge (u, x) or (v, x) is modified with probability qmod; if the edge is to be modified, then with equal probability, either edge (u, x) or (v, x) is deleted.
3. Edge (u, v) is added with probability qcon.
Since each time-step adds a node we denote each network by the number of nodes contained in it: Gg is a network with g nodes.
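The three steps of the model can be sketched as a short forward simulation. This is a minimal illustration, not the authors' implementation: the adjacency-dictionary representation, the integer node labels and the function names `dmc_step` and `grow` are assumptions made for this example.

```python
import random


def dmc_step(adj, q_mod, q_con, rng):
    """Apply one DMC growth step in place and return (anchor, duplicate).

    adj maps each integer node label to the set of its neighbors.
    """
    u = rng.choice(sorted(adj))      # step 1: anchor chosen uniformly at random
    v = max(adj) + 1                 # label for the new, duplicated node
    adj[v] = set(adj[u])             # v starts with exactly the neighbors of u
    for x in adj[v]:
        adj[x].add(v)
    for x in sorted(adj[u]):         # step 2: each shared neighbor x
        if rng.random() < q_mod:     # modify with probability q_mod
            loser = u if rng.random() < 0.5 else v
            adj[loser].discard(x)    # delete (u, x) or (v, x) with equal chance
            adj[x].discard(loser)
    if rng.random() < q_con:         # step 3: connect anchor and duplicate
        adj[u].add(v)
        adj[v].add(u)
    return u, v


def grow(n, q_mod, q_con, seed=0):
    """Grow a network to n nodes, starting from a connected two-node graph."""
    rng = random.Random(seed)
    adj = {1: {2}, 2: {1}}
    while len(adj) < n:
        dmc_step(adj, q_mod, q_con, rng)
    return adj
```

With q_mod = 0 and q_con = 1 no edge is ever deleted and every duplicate is linked to its anchor, so `grow(5, 0.0, 1.0)` yields the complete graph on 5 nodes, which is a convenient sanity check.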
Let euv indicate the edge between the anchor (u) and the duplicated node (v): it is 1 if the edge exists and 0 otherwise. From step 2 of the DMC model, the probability that u and v share a particular neighbor is (1 − qmod), and the probability that a node x is a neighbor of u and not of v, or a neighbor of v and not of u, is qmod/2. Let N(u) denote the set of neighbors of u. The intersection N(u) ⋂ N(v) is the set of common neighbors of u and v, and the symmetric difference N(u) Δ N(v) is the set of nodes that are neighbors of either u or v but not both. Then, given that u and v are the anchor and duplicated nodes respectively in Gg, we have, ignoring constant terms:
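A form consistent with these definitions is given below. It is reconstructed here from the stated probabilities rather than copied from the original typeset equation, so the exact layout is an assumption. The factor for choosing the anchor uniformly at random is treated as one of the ignored constants, and u and v themselves are excluded from the neighbor sets.

```latex
P\left(G_g \mid G_{g-1}, u, v\right) \;\propto\;
q_{con}^{\,e_{uv}} \, (1 - q_{con})^{\,1 - e_{uv}}
\prod_{x \in N(u) \cap N(v)} (1 - q_{mod})
\prod_{x \in N(u) \,\Delta\, N(v)} \frac{q_{mod}}{2}
\tag{1}
```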
Once u and v are identified, Gg−1 can be reconstructed by removing node v and adding edges between u and each node in N(v) – (N(u) ⋂ N(v)), since these edges were present before step 2 of the DMC model. Note that u and v are indistinguishable in Gg: either one of them may be deleted to form Gg−1 and the addition of edges follows mutatis mutandis. In the following we will refer to the pair of nodes u, v in Gg as duplicated nodes and u in Gg−1 as the anchor node.
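The single-step likelihood and the reverse step can be sketched as follows. This is an illustrative sketch only: the function names and the adjacency-dictionary representation are assumptions, the parameters are assumed to lie strictly between 0 and 1, and the restored edges are taken to be those between u and nodes adjacent to v but not to u, so that the merged node keeps every neighbor of either duplicated node.

```python
import math


def step_log_likelihood(adj, u, v, q_mod, q_con):
    """Log-probability (up to constants) that u, v are the duplicated pair.

    Assumes 0 < q_mod < 1 and 0 < q_con < 1.
    """
    nu = adj[u] - {v}                       # neighbors of u, excluding v
    nv = adj[v] - {u}                       # neighbors of v, excluding u
    common = len(nu & nv)                   # |N(u) intersect N(v)|
    exclusive = len(nu ^ nv)                # |N(u) symmetric-difference N(v)|
    ll = math.log(q_con) if v in adj[u] else math.log(1.0 - q_con)
    ll += common * math.log(1.0 - q_mod)
    ll += exclusive * math.log(q_mod / 2.0)
    return ll


def reverse_step(adj, u, v):
    """Return the previous network: v is removed and merged into u."""
    prev = {k: set(nb) for k, nb in adj.items() if k != v}
    for nb in prev.values():
        nb.discard(v)
    for x in adj[v] - {u}:                  # restore edges lost by u in step 2
        prev[u].add(x)
        prev[x].add(u)
    return prev
```

A greedy method in the style of ReverseDMC would evaluate `step_log_likelihood` for every pair of nodes, keep the best pair and call `reverse_step`; the ILP instead scores all steps jointly.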
3 ILP-based Solution
To recover the entire sequence GS, given the extant network Gt, we have to identify the following:
Anchor nodes in each of the networks G2, …, Gt−1,
Duplicated nodes in each of the networks G3, …, Gt,
Edges in each of the networks G3, …, Gt−1.
We will construct an Integer Linear Program (ILP) to obtain the solution. For each graph, G2, …, Gt, we will use binary edge indicators eijg that denote presence or absence of an edge and binary node indicators xig, yig, zig, aig. Subscripts i, j refer to nodes and g refers to network Gg that has nodes 1, …, g. We will set xig to 1 if the ith node in Gg is a duplicated node and aig to 1 if the ith node in Gg is an anchor node. To identify a common neighbor of the duplicated nodes, we will use the indicator yig and to identify a neighbor of either one of the duplicated nodes (but not both), we will use the indicator zig. Note that eijg, ∀i, j are known in networks G2 and Gt and unknown in all the other networks. All the binary node indicators are unknown in all the networks.
The log of the probability in equation 1 can now be expressed as:
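A sketch of such an objective in terms of the indicators defined above is given below. It is a reconstruction under stated assumptions, not the original typeset expression: the shorthand $E_g$ is introduced here for illustration, and the sum is taken over the networks that contain duplicated nodes. Because exactly two of the $x_{ig}$ equal 1 in each such network, $E_g$ is 1 exactly when the two duplicated nodes share an edge.

```latex
l_P \;=\; \sum_{g=3}^{t} \Bigg[
E_g \log q_{con} + (1 - E_g) \log (1 - q_{con})
+ \log (1 - q_{mod}) \sum_{k=1}^{g} y_{kg}
+ \log\!\left(\frac{q_{mod}}{2}\right) \sum_{k=1}^{g} z_{kg}
\Bigg],
\qquad
E_g \;=\; \sum_{i < j} e_{ijg}\, x_{ig}\, x_{jg}
```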
Thus we want to maximize lP subject to all the constraints (2 to 23 below) posed by the extant graph and the model, which we shall now describe.
3.1 Anchors, Duplicated Nodes and Neighbors
Each network, except G2, has exactly 2 duplicated nodes:
Each network, except Gt, has exactly 1 anchor node:
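The two counting conditions above can be written as follows. The numbering (2) and (3) is an assumption, chosen to match the statement that the constraints run from 2 to 23.

```latex
\begin{align}
\sum_{i=1}^{g} x_{ig} &= 2, \qquad g \in \{3, \ldots, t\} \tag{2} \\
\sum_{i=1}^{g} a_{ig} &= 1, \qquad g \in \{2, \ldots, t-1\} \tag{3}
\end{align}
```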
The product eijgxig is 1 if and only if the ith node is a duplicated node and there is an edge from the jth node to the ith node. If the kth node is a common neighbor there should be exactly 2 edges to the duplicated nodes in the network. Since there are only 2 duplicated nodes per network, for the kth node, the sum $\sum_{i} e_{ikg} x_{ig}$ can take only three values: 0, 1 or 2. For values 0 and 1, constraint 4 sets ykg = 0, and for value 2, constraints 4 and 5 set ykg = 1.
To identify a neighbor of one of the duplicated nodes, but not both, i.e. to set zkg, there should be exactly 1 edge to the duplicated nodes in the network. We also have to ensure that one of the duplicated nodes, which may also satisfy this criterion if the duplicated nodes have an edge between them, is not selected. We can pose these constraints using an auxiliary binary node variable wkg:
Since there are only 2 duplicated nodes per network, for the kth node, the sum $\sum_{i} e_{ikg} x_{ig}$ can take only three values: 0, 1 or 2.
If the value is 2, then constraints 4 and 5 ensure that ykg = 1 and constraint 6 sets wkg = 0 yielding zkg = 0 through constraint 8.
If the value is 1, then wkg = 1 since constraints 4 and 5 ensure that ykg = 0. In this case if xkg = 1 then constraint 9 ensures that zkg = 0 and if xkg = 0 then constraint 7 ensures that zkg = 1.
Finally, if the value is 0, then wkg = 0 (constraints 4, 5, 6) and zkg = 0 through constraint 8.
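The case analysis above can be checked by brute force. The sketch below assumes one particular linear system that is consistent with it, writing s for the sum of edges from node k to the duplicated nodes: (4) 2y ≤ s, (5) y ≥ s − 1, (6) w = s − 2y, (7) z ≥ w − x, (8) z ≤ w, (9) z ≤ 1 − x. These forms are an assumption made for illustration; the constraints as typeset in the original are not recoverable from this text and may differ.

```python
from itertools import product


def feasible(s, x):
    """All (y, w, z) in {0,1}^3 that satisfy the assumed constraints 4-9.

    s is the number of edges from node k to the duplicated nodes (0, 1 or 2)
    and x is 1 if node k is itself a duplicated node.
    """
    solutions = []
    for y, w, z in product((0, 1), repeat=3):
        if (2 * y <= s and y >= s - 1      # assumed (4) and (5): y = 1 iff s = 2
                and w == s - 2 * y         # assumed (6): w = 1 iff s = 1
                and z >= w - x             # assumed (7)
                and z <= w                 # assumed (8)
                and z <= 1 - x):           # assumed (9)
            solutions.append((y, w, z))
    return solutions


if __name__ == "__main__":
    for s, x in product((0, 1, 2), (0, 1)):
        print(s, x, feasible(s, x))
```

Each combination of s and x admits exactly one assignment, so under these assumed constraints the solver has no freedom in y, w and z once the edges and duplicated nodes are fixed.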
We use another binary node variable nkg to indicate a neighbor of a duplicated node, which may be a common neighbor or neighbor of either of the duplicated nodes:
3.2 Phantom Edges
During reconstruction, we have to learn the correspondence between nodes in Gg and nodes in the previous network Gg−1 to set the values of the unknown edges. In particular, we want to associate the duplicated nodes in network Gg with the anchor node in Gg−1. To learn this association, we use indicator variables for pairs of nodes (ig−1, jg) where the subscript indicates the network to which the node belongs. Since these are edges that do not exist in the network, but are artificial constructions for our inference, we call them phantom edges. We can view them as directed edges to a network from the previous network. See fig. 2 for an illustration.
Phantom edges between networks. Each node ig−1 of a network Gg−1 is connected to all the nodes jg in Gg through phantom edges. Not all phantom edges are shown.
On each node jg in a network, except in G2, there must be exactly one incoming phantom edge from any of the nodes (ig−1) in the previous network:
From each node (ig−1) in the (previous) network, except from Gt, there must be at least 1 and at most 2 outgoing phantom edges. Anchor nodes will have 2 phantom edges and all other nodes will have only 1:
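The two degree conditions on phantom edges can be illustrated with a small checker. This is a sketch under assumptions: the selected phantom edges are represented as a dictionary `parent` (a hypothetical name) that maps each node of Gg to the single node of Gg−1 whose phantom edge enters it, so the condition of exactly one incoming edge holds by construction.

```python
from collections import Counter


def check_phantom_edges(parent, prev_nodes):
    """Validate phantom edges and return (anchor, duplicated_nodes).

    parent[j] is the node of G_{g-1} whose phantom edge enters node j of G_g.
    prev_nodes lists the nodes of G_{g-1}.
    """
    out_degree = Counter(parent.values())
    if set(out_degree) != set(prev_nodes):
        raise ValueError("each node of the previous network needs at least "
                         "one outgoing phantom edge, and no others may occur")
    if any(d > 2 for d in out_degree.values()):
        raise ValueError("at most 2 outgoing phantom edges per node")
    anchors = [i for i, d in out_degree.items() if d == 2]
    if len(anchors) != 1:
        raise ValueError("exactly one node (the anchor) has 2 phantom edges")
    anchor = anchors[0]
    duplicated = sorted(j for j, i in parent.items() if i == anchor)
    return anchor, duplicated
```

For example, `check_phantom_edges({1: 1, 2: 2, 3: 1, 4: 3}, [1, 2, 3])` identifies node 1 of the previous network as the anchor and nodes 1 and 3 of the current network as the duplicated pair.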
3.3 Edge Reconstruction
We now add the final set of constraints for edges in all the ancestral networks that are determined by the model and edges in the extant network. This is done by mapping edges from Gg to Gg−1 for which we will use the phantom edges. The known edges in the extant network shall be mapped backwards up to the first graph G2. We have to ensure the following three conditions:
An edge between duplicated nodes should not be mapped to any edge in the previous network since both duplicated nodes map to the same anchor node.
An edge (xg, ng) between a duplicated node xg and its neighbor ng in network Gg should be mapped to an edge (ag−1, ng−1) between the anchor ag−1 and its neighbor ng−1 in network Gg−1.
Any other edge should be mapped back to a unique edge in the previous network and there should be no other unmapped edge in the previous network.
To set these constraints, we will use three variables defined as follows. A binary indicator variable, for two nodes ig and jg in Gg, is defined as
It is non-zero if and only if there are two phantom edges from an anchor node kg–1 in Gg–1 to ig and jg in Gg. For edge(i,j), each term in
is the product of
. This term has value 1 iff
which creates a mapping from nodes i, j to the anchor node in the previous network. See fig. 3.
For each pair of nodes (i, j), we use phantom edges to find the appropriate mapping. Variables encode different possible conditions all of which are not true at the same time.
is used to map i, j to a single anchor node,
are used to map i, j to an anchor node and its neighbor and Tijg is used for all other cases.
Another binary indicator variable, for two nodes ig and jg in Gg, is defined as
It is non-zero if and only if there are two phantom edges from an anchor node al(g−1) and its neighbor (1 − ak(g−1)) connecting them respectively to ig and jg in Gg and there is an edge elk(g−1) in Gg−1. For a symmetric condition, for phantom edges from an anchor node al(g−1) and its neighbor (1 − ak(g−1)) connecting them respectively to jg and ig in Gg, we define another binary indicator variable, for two nodes ig and jg in Gg, as
Each term in the sums
and
is used to create a mapping from nodes i, j to an anchor node and its neighbor in the previous network. See fig. 3.
Finally, another binary indicator variable, for two nodes ig and jg in Gg, is defined as
It is non-zero if and only if there are two phantom edges from (any) nodes kg−1 and lg−1 in Gg−1 to ig and jg in Gg respectively and there is an edge elk(g−1). Each term in Tijg is a product of the phantom edges incoming at i and j in Gg and the edge elk(g−1) in the previous network Gg−1, which when set to 1 creates a mapping from edge (i, j) ∈ Gg to edge (l, k) ∈ Gg−1. See fig. 3.
We set the constraints for each pair of nodes (ig, jg) in graph Gg based on node indicators for duplicated nodes (xig) and neighbor nodes (nig):
If both (ig, jg) are duplicated nodes, i.e. xigxjg = 1, then we have to set
to ensure that duplicated nodes connect to an anchor node in the previous network. Other indicators,
to ensure that no edge in Gg−1 is mapped to an edge, if any, between ig and jg. See fig. 4.
If the nodes (ig, jg) are such that one of them is a duplicated node and the other a neighbor, i.e. xignjg = 1 or xjgnig = 1, then we set
so the anchor node in the previous network does not connect to this pair through any phantom edges, and we set
to ensure that phantom edges connect the anchor and its neighbor in the previous graph to nodes (ig, jg).
Note that there may not be an edge between (ig, jg), if jg is a neighbor to the other duplicated node and not ig as shown in fig. 5. This should still set the above constraints since both the duplicated nodes map to the anchor. We set Tijg = 1 to ensure that there is exactly one edge between (lg−1, kg−1) and Tijg ≥ eijg since there may or may not be an edge between (ig, jg).
Note that this and the previous cases are mutually exclusive since nig and xig are never both set to 1 for the same node.
If neither of the above cases holds, i.e. xigxjg = xignjg = xjgnig = 0, then we set
since we do not want an edge between (ig, jg) to map to any edge connecting to an anchor in the previous network and we set Tijg = eijg to ensure that there is a single edge (lg−1, kg−1) if eijg = 1. If eijg = 0, then this ensures there is no edge in the previous network mapped to (ig, jg). See fig. 6.
If both (ig, jg) are duplicated nodes (denoted by x), then shall connect the duplicated nodes to a single anchor node (denoted by a) in the previous network.
A duplicated node (x) and a neighbor (y or z) must connect to an anchor (a) and its neighbor in the previous network. This is done through the variables . Above: A duplicated node and neighbor of the other duplicated node, Below: A duplicated node and its own neighbor. Both have to be mapped to the same two nodes in the previous network.
Above: An edge between non-duplicated nodes is mapped back to an edge in the previous network. Below: If there is no edge between a pair of non-duplicated nodes, there should be no edge in the mapped nodes in the previous network.
The above three sets of conditions are incorporated in the following constraints:
We define an auxiliary binary variable Pijg that is set to 0 if
and
and 1 otherwise (i.e. the logical OR); also, we set Qijg = xigxjg:
Variables Tijg and eijg are set using Pijg and Qijg:
If Qijg = xigxjg = 1, then constraints 21 and 22 ensure that Tijg = Pijg = 0 since both
and
are 0. If Pijg = 0, Qijg = 1, then constraint 22 is void and constraint 23 ensures that eijg ≥ Tijg.
If Qijg = xigxjg = 0 and Pijg = 1 (i.e. either
or
is 1 which is only possible if xignjg = 1 or xjgnig = 1) then constraint 20 ensures that Tijg = 1. If Pijg = 1, Qijg = 0, then constraint 23 is void and constraint 22 ensures that eijg ≤ Tijg.
If Qijg = xigxjg = 0 and Pijg = 0, then constraints 20 and 21 do not impose any value on Tijg. If Pijg = 0, Qijg = 0, then constraints 22 and 23 ensure that eijg = Tijg.
Finally, we set eijg = ejig ∀ig, jg, ∀g ∈{2, …, t} to ensure that the edges are undirected.
3.4 Linearization
The constraints as described above have terms that are products of binary variables and sums of such products. A constraint y = x1x2⋯xn, where each variable is binary, is equivalent to the following n + 1 constraints: y ≤ x1, y ≤ x2, …, y ≤ xn, y ≥ x1 + x2 + … + xn − (n−1). Sums of products can be decomposed using auxiliary binary variables. For example, y = x1x2 + x3x4 can be expressed as y = z1 + z2, z1 = x1x2, z2 = x3x4 and further linearized using the previous rule.
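The standard linearization of a product of binary variables, with y bounded above by each xi and bounded below by the sum of the xi minus (n − 1), can be verified exhaustively for small n. The sketch below is a check written for this purpose; the function name is an assumption.

```python
from itertools import product


def product_linearization_ok(n):
    """Check that the n + 1 linear constraints force y = x1 * x2 * ... * xn.

    Constraints tested: y <= xi for every i, and y >= sum(x) - (n - 1).
    """
    for xs in product((0, 1), repeat=n):
        target = int(all(xs))               # value of the product
        allowed = [y for y in (0, 1)
                   if all(y <= x for x in xs) and y >= sum(xs) - (n - 1)]
        if allowed != [target]:             # exactly one y must remain
            return False
    return True
```

For every assignment of the xi the constraints leave a single feasible value of y, equal to the product, which is why the auxiliary variable can replace the product in the ILP without changing its solutions.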
3.5 Multiple Solutions
Since ILP is, in general, NP-hard, no polynomial-time algorithm is known and optimal solutions for very large networks may not be found in reasonable time. However, many heuristics have been developed to find multiple near-optimal solutions, e.g., see Wallace (2010), with efficient software implementations (Gurobi, 2015). These heuristics enable us to find multiple solutions and examine their biological relevance.
4 Experiments
4.1 Simulations
We simulated 1215 extant networks, with the number of nodes in the extant network varying from 6 to 10. For each network, evolution is simulated following the DMC model starting from an initial network of two connected nodes. The DMC model requires two parameters, qcon and qmod. For each simulation, each parameter is randomly chosen from the closed interval [0.1, 0.9], rounded to one decimal. We reconstruct the network sequence using ReverseDMC, the greedy approach of Navlakha and Kingsford (2011), and our ILP.
Likelihood Comparison
Table 1 shows that out of 1215 simulations, there were no simulations where solutions from ILP had a lower likelihood than that of ReverseDMC. Since these are relatively small networks, both ReverseDMC and ILP were able to find optimal solutions in many cases. The fact that ILP could find 274 solutions with higher likelihood shows that ReverseDMC is not guaranteed to find optimal solutions. Table 2 shows the summary statistics of the increase in log-likelihood due to ILP, compared to ReverseDMC solutions. The increase can range from 0.04 to 2.33, even in these relatively small networks.
Number of simulations where the log-likelihood of the reconstructed solutions were equal for both methods (2nd column), higher for ILP (3rd column) or higher for ReverseDMC (4th column).
Summary statistics of increase in log-likelihood among the 274 simulations where ILP reconstructions had higher likelihood than ReverseDMC reconstructions. SD: standard deviation.
4.2 Real Networks
We reconstruct the history of two protein-protein interaction networks using both ReverseDMC and ILP algorithms. In both algorithms, we assume qcon = 0.7, qmod = 0.4.
We evaluate the biological relevance of the results in two ways. First, we compare the node arrival times of the reconstructions following the procedure described in Navlakha and Kingsford (2011). The key idea is to estimate the protein arrival time using available ortholog information, with the assumption that proteins that arrive earlier in history have higher number of orthologs. Thus, the list of proteins in the extant network in descending order of number of orthologs is considered to be the ‘true’ node arrival order (AT). We determine the number of orthologs for each protein using OrthoDB (Kriventseva et al., 2018), by counting the number of genes at the highest level at which ortholog information was available for all the proteins in the networks (vertebrata for bZIP and metazoa for Commander). The reconstruction history of both Greedy and ILP identifies the removed node at each step: this provides the reconstructed node arrival order (AR) for each algorithm. AT and AR are compared using Kendall’s Tau (Kendall, 1945) that measures correlation between two ranked lists (definition given in appendix). Higher values indicate better correlation.
Our second evaluation is based on the sequence similarities between all the inferred anchors and duplicated nodes. Since at each time step in evolution (by the DMC model) the anchor gene (a) duplicates into another gene (d), we expect the pairwise similarity between a and d to be higher than the pairwise similarity between a and the remaining genes at that time step. Given the extant network Gt and its reconstructed evolutionary history: ĜS = Ĝ3, …, Ĝt−1, Gt, along with chosen anchors and duplicated nodes in each network, we compute a score ρ(Ĝi) for each network Ĝi ∈ ĜS, using pairwise sequence similarity (Needleman and Wunsch, 1970) between the chosen anchor node protein and the duplicated node protein. The final score for the reconstruction, which we call the Anchor-Duplicate Similarity Score (ADSS), is given by $\mathrm{ADSS} = \frac{1}{|\hat{G}_S|} \sum_{\hat{G}_i \in \hat{G}_S} \rho(\hat{G}_i)$, where we normalize by the number of networks in ĜS. Ĝ2 is not considered since in the first evolutionary step (from Ĝ1 to Ĝ2) there is only one gene that duplicates and there are no other genes to compare with. Thus, given two reconstructions of the same extant network, a higher ADSS indicates a better choice of anchor and duplicate nodes in the reconstruction.
bZIP Transcription Factors
The basic-region leucine zipper (bZIP) transcription factors are a protein family involved in many cellular processes including the regulation of development, metabolism, circadian rhythm, and response to stress and radiation (Amoutzias et al., 2006; Pinney et al., 2007). The interactions between these proteins are strongly mediated by their coiled-coil leucine zipper domains and so, the strength of these interactions can be accurately predicted using just sequence information (Fong et al., 2004). With the method of Fong et al. (2004), Pinney et al. (2007) constructed extant networks on a set of bZIP proteins for multiple species. We took the H. sapiens network and merged subunits for the same protein into one node, to obtain the extant network used in our experiment (fig. 7).
Extant bZIP network used in our experiment.
Table 3 shows the Likelihood, Node Arrival Time Accuracy (measured by Kendall’s Tau) and ADSS for Ancestral Reconstruction of the bZIP Network by both ReverseDMC and ILP. With respect to all three metrics, the solution obtained by ILP is better than that of ReverseDMC. Table 4 shows the order of arrival of proteins inferred by the reconstructions from ReverseDMC and ILP. Sequence-based phylogenetic analysis of bZIP transcription factors by Amoutzias et al. (2006) revealed a highly conserved ancient core network containing proteins JUN, FOS and ATF3, that provides additional evidence of the correctness of our reconstruction. In table 4 we observe that these three proteins appear early in the order inferred by ILP (before the seventh step) while JUN and ATF3 arrive after the seventh step in the order inferred by ReverseDMC.
Likelihood, Node Arrival Time Accuracy and ADSS for Ancestral Reconstruction of the bZIP Network
Arrival order of anchor proteins in the bZIP network, at each step of evolution, based on reconstructions from ReverseDMC and ILP.
Commander Network
Commander is a multiprotein complex that is broadly conserved across vertebrates and is involved in several roles including pro-inflammatory signaling and vertebrate embryogenesis (Mallam and Marcotte, 2017). A well characterized sub-complex of Commander, CCC, made of COMMD1, CCDC22, CCDC93 and C16orf62, is known to be involved in endosomal protein trafficking (Bartuzi et al., 2016; Mallam and Marcotte, 2017). Defects in the Commander complex are associated with developmental disorders (Mallam and Marcotte, 2017; Liebeskind et al., 2018). Reconstructing the evolutionary history of interactions in the complex can shed light on the conservation and stability of the proteins and their interactions, which in turn can aid understanding of the sources of dysfunction of the complex. We use the network discussed in (Liebeskind et al., 2018), shown in fig. 8, as the extant network for ancestral reconstruction.
Extant Commander network used in our experiment.
Table 5 shows the Likelihood, Node Arrival Time Accuracy (measured by Kendall’s Tau), and ADSS for ancestral reconstruction by both ReverseDMC and ILP. On this network too, on all three metrics, the solution obtained by ILP is better than that of ReverseDMC. Table 6 shows the order of arrival of proteins inferred by the reconstructions from ReverseDMC and ILP. Among all the commander proteins, COMMD1 is the best studied and is found to be highly conserved with multiple key functions (Riera-Romo, 2018). Indeed, in OrthoDB, COMMD1 has the maximum number of orthologs, among these proteins. In the reconstruction by ILP, COMMD1 is seen to arrive early, at the third step, while in the reconstruction from ReverseDMC it arrives only at the eighth step of evolution.
Likelihood, Node Arrival Time Accuracy and ADSS for Ancestral Reconstruction of the Commander Network
Arrival order of anchor proteins in the commander network, at each step of evolution, based on reconstructions from ReverseDMC and ILP.
5 Conclusion
We presented an Integer Linear Programming (ILP) based solution for maximum-likelihood reconstruction of the evolution of a PPI network using the Duplication-Mutation with Complementarity (DMC) model. We used indicator variables to determine anchor and duplicated nodes and derived the conditions imposed by the DMC model as linear constraints on each network, and on each consecutive pair of networks, at each step during evolution. By construction, the ILP can find the optimal solution, and heuristics from general-purpose ILP solvers can be used to find multiple optimal and near-optimal solutions efficiently.
We compared the solutions obtained by our ILP with those from ReverseDMC (Navlakha and Kingsford, 2011), the previous best algorithm for this problem. On simulated data, we found that ILP always obtains solutions that are of equal or higher likelihood than those from ReverseDMC. We evaluated both the algorithms on two real PPI networks, containing proteins from the bZIP transcription factors and Commander complex respectively. On both the networks, solutions from our ILP had higher likelihood and were in better agreement with independent biological evidence from ortholog information and sequence similarity.
This is the first ILP solution to a model-based network reconstruction problem and the presented framework may be useful for other network models as well. The ILP framework could be generalized to handle multiple input networks as well as to take into account additional information, such as gene duplication histories. A limitation of our solution is that it can take considerable time to find reconstructions for very large networks. However, the ILP framework can yield deeper insights into the structure of the problem and specialized heuristics could be developed to obtain more efficient and scalable solutions.
Funding
V.R. was supported by Singapore Ministry of Education Academic Research Fund [R-253-000-139-114]. C.K. was supported in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4554 to C.K., by the US National Science Foundation (CCF-1256087, CCF-1319998) and by the US National Institutes of Health (R01GM122935). X.Z. was supported by grant #220558 from the Ragon Institute of MGH, MIT and Harvard.
Disclosure Statement
C.K. is a co-founder of Ocean Genomics.
Acknowledgements
Part of the work was performed during XZ’s visit at the Simons Institute for the Theory of Computing at University of California, Berkeley. We thank Rob Patro for sharing the bZIP network data used in their publication (Patro and Kingsford, 2013) which is originally from (Pinney et al., 2007).
Appendix
Kendall’s Tau
We use the version that accounts for ties, given by $\tau = \frac{P - Q}{\sqrt{(P + Q + T)(P + Q + U)}}$, where P is the number of concordant pairs, Q the number of discordant pairs, T the number of ties only in AT, and U the number of ties only in AR. If a tie occurs for the same pair in both AT and AR, it is not added to either T or U. Here we consider pairs of observations (xi, yi), (xj, yj) where xi, xj ∈ AT, yi, yj ∈ AR and i < j. A pair (xi, yi), (xj, yj) is concordant if the ranks of both elements agree, i.e., both xi < xj and yi < yj; or both xi > xj and yi > yj. A pair (xi, yi), (xj, yj) is discordant if xi > xj and yi < yj or if xi < xj and yi > yj. If xi = xj or yi = yj, it is considered a tie.
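The definition above can be sketched directly as a pair-counting routine. This is a minimal quadratic-time illustration that assumes the usual tau-b form, (P − Q) divided by the square root of (P + Q + T)(P + Q + U); the function name and the convention of passing the two orderings as equal-length lists of ranks are assumptions made for the example.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's Tau with ties, from ranks xs (for AT) and ys (for AR)."""
    if len(xs) != len(ys):
        raise ValueError("both rank lists must have the same length")
    P = Q = T = U = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue             # tied in both lists: in neither T nor U
            elif dx == 0:
                T += 1               # tie only in the first list
            elif dy == 0:
                U += 1               # tie only in the second list
            elif dx * dy > 0:
                P += 1               # concordant pair
            else:
                Q += 1               # discordant pair
    denom = math.sqrt((P + Q + T) * (P + Q + U))
    return (P - Q) / denom if denom else float("nan")
```

Identical orderings give 1.0 and fully reversed orderings give −1.0; for larger inputs a library routine with a faster algorithm would normally be preferred.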