## 1 Abstract

Characterizing cellular responses to different extrinsic signals is an active area of research, and curated pathway databases describe these complex signaling reactions. Here, we revisit a fundamental question in signaling pathway analysis: are two molecules “connected” in a network? This question is the first step towards understanding the potential influence of molecules in a pathway, and the answer depends on the choice of modeling framework. We examined the connectivity of Reactome signaling pathways using four different pathway representations. We find that Reactome is very well connected as a graph, moderately well connected as a compound graph or bipartite graph, and poorly connected as a hypergraph (which captures many-to-many relationships in reaction networks). We present a novel relaxation of hypergraph connectivity that iteratively increases connectivity from a node while preserving the hypergraph topology. This measure, *B*-relaxation distance, provides a parameterized transition between hypergraph connectivity and graph connectivity. *B*-relaxation distance is sensitive to the presence of small molecules that participate in many functionally unrelated reactions in the network. We also define a score that quantifies one pathway’s downstream influence on another, which can be calculated as *B*-relaxation distance gradually relaxes the connectivity constraint in hypergraphs. Computing this score across all pairs of 34 Reactome pathways reveals two case studies of pathway influence, and we describe the specific reactions that contribute to the large influence score. Our method lays the groundwork for other generalizations of graph-theoretic concepts to hypergraphs in order to facilitate signaling pathway analysis.

## 2 Introduction

A major effort in molecular systems biology is to identify signaling pathways, the networks of reactions that link extracellular signals to downstream cellular responses. Computational representations of signaling pathways have increased in complexity, moving from gene sets to pairwise interactions in the past two decades [1]. Graphs are common representations of protein networks, where nodes are proteins and edges represent pairwise interactions between two proteins. While graph representations have been useful for pathway analysis [2–5] and disease-related applications [5–7], the limitations of graphs for representing biochemical reactions are well recognized [8–12].

Many pathway databases [13–20] have adopted reaction-centric signaling pathway formats such as the Biological Pathway Exchange (BioPAX) [21], which provides more mechanistic information about the interactions. As reaction-centric information has become available, many modeling frameworks have been proposed to overcome the limitations of graphs for analyzing signaling pathway structure [8, 9, 22, 23]. Compound graphs [24, 25] and metagraphs [8] aim to represent protein complexes and hierarchical relationships among molecular entities in the cell. Factor graphs [26] have been used to infer pathway activity from heterogeneous data types. Hypergraphs [27, 28] are generalizations of directed graphs that allow multiple inputs and outputs, and their realization as a model for signaling pathways is emerging [9, 11, 29]. Other models such as Petri nets [30] and logic networks [31, 32] move away from structural network analysis and towards discrete dynamic modeling. Many of these modeling frameworks have an underlying bipartite graph structure.

These new representations have improved fidelity to the underlying biology of signaling reactions but also exhibit increased mathematical and algorithmic complexity. In this light, we examine a fundamental topological concept: when are two molecules “connected” in a signaling pathway? Defining and establishing connectivity is the first step to determining downstream or upstream elements of a molecule, which may indicate the influence of its activity or the effect of its perturbation. Connectivity is also central to computational methods for identifying potential off-target effects, determining pathway crosstalk, and computing portions of pathways that may be altered in disease.

We begin by considering existing connectivity measures on four distinct representations of the Reactome pathway database [13, 14]. We demonstrate that these measures range from highly permissive (e.g., path-based connectivity in graphs) to very restrictive (e.g., connectivity in directed hypergraphs), depending on the representation. Thus, two molecules may be “connected” in one representation of a pathway and “disconnected” in another representation. We then introduce *B*-relaxation distance, a parameterized relaxation of connectivity that offers a tradeoff between the permissive and restrictive representations. We show that this new version of connectivity uncovers more subtle structures within the pathway topologies than previous measures, and is sensitive to the presence of small molecules that participate in many reactions. We then consider 34 Reactome signaling pathways and use *B*-relaxation distance to capture the downstream influence of one pathway on another. *B*-relaxation distance allows us to gradually relax the connectivity constraints in hypergraphs, calculating pathway influence at each step. The graph representation of Reactome is too highly connected to enable the discovery of such relationships. We also show that the bipartite graph representation, although not as highly connected as the graph, does not support this type of result. We describe two case studies of pathway influence that we recovered, and describe the specific reactions that contribute to the large influence score.

## 3 Results

### 3.1 Connectivity analysis using established traversal algorithms

We considered four established directed representations of signaling pathway topology and their associated measures of connectivity (Fig. 1). Directed graphs describe relationships among molecules (proteins and small molecules), while the other models describe relationships among entities that include proteins and small molecules, their modified forms, protein complexes, and protein families. The Methods give full details about these representations, including how they are built.

**Directed graphs** represent molecules as nodes and interactions as pairwise edges. Interactions may be directed (such as regulation) or bidirected (such as physical binding). We use a Breadth First Search (BFS) traversal to find connected nodes.

**Compound graphs** represent interactions between pairs of nodes, which may be molecules or groups of molecules (e.g., protein complexes or protein families). We use a previously established algorithm that traverses the BioPAX structure as a compound graph according to biologically meaningful rules [25].

**Bipartite graphs** contain two types of nodes: entity nodes and reaction nodes. Each biochemical reaction has an associated reaction node, whose incoming edges are connected to reactants and whose outgoing edges are connected to products. For each entity node, we use BFS to compute the set of connected entity nodes.

**Directed hypergraphs** represent reactions with many-to-many relationships, where each hyperedge *e* = (*T*_{e}, *H*_{e}) has a set of entities in the tail *T*_{e} and a set of entities in the head *H*_{e}. We adopt a definition of connectivity called *B*-connectivity that requires all the nodes in the tail of a hyperedge to be visited before it can be traversed [28]. This definition has a natural biological meaning in reaction networks: *B*-connectivity requires that all reactants of a reaction be present for any product of that reaction to be reachable [11, 28].

We converted the Reactome pathway database to each of the four representations in an effort to determine if they agreed on connectivity (Table 1). The directed graph has far more edges than the other representations since it represents protein complexes as complete graphs (cliques). The hypergraph has more nodes than the graph since it represents protein complexes, families and modified forms as distinct entities. However, since each hyperedge is a multi-way relationship, the number of hyperedges is smaller than the number of edges in directed graphs. The compound graph and bipartite graph have the same node set as the hypergraph, but contain more edges since they describe relationships among entities using pairwise edges. Note that the number of nodes in the bipartite graph is the sum of the number of nodes and the number of hyperedges in the hypergraph, by definition.
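The relationship between the hypergraph and the bipartite graph can be illustrated with a minimal sketch. The names `to_bipartite` and `connected_entities`, the `("rxn", i)` labels for reaction nodes, and the toy reactions are all illustrative assumptions, not part of the Reactome conversion described in the Methods; entity names are assumed to be strings.

```python
from collections import deque

def to_bipartite(hyperedges):
    """Build a bipartite adjacency map: entity -> reaction node -> entity."""
    adj = {}
    for i, (tail, head) in enumerate(hyperedges):
        rxn = ("rxn", i)                       # one reaction node per hyperedge
        adj[rxn] = set(head)                   # reaction -> products
        for u in tail:
            adj.setdefault(u, set()).add(rxn)  # reactant -> reaction
        for v in head:
            adj.setdefault(v, set())           # register products as nodes
    return adj

def connected_entities(adj, source):
    """BFS from an entity node; return the entity nodes reached (incl. source)."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    # Reaction nodes are tuples in this sketch; entities are strings.
    return {n for n in seen if not isinstance(n, tuple)}

# Toy hypergraph: A + B -> C, then C -> D
hyperedges = [({"A", "B"}, {"C"}), ({"C"}, {"D"})]
adj = to_bipartite(hyperedges)
print(len(adj))                      # 4 entities + 2 hyperedges
print(connected_entities(adj, "A"))  # A reaches C and D although B is absent
```

In this toy example, BFS on the bipartite graph lets `A` reach `C` and `D` through the first reaction even though the co-reactant `B` is never visited, which is the permissiveness that *B*-connectivity rules out.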

In the directed graph representation, nearly 90% of the nodes reached over 80% of the network due to the large number of edges (Fig. 2A). For the other representations, we surveyed the same 19,650 entities representing proteins, small molecules, complexes, and families. We observed two sharp peaks for both the compound and bipartite graph representations: nodes that reach a large portion of the network and nodes that reach very few nodes in the network. Two-thirds of the nodes in the compound graph representation reach 50% of the network while half the nodes in the bipartite graph representation reach about 40% of the network (Fig. 2B–C). In the hypergraph representation, only five of the nodes are connected to more than 20 others in terms of *B*-connectivity, and most of the nodes cannot reach any others (Fig. 2D). In hypergraphs, the *B*-connectivity requirement of visiting all nodes in the tail of a hyperedge before traversal is overly strict for Reactome’s topology.

### 3.2 *B*-relaxation distance on hypergraphs

Connectivity in the four representations of Reactome largely exhibits an all-or-nothing behavior: nodes are connected either to very few or to a large fraction of all other nodes. We introduce *B*-relaxation distance, a parameterized relaxation of hypergraph *B*-connectivity that naturally bridges the gap between *B*-connectivity in directed hypergraphs and connectivity in bipartite graphs. When we consider the connectivity from a node *v* in the hypergraph, nodes with a *B*-relaxation distance of 0 from *v*, denoted *B*_{0}, are exactly the nodes that are *B*-connected to *v*. Nodes with a *B*-relaxation distance of 1 (*B*_{1}) are reached by allowing one hyperedge to be freely traversed, lifting the restriction that all nodes in the tail must be visited in order to traverse the hyperedge. In general, nodes with a *B*-relaxation distance of *k* (*B*_{k}) require *k* hyperedges to be freely traversed. For shorthand, we use *B*_{≤k} to denote the set of nodes with a *B*-relaxation distance from a source node of at most *k*. A formal definition and efficient algorithms for computing *B*-relaxation distance appear in the Methods.

We computed the *B*-relaxation distance from every node in the hypergraph to every other node and plotted *|B*_{≤k}*|* for different values of *k* (Fig. 3A). The first column (*k* = 0) is the number of *B*-connected nodes for each source, a histogram of which is shown in Fig. 2D. The last column (*k* = 49) corresponds to the other extreme: for each source node, we display the number of nodes that are *B*-connected to the source while requiring that only one node in the tail of a hyperedge needs to be connected to the source for us to determine that every node in the head of the hyperedge is reachable from the source. The nodes reached for such a large value of *k* for each source are exactly the nodes that are connected to the source in the bipartite graph representation (Fig. 2C). As in Fig. 2C, we observe that the nodes are divided into two sets: the top blue half are nodes that are connected to very few others and the bottom yellow half are the nodes that are connected to about 40% of the bipartite graph.

The nodes in the bottom half of Fig. 3A exhibited a transition from reaching very few nodes (blue) to reaching many nodes (yellow). The rapidity of this transition suggested that a small number of nodes may be responsible for it. We hypothesized that these nodes may be small molecules, e.g., ATP, water, sodium and potassium ions, that participate in a vast number of reactions that are functionally unrelated. Consequently, we pruned the hypergraph by removing the 2,778 nodes labeled as small molecules by Reactome, as well as three other highly-connected entities (cytosolic Ubiquitin, nuclear Ubiquitin, and the Nuclear Pore Complex). We also removed hyperedges with an empty tail or head. In total, we altered 5,180 hyperedges by removing these entities, resulting in a filtered hypergraph with 15,440 nodes and 8,773 hyperedges. In this hypergraph, fewer nodes are connected to many others, and the transition from low to high connectivity is more gradual across different source nodes (Fig. 3B). In contrast, removing small molecules from the directed graph changed the distribution very little, suggesting that small molecules played only a minor role in the high level of connectivity in directed graphs (Supplementary Fig. S1). Others have noted that small molecules increase pathway connectivity through reactions that are not intended to be sequential, so we repeated the *B*-relaxation distance survey after removing 155 ubiquitous small molecules flagged by PathwayCommons [19] from the full hypergraph. As expected, the *B*-relaxation distance survey on this hypergraph reveals a pattern between that of the full hypergraph and that of the hypergraph with small molecules removed (Supplementary Fig. S2).
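The pruning step can be sketched as follows. This is a minimal illustration rather than the filtering code used for Reactome: the function name `prune_hypergraph`, the representation of a hyperedge as a `(tail, head)` pair of sets, and the convention of counting a hyperedge as "altered" when any of its nodes is removed are assumptions made for the example.

```python
def prune_hypergraph(hyperedges, removed):
    """Remove the given nodes from every tail and head.

    Hyperedges whose tail or head becomes empty are dropped.
    Returns the pruned hyperedge list and the number of hyperedges
    that lost at least one node (including those that were dropped).
    """
    removed = set(removed)
    pruned = []
    altered = 0
    for tail, head in hyperedges:
        new_tail = frozenset(tail) - removed
        new_head = frozenset(head) - removed
        if len(new_tail) < len(tail) or len(new_head) < len(head):
            altered += 1
        if new_tail and new_head:      # keep only well-formed hyperedges
            pruned.append((new_tail, new_head))
    return pruned, altered

# Toy example: a phosphorylation that consumes ATP, an ATP-only
# reaction, and a downstream reaction that is left untouched.
hyperedges = [
    ({"A", "ATP"}, {"A-P", "ADP"}),
    ({"ATP"}, {"ADP"}),
    ({"A-P"}, {"B"}),
]
pruned, altered = prune_hypergraph(hyperedges, {"ATP", "ADP"})
print(pruned)
print(altered)
```

Here the first hyperedge survives as `A -> A-P`, the second is dropped because both sides become empty, and the third is unchanged.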

From these results, we concluded that we had a promising definition of parameterized distance that allowed us to relax the strict assumptions posed by *B*-connectivity, and a hypergraph where reachability was not affected by ubiquitous molecules that participate in many reactions. For the remainder of this study, we use the hypergraph with all small molecules removed (Fig. 3B).

### 3.3 Pathway influence across Reactome

While the entire Reactome pathway database appears to be poorly connected in the hypergraph representation, this determination comes from treating individual nodes as sources. We wished to leverage Reactome’s pathway annotations to understand how *pathways* are connected in the hypergraph according to *B*-relaxation distance. We identified 34 signaling pathways in Reactome (Supplementary Table S1) and considered the relationship between pairs of pathways within the hypergraph. When we computed the overlap of the members within each pair of pathways, we found that some pathway pairs already shared nearly all their members (Fig. 4A). For example, the normalized overlap between DAG/IP3 signaling and GPCR signaling is 0.9; DAG and IP3 are second messengers in the phosphoinositol pathway, which is activated by GPCRs. The next largest scores are 0.62 and 0.73 between Insulin Receptor signaling and Insulin-like Growth Factor 1 Receptor (IGF1R) signaling. Other growth factor pathways have moderate overlap (e.g., the overlaps among EGFR, ERBB2, and ERBB range from 0.24 to 0.32).

Our aim is to quantify how well a source pathway *S* can reach a target pathway *T* by finding pathway pairs where *T* is “downstream” of *S*. Since we wish to find a directed relationship between pathways, we should ignore the initial overlap between their member sets *P*_{S} and *P*_{T}. Thus, we developed a score that measures how many additional members of *T* may be reached when computing the *B*-relaxation distance from *S*, after accounting for the initial overlap and the total number of elements that are reached from *S*. We defined the *influence score s*_{k}(*S, T*) of the source pathway *S* on target pathway *T* for *B*-relaxation distance up to *k* as follows:

$$s_k(S, T) = \frac{\left| \left( B_{\le k}(P_S) \cap P_T \right) \setminus P_S \right|}{\left| B_{\le k}(P_S) \setminus \left( P_S \cap P_T \right) \right|}$$

This score makes use of the *pathway overlap* between *S* and *T* (*P*_{S} *∩ P*_{T}). The numerator counts the number of nodes in *T* that are reached in the set *B*_{≤k}(*P*_{S}) that are not already in *P*_{S}. The denominator counts the total number of nodes that are reached in *B*_{≤k}(*P*_{S}) that are not in the pathway overlap. Pathway pairs with a large initial overlap are penalized in this score, allowing more subtle patterns to emerge. Moreover, this score penalizes a pathway *P*_{S} that reaches many nodes indiscriminately.
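The numerator and denominator described above translate directly into set operations. The sketch below is an illustration under stated assumptions: the function name `influence_score` is hypothetical, `reached` stands for the set *B*_{≤k}(*P*_{S}) computed elsewhere, and returning 0 when the denominator is empty is a convention chosen here, not one stated in the text.

```python
def influence_score(reached, p_s, p_t):
    """Influence of source pathway S on target pathway T.

    reached : nodes with B-relaxation distance <= k from P_S
    p_s     : members of the source pathway S
    p_t     : members of the target pathway T
    """
    overlap = p_s & p_t
    # Members of T that are reached but were not already in S.
    numerator = len((reached & p_t) - p_s)
    # Everything reached, excluding the initial pathway overlap.
    denominator = len(reached - overlap)
    return numerator / denominator if denominator else 0.0

# Toy example with made-up members.
p_s = {"a", "b", "c"}
p_t = {"c", "d", "e", "f"}
reached = {"a", "b", "c", "d", "e", "x"}
print(influence_score(reached, p_s, p_t))  # 2 new T members out of 5 counted nodes
```

In the toy example the shared member `c` is excluded from both counts, so the score reflects only the two newly reached members of *T* (`d` and `e`) relative to the five nodes reached outside the overlap.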

We computed *s*_{k} for every pair of Reactome signaling pathways for *k* = 0, 1, 2, *…* (Fig. 4B). As *k* increases, a few pathway pairs exhibit a peak influence score around *k* = 3, including the largest computed influence score across all values of *k* (red box). There are three pairs that exhibit a large influence score for *k* = 3 (Fig. 4B): (a) the Mst1 pathway’s influence on MET signaling (*s*_{3} = 0.79), (b) the Activin pathway’s influence on TGF*β* signaling (*s*_{3} = 0.54), and (c) the BMP pathway’s influence on TGF*β* signaling (*s*_{3} = 0.48). We discuss these pathway pairs in two case studies: Mst1 and MET signaling, followed by Activin/BMP and TGF*β* signaling.

#### 3.3.1 Mst1 pathway influence on MET signaling

Using Macrophage-stimulating Protein 1 (Mst1) as the source pathway *S*, we computed the overlap of the other 33 pathways with *B*_{≤k} as *k* increases (Fig. 5). The largest influence score that we observed across all pathway pairs was 0.79 at *k* = 3 for Mst1 to MET signaling, which indicates that almost all the nodes downstream of Mst1 for *k* = 3 are MET pathway members. For *k* = 10, the set *B*_{≤k} contains many ERK1/ERK2 or PI3K/AKT pathway members; however, they comprise a relatively small portion of the total number of nodes in *B*_{≤k}.

Fig. 5 suggested that the Mst1 pathway may influence the MET pathway. An inspection of the literature and the topology of the nodes in *B*_{≤k} from the Mst1 pathway as the source lent support to this hypothesis. Mst1 is produced in the liver and is involved in organ size regulation [33, 34]. Mst1 acts like a hepatocyte growth factor and has been established as a tumor suppressor gene for hepatocellular carcinoma [34]. MET, also known as hepatocyte growth factor (HGF) receptor, is a receptor tyrosine kinase that promotes tissue growth in development, wound healing, and cancer metastasis [35]. Mst1, on the other hand, binds to Mst1R (also known as RON), which is a member of the MET family. Both MET and Mst1R have been shown to have similar downstream effects and can trans-phosphorylate when active [36]. Upon inspection of the reactions that involved the nodes in *B*_{≤3}, we found that Hepsin (HPN) was involved in forming both the Mst1 dimer and HGF dimer (Fig. 6). This protease is known to cleave both pro-Mst1 and pro-HGF into active Mst1 and HGF [37]. The hypergraph also emphasizes the fact that the nodes that are in *B*_{≤k} but are not in the MET pathway involve STAT regulation in different cellular compartments. The computed pathway influence (observed as an enrichment of stars in Fig. 6 in the regions named *B*_{0}, *B*_{1}, *B*_{2}, and *B*_{3}) is due to HPN’s role in activating the ligands responsible for both Mst1 signaling and MET signaling. Fig. 6 also displays the nodes in *B*_{4}. The high prevalence of nodes that are not in the MET pathway (circles) in this region reinforces the fact that the influence of the Mst1 pathway on the MET pathway is the largest for *k* = 3.

#### 3.3.2 Activin and BMP influence on TGF*β* signaling

Following the influence score for Mst1 and MET pathways, the next three largest scores across all pathway pairs and all values of *k* were for the Activin pathway on TGF*β* signaling (*s*_{2} = 0.58, *s*_{3} = 0.54) and the Bone Morphogenetic Protein (BMP) pathway on TGF*β* signaling (*s*_{3} = 0.48). The pattern of *s*_{k} values for Activin and TGF*β* was strikingly similar to the trend for BMP and TGF*β*; for both Activin and BMP, TGF*β* was the only target pathway that received a large influence (Fig. 7). Even though Activin, BMP, and TGF*β* are all known ligands of the TGF*β* superfamily, our analysis demonstrates that the Activin and BMP pathways are upstream of the TGF*β* pathway. The TGF*β* superfamily regulates processes involved in proliferation, growth, and differentiation through both SMAD-dependent and SMAD-independent signaling [38]. TGF*β*, Activin, and BMP phosphorylate different SMAD proteins by forming dimers and binding to receptor serine/threonine kinases. TGF*β* binds to TGF*β* Receptor II (TGFBR2), which forms a heterodimer with TGFBR1 and activates SMAD2 and SMAD3. Activin also phosphorylates SMAD2 and SMAD3 through binding and activation of the Activin A receptor (ACVR). BMP, on the other hand, phosphorylates SMAD1, SMAD5, and SMAD8 through BMP receptor activation. The hypergraph that shows the nodes in *B*_{≤3} from Activin consists of different components and many cycles that denote reuse of SMADs (Supplementary Fig. S3). The hypergraph suggests that the influence of Activin on TGF*β* does not begin at the ligand, but rather at the activation of SMAD proteins.

## 4 Discussion

Connectivity is a foundational concept in cellular reaction networks, since it lies at the heart of determining the effect of one molecule upon another. The formal definition of connectivity is familiar and straightforward in directed graphs, the most common mathematical representation of reaction networks. However, precisely capturing this concept is challenging in more sophisticated and biologically accurate representations such as compound graphs, bipartite graphs, and directed hypergraphs. In recent years, scientists have developed these definitions independently for each of these representations.

This work is the first to systematically compare the relevant formulations of connectivity in four different models of reactions in signaling pathways. We study their impact on the Reactome database. Our striking finding is that the directed graph representation of Reactome is very highly connected (nearly 90% of the nodes reach over 80% of the graph), the compound and bipartite graph versions are somewhat less connected (two-thirds of the nodes in the compound graph are connected to about half the nodes and half the nodes in the bipartite graph reach about 40% of the nodes), whereas the directed hypergraph model exhibits very poor connectivity (only five nodes are connected to more than 20 nodes).

We attribute this trend to multiple, related factors. The SIF format for Reactome, from which we construct the directed graph, does not distinguish between modified forms of a protein and represents complexes as cliques. Compound graphs, bipartite graphs, and directed hypergraphs create a node for each form of a protein and for each protein complex. However, compound and bipartite graphs are much more connected than hypergraphs since they record multi-way reactions using multiple, independent edges. Directed hypergraphs accurately represent reactions, but their biologically-meaningful definition of connectivity (*B*-connectivity) is very restrictive in practice.

Motivated by these findings, we have provided a relaxed version of hypergraph connectivity, *B*-relaxation distance, that is tailored for the analysis of signaling pathways. *B*-relaxation distance takes the intuitive mechanical significance of *B*-connectivity and grants it the leeway necessary to deal with the challenges presented by the topologies of biomolecular hypergraphs. We show that *B*-relaxation distance elegantly bridges the gap between bipartite graphs and hypergraphs.

We use *B*-relaxation distance to identify downstream influence between annotated pathways in Reactome, defining an influence score *s*_{k} that suggests how much a target pathway *T* might be influenced by the downstream effects of a source pathway *S*. After performing an all-vs-all comparison across 34 Reactome pathways, we demonstrate the ability of *B*-relaxation distance to capture points of influence in two case studies: (a) the effect of the Mst1 pathway on MET signaling and (b) the role of Activin and BMP pathways on TGF*β* signaling. Visualizing the hypergraph that contains nodes with small *B*-relaxation distance can pinpoint the exact reaction or reactions responsible for the influence of one pathway on another. While our findings are not biologically novel, they demonstrate how researchers may explore Reactome in a systematic, unbiased manner to identify possible points of influence among pathways. As pathway databases such as Reactome continue to expand, *B*-relaxation distance will become a useful measure for systematically characterizing connectivity and relationships among annotated pathways.

Our algorithm for *B*-relaxation distance runs in polynomial time, and is efficient in practice. However, using directed hypergraphs to solve other computational problems can come with additional algorithmic challenges. For example, the shortest path problem on graphs is widely known to be solvable in polynomial time, while the analogous problem on directed hypergraphs is NP-complete [11, 28], even when bounding the number of nodes in the tail and head sets [29]. These challenges invite the generalization of other classic graph algorithms that have been used in biological applications to directed hypergraphs; in fact, random walks [39] and spectral clustering [40] have already been developed for directed hypergraphs with applications to other fields.

## 5 Methods

### 5.1 Connectivity measures

Given a pathway and two entities *a* and *b*, we wish to ask a fundamental connectivity question: “is *a* downstream of *b*?” The answer to this question in directed graphs can be efficiently computed using a traversal algorithm such as breadth first search. Established connectivity measures on compound graphs [25] and hypergraphs [28] generalize breadth-first traversal. We begin with hypergraph connectivity and then describe our proposed relaxation to this measure, which is the main computational contribution in this work. We then describe another version of connectivity for compound graphs, which lies conceptually between graph connectivity and hypergraph connectivity.

#### 5.1.1 Hypergraph connectivity

A directed hypergraph ℋ = (*V,* ℰ) contains a set *V* of nodes and a set ℰ of *hyperedges*, where a hyperedge *e* = (*T*_{e}, *H*_{e}) *∈* ℰ consists of a tail set *T*_{e} *⊆ V* and a head set *H*_{e} *⊆ V* of nodes [28]. The *cardinality* of hyperedge *e* is the total number of nodes in its tail and head, i.e., *|T*_{e}*|* + *|H*_{e}*|*. Note that directed graphs are a special case of directed hypergraphs where *|T*_{e}*|* = *|H*_{e}*|* = 1 for each hyperedge *e*. In a directed graph, the set of nodes connected to some source *s* is simply all nodes that are reachable via a path from *s*. The equivalent notion in a directed hypergraph is *B*-connectivity. Given a set of nodes *S ⊆ V*, *B*-connectivity ensures the property that traversing a hyperedge *e ∈* ℰ requires that all the nodes in *T*_{e} are connected to *S*. The following definition is adapted from Gallo et al. [28]:

Given a directed hypergraph ℋ= (*V,* ℰ) and a source set *S ⊆ V*, a node *u ∈ V* is *B***-connected** to *S* if either (a) *u ∈ S* or (b) there exists a hyperedge *e* = (*T*_{e}, *H*_{e}) where *u ∈ H*_{e} and each element in *T*_{e} is *B*-connected to *S*. We use *B*(ℋ, *S*) to denote the set of nodes that are *B*-connected to *S* in ℋ.

We can compute *B*(ℋ, *S*) using a hypergraph traversal [28]. This traversal works by finding hyperedges that have tails whose nodes are all *B*-connected to *S*, augmenting the set of *B*-connected nodes with the nodes in the heads of these hyperedges, and repeating this process until it does not discover any new nodes. The running time of this algorithm is linear in the size of ℋ.
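The traversal just described can be sketched with a per-hyperedge counter of tail nodes that are not yet connected. This is an illustrative implementation under stated assumptions, not the code from [28]: hyperedges are given as `(tail, head)` pairs of sets, every tail is assumed to be non-empty, and the name `b_connected` is chosen for this example.

```python
from collections import deque

def b_connected(hyperedges, sources):
    """Return the set of nodes that are B-connected to `sources`."""
    # remaining[i] = number of tail nodes of hyperedge i not yet connected
    remaining = [len(tail) for tail, _ in hyperedges]
    # uses[u] = indices of hyperedges that have u in their tail
    uses = {}
    for i, (tail, _) in enumerate(hyperedges):
        for u in tail:
            uses.setdefault(u, []).append(i)

    connected = set(sources)
    queue = deque(connected)
    while queue:
        u = queue.popleft()
        for i in uses.get(u, ()):
            remaining[i] -= 1
            if remaining[i] == 0:          # whole tail connected: traverse
                for v in hyperedges[i][1]:
                    if v not in connected:
                        connected.add(v)
                        queue.append(v)
    return connected

# Toy hypergraph: A + B -> C, C -> D, A -> E
hyperedges = [({"A", "B"}, {"C"}), ({"C"}, {"D"}), ({"A"}, {"E"})]
print(b_connected(hyperedges, {"A"}))        # only E is reachable without B
print(b_connected(hyperedges, {"A", "B"}))   # now C and D are reachable too
```

Each node is dequeued once and each tail membership is examined once, so the work is proportional to the sum of the hyperedge cardinalities, in line with the linear running time stated above.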

#### 5.1.2 Parameterized hypergraph connectivity

While *B*-connectivity is a biologically useful notion of connectivity, it is overly restrictive for the purpose of assessing the connectivity of pathway databases. We establish a relaxation of *B*-connectivity which works around such restrictions. Before we formally define *B*-relaxation distance, we distinguish different sets of hyperedges based on their association with the source set *S* (Fig. 8).

Given a hypergraph ℋ = (*V,* ℰ) and a source set *S ⊆ V*, a hyperedge *e* = (*T*_{e}, *H*_{e}) is **reachable** from *S* if at least one element of *T*_{e} is *B*-connected to *S*.

Given a hypergraph ℋ = (*V,* ℰ) and a source set *S ⊆ V*, a hyperedge *e* = (*T*_{e}, *H*_{e}) is **traversable** from *S* if all elements of *T*_{e} are *B*-connected to *S*.

Given a hypergraph ℋ = (*V,* ℰ) and a source set *S ⊆ V*, a hyperedge *e* is **restrictive** (with respect to *S*) if it is reachable but not traversable from *S*. We use *R*(ℋ, *S*) to denote the set of restrictive hyperedges.

We modify the `b_visit()` algorithm from [28] to return the *B*-connected set *B*(ℋ, *S*) and the restrictive hyperedges *R*(ℋ, *S*) (Algorithm 1). The main difference between this traversal and a typical BFS is that a hyperedge is traversed only when all the nodes in the tail have been visited. We also return the set of traversed hyperedges to avoid redundant computation in the relaxation algorithm that we describe later.
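A minimal sketch of such a modified traversal is shown below. It is written from the prose description, not from Algorithm 1 itself, so details may differ: hyperedges are `(tail, head)` pairs of sets with non-empty tails, and restrictive and traversed hyperedges are reported by their index in the input list, which is a convention assumed here.

```python
def b_visit(hyperedges, sources):
    """Return (B-connected nodes, restrictive hyperedge ids, traversed ids)."""
    remaining = [len(tail) for tail, _ in hyperedges]  # unvisited tail nodes
    uses = {}                                          # node -> hyperedges using it in a tail
    for i, (tail, _) in enumerate(hyperedges):
        for u in tail:
            uses.setdefault(u, []).append(i)

    connected = set(sources)
    stack = list(connected)
    traversed = set()
    while stack:
        u = stack.pop()
        for i in uses.get(u, ()):
            remaining[i] -= 1
            if remaining[i] == 0:          # traversable: the whole tail is connected
                traversed.add(i)
                for v in hyperedges[i][1]:
                    if v not in connected:
                        connected.add(v)
                        stack.append(v)
    # Restrictive: at least one, but not all, tail nodes are connected.
    restrictive = {i for i, (tail, _) in enumerate(hyperedges)
                   if 0 < remaining[i] < len(tail)}
    return connected, restrictive, traversed

# Toy hypergraph: A + B -> C, C -> D, A -> E, D + F -> G
hyperedges = [({"A", "B"}, {"C"}), ({"C"}, {"D"}),
              ({"A"}, {"E"}), ({"D", "F"}, {"G"})]
connected, restrictive, traversed = b_visit(hyperedges, {"A"})
print(connected, restrictive, traversed)
```

From source `A`, hyperedge 2 is traversed, hyperedge 0 is restrictive because `B` is missing, and hyperedges 1 and 3 are neither because none of their tail nodes is connected.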

We iteratively relax the notion of *B*-connectivity by allowing restrictive hyperedges to be traversed; to do so, at each iteration *k* we need to keep track of *B*_{k}(ℋ, *S*), the connected nodes, and *R*_{k}(ℋ, *S*), the restrictive hyperedges. We initialize these sets to be the outputs of `b_visit()`:

$$B_0(\mathcal{H}, S) = B(\mathcal{H}, S), \qquad R_0(\mathcal{H}, S) = R(\mathcal{H}, S).$$

In the *k*th iteration of this relaxation process, we consider the heads of each restrictive hyperedge *e* from the previous iteration. *B*_{k}(ℋ, *S*) is the set of *B*-connected nodes and *R*_{k}(ℋ, *S*) is the set of restrictive hyperedges for each head set from *R*_{k-1}(ℋ, *S*):

$$B_k(\mathcal{H}, S) = \bigcup_{e \in R_{k-1}(\mathcal{H}, S)} B(\mathcal{H}, H_e), \qquad R_k(\mathcal{H}, S) = \bigcup_{e \in R_{k-1}(\mathcal{H}, S)} R(\mathcal{H}, H_e).$$

Note that computing *R*_{k}(ℋ, *S*) using this definition requires |*R*_{k-1}(ℋ, *S*)| different `b_visit()` calls, which is necessary to ensure that only one restrictive hyperedge is used to establish connectivity. With these definitions in hand, we are now ready to define our relaxation of *B*-connectivity.

Given a hypergraph ℋ = (*V,* ℰ), a source set *S ⊆ V*, and an integer *k ≥* 0, a node *v ∈ V* is *B*_{k}**-connected** to *S* if *v ∈ B*_{i}(ℋ, *S*) for some *i ∈* {0, 1, *…, k*}.

The *B-relaxation distance* of a node *v* from a source set *S* is the smallest value of *k* such that *v* is *B*_{k}-connected to *S* in ℋ. In the main text, we use *B*_{≤k} to denote the *B*_{k}-connected set. An example of computing *B*-relaxation distance for all nodes in a hypergraph is shown in Fig. 9.

We calculate the *B*-relaxation distance from *S* to every node in the hypergraph by calling `b_visit()` on restrictive hyperedges for *k* = 0, 1, 2, *…* (Algorithm 2). The algorithm first calls `b_visit()` from *S* to get the *B*-connected set *B*_{0}, the restrictive hyperedges *R*_{0}, and the traversed hyperedges *X* (line 1). The *B*-relaxation distance dictionary `dist` is initialized to 0 for nodes in *B*_{0} and infinity otherwise, and the `seen` dictionary of hyperedges is set to `True` for hyperedges that have been traversed and `False` otherwise. While there are unseen restrictive hyperedges to traverse, the algorithm computes *B*_{k} and *R*_{k} by calling `b_visit()` on the heads of each restrictive hyperedge from iteration *k* − 1 (lines 7–11). We update the `seen` dictionary with all traversed hyperedges from each `b_visit()`, since these hyperedges may be restrictive with respect to another set of nodes and would otherwise be recomputed at a later iteration (lines 12–13, Supplementary Fig. S5). Finally, the algorithm updates the `dist` dictionary for all nodes that are reached in the *k*-th iteration and increments *k* (lines 14–17). This implementation keeps track of *B*_{0}, *B*_{1}, *…, B*_{k} and *R*_{0}, *R*_{1}, *…, R*_{k}, which may be returned for other purposes.
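The procedure described in this paragraph can be sketched as follows. The code is reconstructed from the prose and is not a transcription of Algorithm 2, so line numbers and bookkeeping details will differ. Hyperedges are assumed to be `(tail, head)` pairs of sets with non-empty tails and are identified by list index; `seen` is a list rather than a dictionary for brevity.

```python
import math

def b_visit(hyperedges, sources):
    """Return (B-connected nodes, restrictive hyperedge ids, traversed ids)."""
    remaining = [len(tail) for tail, _ in hyperedges]
    uses = {}
    for i, (tail, _) in enumerate(hyperedges):
        for u in tail:
            uses.setdefault(u, []).append(i)
    connected = set(sources)
    stack, traversed = list(connected), set()
    while stack:
        u = stack.pop()
        for i in uses.get(u, ()):
            remaining[i] -= 1
            if remaining[i] == 0:
                traversed.add(i)
                for v in hyperedges[i][1]:
                    if v not in connected:
                        connected.add(v)
                        stack.append(v)
    restrictive = {i for i, (tail, _) in enumerate(hyperedges)
                   if 0 < remaining[i] < len(tail)}
    return connected, restrictive, traversed

def b_relaxation(hyperedges, sources):
    """Return a dict mapping every node to its B-relaxation distance."""
    nodes = set(sources)
    for tail, head in hyperedges:
        nodes |= set(tail) | set(head)
    connected, frontier, traversed = b_visit(hyperedges, sources)
    dist = {v: (0 if v in connected else math.inf) for v in nodes}
    seen = [i in traversed for i in range(len(hyperedges))]
    k = 1
    while any(not seen[i] for i in frontier):
        reached, next_frontier = set(), set()
        for i in frontier:
            if seen[i]:
                continue
            seen[i] = True                      # freely traverse hyperedge i
            b, r, x = b_visit(hyperedges, hyperedges[i][1])
            reached |= b
            next_frontier |= r
            for j in x:                         # avoid recomputing these later
                seen[j] = True
        for v in reached:
            if dist[v] == math.inf:             # record first discovery only
                dist[v] = k
        frontier = next_frontier
        k += 1
    return dist

# Toy hypergraph: A + B -> C, C -> D, D + E -> F, A -> G
hyperedges = [({"A", "B"}, {"C"}), ({"C"}, {"D"}),
              ({"D", "E"}, {"F"}), ({"A"}, {"G"})]
dist = b_relaxation(hyperedges, {"A"})
print(dist)
```

In this example `G` has distance 0, `C` and `D` have distance 1 (one freely traversed hyperedge), `F` has distance 2, and `B` and `E` remain at infinity because no hyperedge produces them.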

##### Runtime analysis

The original `b_visit()` from Gallo et al. runs in 𝒪(*size*(ℋ)) time, where *size*(ℋ) refers to the sum of the hyperedge cardinalities in ℋ [28]. The modified `b_visit()` incurs no additional asymptotic runtime cost, since the time taken by the additional operations it conducts (Algorithm 1, lines 13–16) is bounded by *|*ℰ*|*, which is in turn bounded by *size*(ℋ).

In Algorithm 2, initializing the `dist` and `seen` dictionaries takes *|V|* and *|*ℰ*|* time, respectively. The while loop (line 5) contains two for loops. The first loop in line 7 iterates over all restrictive hyperedges, performing work only when that hyperedge has not been previously traversed. Thus, the code in the first loop will be executed at most *|*ℰ*|* times over the full course of the algorithm, corresponding to the case where every hyperedge in ℋ appears in some restrictive set. The first loop calls `b_visit()` in line 9 at each iteration, which runs in 𝒪 (*size*(ℋ)) time as previously mentioned. The second loop in line 14 updates the *B*-relaxation distance of each node exactly once, when it is first discovered by the algorithm. It will be executed at most *|V|* times over the full course of the algorithm. The running time of the first loop (line 7) dominates those of the initialization steps and the distance update loop; thus, the runtime of Algorithm 2 is 𝒪 (*|*ℰ*| · size*(ℋ)).
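The accounting in this paragraph can be written as a single sum. The final simplification assumes that *|V|* and *|*ℰ*|* are each at most *|*ℰ*| · size*(ℋ), which holds, for example, when every node appears in at least one hyperedge.

```latex
T_{\text{Alg.\,2}}
  \;=\; \underbrace{\mathcal{O}(|V|) + \mathcal{O}(|\mathcal{E}|)}_{\text{initialize \texttt{dist}, \texttt{seen}}}
  \;+\; \underbrace{|\mathcal{E}| \cdot \mathcal{O}\bigl(\mathit{size}(\mathcal{H})\bigr)}_{\text{loop at line 7}}
  \;+\; \underbrace{\mathcal{O}(|V|)}_{\text{loop at line 14}}
  \;=\; \mathcal{O}\bigl(|\mathcal{E}| \cdot \mathit{size}(\mathcal{H})\bigr)
```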

##### Pre-processing speedup

When we ran `b_relaxation()` on each source node on the Reactome hypergraph, the algorithm took an average of 31.6 seconds per node on a Linux machine with quad Intel Core i7-4790 processors. The quadratic runtime is tractable for a handful of calls, but calling `b_relaxation()` from every vertex in *V* (as we do in this work) results in a cubic runtime. We formulated an optimized version of `b_relaxation()`, which we initialize by calling `b_visit()` on *H*_{e} for each *e ∈* ℰ and recording the resulting connected nodes and restrictive hyperedges. This initialization step incurs a one-time cost of 𝒪 (*|*ℰ*| · size*(ℋ)), but replaces the call to `b_visit()` in line 9 with a constant-time lookup operation. Thus the sole quadratic term in the runtime of Algorithm 2 becomes linear in the optimized version. The optimized version, when applied to each source node on the Reactome hypergraph, ran in an average of 0.310 seconds per node, an improvement of two orders of magnitude.
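The pre-processing idea can be sketched as a lookup table keyed by hyperedge. This is a sketch under stated assumptions, not the authors' optimized code: it assumes hyperedges are `(tail, head)` pairs of frozensets and that the caller supplies some function `b_visit(hyperedges, sources)` returning `(connected, restrictive, traversed)`. The names `precompute_head_visits` and `b_relaxation_cached` are hypothetical.

```python
def precompute_head_visits(hyperedges, b_visit):
    """Run b_visit once from the head of every hyperedge and store the result.

    This is the one-time initialization cost; afterwards each relaxation
    step is a dictionary lookup instead of a fresh traversal.
    """
    return {e: b_visit(hyperedges, head)
            for e, (_tail, head) in enumerate(hyperedges)}

def b_relaxation_cached(hyperedges, sources, b_visit, cache):
    """B-relaxation distances from `sources`, using precomputed head visits.

    Returns a dict of reached nodes only; absent nodes are unreachable
    (use dist.get(v, float("inf")) to read a distance).
    """
    connected, restrictive, traversed = b_visit(hyperedges, sources)
    dist = {v: 0 for v in connected}
    seen = set(traversed)
    k = 1
    while any(e not in seen for e in restrictive):
        next_restrictive = set()
        for e in restrictive:
            if e in seen:
                continue
            seen.add(e)
            b, r, x = cache[e]          # lookup replaces the b_visit call
            for v in b:
                dist.setdefault(v, k)   # keep the first (smallest) distance
            next_restrictive |= r
            seen |= x
        restrictive = next_restrictive
        k += 1
    return dist
```

The cache is independent of the source set, so it can be built once and reused for every source node; only the initial traversal from *S* still calls `b_visit()` directly.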

#### 5.1.3 Compound graph connectivity

There are multiple definitions of compound graphs [8, 25]. Here we describe *compound pathway graphs CP* = (*G, I*) that consist of two graphs [25]. The pathway graph *G* = (*V, E*_{G}) is a mixed graph where *V* denotes the set of nodes and *E*_{G} denotes the interaction and regulation edges among nodes, some of which may be directed.^{2} The inclusion graph *I* = (*V, E*_{I}) is on the same node set *V* and *E*_{I} denotes the undirected edges for defining compound structure membership (e.g., complexes and abstractions). To traverse a compound pathway graph, we need, for each compound structure, two flags: (a) `compound`: if a compound structure is reached, are all its members also reached? and (b) `member`: if a member of a compound structure is reached, are all other members in the compound structure also reached? During the traversal, once a node *u* is reached, the algorithm determines if any other nodes are “equivalent” to *u* based on these flags. Note that while compound graphs handle traversals through entities such as protein complexes and families, the edges only connect pairs of these entities. Thus, the requirements imposed by *B*-connectivity on hypergraphs cannot be implemented on compound graphs as they are currently defined.

A *compound path* between two nodes consists of edges that are either from the pathway graph *E*_{G} or represent a link between nodes that are equivalent for traversal based on the `compound` and `member` flags. These compound paths are used to establish the set of nodes that are downstream of a source node. For comparison with other measures, we modify the definition from [25] to ignore activation/inhibition effects and remove a restriction on path lengths:

Given a compound pathway graph *CP* = (*G, I*) and a source set *S ⊆ V*, a node *u ∈ V* is **downstream** of *S* in *CP* if there exists some compound path from any node *s ∈ S* to *u* in *CP*.
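To make the role of the two flags concrete, the following sketch computes a downstream set by breadth-first search. It is not the PaxTools DOWNSTREAM implementation. It assumes pathway edges are given as a directed adjacency dictionary (an undirected edge would be listed in both directions), that `members` maps each compound structure to its member nodes, and that `flags` maps each compound structure to a `(compound, member)` pair of booleans. All function and parameter names are hypothetical.

```python
from collections import deque

def downstream(pathway_edges, members, flags, sources):
    """Nodes reachable from `sources` by compound paths (sketch).

    pathway_edges: dict node -> iterable of successor nodes (E_G)
    members:       dict compound node -> set of member nodes (E_I)
    flags:         dict compound node -> (compound_flag, member_flag)
    """
    reached = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        nxt = set(pathway_edges.get(u, ()))
        # `compound` flag: reaching a compound structure reaches all its members.
        if u in members and flags[u][0]:
            nxt |= members[u]
        # `member` flag: reaching one member reaches the other members.
        for c, ms in members.items():
            if u in ms and flags[c][1]:
                nxt |= ms
        for v in nxt:
            if v not in reached:
                reached.add(v)
                queue.append(v)
    return reached
```

For example, with a complex `C` containing `A` and `B`, pathway edges `A -> X` and `C -> Y`, and only the `compound` flag set for `C`, starting from `C` reaches `Y` directly and `A`, `B`, and then `X` through the inclusion graph.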

We run the DOWNSTREAM algorithm implemented in the PaxTools software [25, 41] on each source node in *S*, ignoring activation/inhibition sign and the path length limit.

### 5.2 Data formats and representations

We automatically generate the four Reactome representations – directed graph, compound graph, bipartite graph, and hypergraph – using a suite of tools (Fig. 10). We use PathwayCommons, a unified collection of publicly available pathway data [19], to collect BioPAX and SIF files representing the entire Reactome database (http://www.pathwaycommons.org/archives/PC2/v10/). The SIF files are generated by PathwayCommons by converting BioPAX relationships to binary relations; more details are available at http://www.pathwaycommons.org/pc2/formats. We convert the SIF files to a directed graph by converting each binary relation to a directed or bidirected edge (Supplementary Table S2).
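The SIF-to-graph conversion can be sketched as follows. The sketch assumes each SIF line has three tab-separated fields (source, relation type, target). The set of relation types treated as bidirected below is purely illustrative; the mapping actually used is the one in Supplementary Table S2, which is not reproduced here. The function name is hypothetical.

```python
# Illustrative only: relation types treated as bidirected in this sketch.
# The authors' actual mapping is given in their Supplementary Table S2.
UNDIRECTED_RELATIONS = {"in-complex-with", "interacts-with", "neighbor-of"}

def sif_to_directed_edges(lines, undirected=UNDIRECTED_RELATIONS):
    """Convert SIF lines ('source<TAB>relation<TAB>target') to directed edges.

    A relation listed in `undirected` contributes an edge in both directions.
    """
    edges = set()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        u, relation, v = parts
        edges.add((u, v))
        if relation in undirected:
            edges.add((v, u))
    return edges
```

Passing an open file handle as `lines` works as well, since the function only iterates over its input.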

We use the PaxTools Java parser to work with BioPAX files [41]. PaxTools offers querying algorithms such as `DOWNSTREAM` that operate on the compound graph representation [25]. We use PaxTools to construct hypergraphs by traversing the BioPAX files. For each biochemical reaction in BioPAX, we construct a hyperedge with the reactants and control elements in the tail and the products in the head. We use the algorithms provided in the Hypergraph Algorithms Package (HALP, http://murali-group.github.io/halp/) to work with hypergraphs. The *B*-relaxation distance algorithm is provided in a developmental branch of HALP. Finally, we build the bipartite graph directly from the hypergraph, converting each hyperedge *e* into a reaction node *r* and connecting the tails of *e* to *r* and then *r* to the heads of *e*. Thus, the number of nodes in the bipartite graph is exactly the number of nodes plus the number of hyperedges in the hypergraph, and a large *B*-relaxation distance corresponds to traversing the bipartite graph.
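The two constructions in this paragraph are simple enough to sketch directly. The code below uses plain Python sets rather than the PaxTools or HALP APIs, and the function names and the `("reaction", i)` node labels are hypothetical; it assumes no existing node shares such a label.

```python
def reaction_to_hyperedge(reactants, controllers, products):
    """Hyperedge for one reaction: reactants and controllers in the tail, products in the head."""
    return frozenset(reactants) | frozenset(controllers), frozenset(products)

def hypergraph_to_bipartite(nodes, hyperedges):
    """Replace each hyperedge e by a reaction node r with edges tail -> r -> head.

    Returns (bipartite nodes, bipartite directed edges).
    """
    bip_nodes = set(nodes)
    bip_edges = set()
    for i, (tail, head) in enumerate(hyperedges):
        r = ("reaction", i)  # one new node per hyperedge
        bip_nodes.add(r)
        bip_edges.update((t, r) for t in tail)
        bip_edges.update((r, h) for h in head)
    return bip_nodes, bip_edges
```

The node count of the result is the number of hypergraph nodes plus the number of hyperedges, as stated above. Reachability in the bipartite graph only requires one tail node of a reaction to be reached, which is the sense in which it relaxes *B*-connectivity.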

We visualize hypergraphs using GraphSpace [42], a web-based collaborative network visualization tool. The hypergraphs are available as interactive networks on GraphSpace with the `GLBio2019` tag (http://graphspace.org/graphs/?query=tags:glbio2019).

## Acknowledgments

We thank Brendan Avent for his initial work on the hypergraph algorithms library and Ozgun Babur for discussions about BioPAX and PaxTools. This work is supported by the National Science Foundation under grants DBI-1750981 (to PI AR) and CCF-1617678 (to PI TMM).

## Footnotes

* Current Affiliation: Department of Computer Science, University of Maryland College Park, College Park, MD, US

^{1}In the algorithm we drop the parameterization of the hypergraph and source set to declutter notation.

^{2}Edges may also denote inhibition/activation; here, we ignore this aspect of the compound graph.