Crosstalkr: An open-source R package to facilitate drug target identification

In the last few decades, interest in graph-based analysis of biological networks has grown substantially. Protein-protein interaction networks are one of the most common biological networks, and represent the molecular relationships between every known protein and every other known protein. Integration of these interactomic data into bioinformatic pipelines may increase the translational potential of discoveries made through analysis of multi-omic datasets. Crosstalkr provides a unified toolkit for drug target and disease subnetwork identification, two of the most common uses of protein protein interaction networks. First, crosstalkr enables users to download and leverage high-quality protein-protein interaction networks from online repositories. Users can then filter these large networks into manageable subnetworks using a variety of methods. For example, network filtration can be done using random walks with restarts, starting at the user-provided seed proteins. Affinity scores from a given random walk with restarts are compared to a bootstrapped null distribution to assess statistical significance. Random walks are implemented using sparse matrix multiplication to facilitate fast execution. Next, users can perform in-silico repression experiments to assess the relative importance of nodes in their network. At this step, users can supply protein or gene expression data to make node rankings more meaningful. The default behavior evaluates the human interactome. However, users can evaluate more than 1000 non-human protein-protein interaction networks as a result of integration with StringDB. It is a free, open-source R package designed to allow users to integrate functional analysis using the protein-protein interaction network into existing bioinformatic pipelines. A beta version of crosstalkr available on CRAN (https://cran.rstudio.com/web/packages/crosstalkr/index.html).

PPI filtration is typically performed using a walk-based algorithm or through integration with additional phenotypic data 24 such as gene expression 4,11 . One of the most well-studied algorithms in this context is random walk with restarts. Random 25 walk with restarts (RWR) has been used and adapted across disciplines and industries for applications ranging from internet 26 search engines to drug target identification [18][19][20] . There are a growing suite of tools available in R for analyzing graph-structured 27 data 21,22 , including a few R packages that implement RWR 23,24 . These tools require some understanding of graph data 28 structures and ask the user to find, download, and manipulate the relevant biological networks into adjacency matrices or 29 igraph objects. In this paper, we will describe crosstalkr, a free, open-source R software package. Crosstalkr provides a 30 streamlined interface to facilitate all three of the most common steps of an interactomic analysis. Users can interface with 31 online PPI repositories, prune or filter the resulting networks, and rank nodes based on a variety of graph-based scoring methods. 32 In addition, crosstalkr is optimized to facilitate one-line implementation of an RWR-based algorithm designed to identify 33 functional subgraphs of protein-protein interaction networks (PPI) 11 . In addition to providing a clean interface for non-graph 34 theorists, crosstalkr improves upon existing tools by implementing methods to facilitate in silico repression. Crosstalkr facilitates all 3 of the most common steps in an interactomic pipeline; interfacing with online PPI repositories, 37 pruning the resulting network, and ranking nodes to produce scored gene sets ( Figure 1). We implemented this functionality in 38 the "load_ppi", "gfilter", and "node_repression" methods (described below). 39 Figure 1. Example workflow for crosstalkr software package. crosstalkr supports all 3 of the most common steps in an interactomic pipeline; interfacing with online PPI repositories, pruning the resulting network, and ranking nodes to produce scored gene sets.

40
'load_ppi' allows users to interface directly with online protein interaction repositories. Currently, we support StringDB 9 , 41 and Biogrid 8 . If users specify StringDB as the ppi, they can customize the incoming PPI based on species code, interaction 42 confidence score, and interaction type. Of note, StringDB incorporates Biogrid curated interactions in their network inference 43 process 9 . Biogrid can therefore be thought of as a more strict subset of the PPI provided by StringDB. load_ppi will standardize 44 the incoming data and return an igraph object where proteins are vertices and binary interactions are undirected edges. However, 45 users do not necessarily need to interact with load_ppi directly. The methods below will call load_ppi if the 'use_ppi' bool 46 is set to TRUE. Finally, users are encouraged to provide a file path to the 'cache' argument. The returned igraph object will then be stored at the local file path as a .Rda file rather than requiring users to repeatedly download resources from online 48 repositories. By default, cached resources will be used when load_ppi is called subsequently.

50
'gfilter' and related methods allow users to reduce large graphs to subgraphs based on a user-supplied method. Users can supply 51 a graph as an igraph object or use a PPI network. All node scoring methods in the igraph package 21 are supported, in addition 52 2/11 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 10, 2023. ; https://doi.org/10.1101/2023.03.07.531526 doi: bioRxiv preprint to simple node ranking based on a user-provided named numeric vector (i.e. gene expression). Node scoring methods also 53 include one thermodynamic measure, network potential (also called Gibbs free energy) 25 . Finally, users can use an RWR-based 54 method (described below). In all cases, users specify the number of nodes to be kept (n), and whether nodes should be ranked 55 in descending order. After computing the node score based on the user-supplied method, a subgraph is creating using the 56 'induced_subgraph' igraph method.

58
In the compute_crosstalk function, we implement a system for identification of phenotype-specific subnetworks. If users plan 59 to search a supported protein protein interaction network, they are only required to provide a vector of seeds proteins. Seed 60 proteins are used as the starting point for construction of the phenotype-specific sub-network, analogous to search terms in a 61 web search. In this situation, compute_crosstalk will: 62 1. Download the requested PPI (or load it from the provided cache) 63 2. Process the requested PPI into a sparse adjacency matrix. 3. Perform a random walk with restart using the user provided seeds to generate affinity scores for every protein in the PPI.
4. Perform many random walks with restarts from n random seeds with a matching degree distribution to generate a null 66 distribution of affinity score.  The node repression function provides support for in silico repression, a method to rank nodes for the identification of potential 74 drug targets. In silico repression attempts to score the importance of a given node by computing some global measure of 75 network state before and after node removal. A more formal definition of our in silico repression implementation can be 76 found in Section 5. Currently, network potential (sometimes called Gibbs Free Energy) is the only state function available.

77
Generalizing node_repression to any state function that scores nodes is an active area of development. Here, We present two possible bioinformatic pipelines for drug target identification. Both rely on the integration of RNA or 86 protein expression data from a given model system with interactomic data from protein-protein interaction networks. Integration 87 of specific phenotypic data helps address the shortcoming that publicly available PPI repositories are not context-specific 26 . In 88 the first, we execute a two-step pipeline, where the PPI network is filtered based on RNA expression data. We then rank the 89 nodes again based on betweenness centrality to identify the most critical hub genes of the network. In the second, we use in 90 silico repression to rank the nodes rather than a simple node scoring method after a filtration step.

91
For these examples, we will rely on previously published gene expression data from a Ewing Sarcoma cell line (A673)g 92 (described in more detail here). Ewing Sarcoma is a rare bone malignancy primarily seen in pediatric and adolescent patients. It 93 is thought to be driven by a fusion event that creates an abberant EWS-FLI1 transcription factor 27 . EWS-FLI1 then activates 94 downstream effectors that drive cellular growth and proliferation. The specific transcriptional effects have been studied 95 extensively in the last 30 years [28][29][30] . Despite improved knowledge about the molecular mechanisms that cause Ewing Sarcoma, 96 meaningful clinical improvements have not been made in decades 31 . While it is not our goal here to suggest specific targets 97 for therapy, novel therapeutics with efficacy in Ewing Sarcoma are sorely needed. We will first isolate a single replicate for 98 convenience and log2 normalize (E = log 2 (E)) the gene expression values E. Then we will format the expression values into a 99 named numeric vector, the preferred format for interacting with crosstalkr. . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made First, we will use the gene expression values to reduce the size of the PPI network we need to analyze (Listing 1). gfilter.value performs the following actions: 1. Download and modify PPI according to user-provided parameters (passed to load_ppi) Figure 2. Protein-protein interaction subnetwork for GAPDH, HSP90AA1, EEF1A1,HNRNPC, and TPT1. HSP90AA1 was identified as a critical hub in the computed subnetwork.
Not including the initial seeds, NUDCD2, TSSK6m RALYL, POTEF, and TOMM34 showed the highest affinity for the 148 provided seed proteins. Next, we will re-analyze these data using a different combination of analytic steps to illustrate more 149 features of crosstalkr.

5/11
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 10, 2023. ; https://doi.org/10.1101/2023.03.07.531526 doi: bioRxiv preprint Listing 4. gfilter can reduce the size of a graph using centrality measures like degree to rank nodes.

Node Scoring
Next, we will demonstrate a method for node scoring that uses in silico repression (Listing 5). For this example, we will use 161 network potential as the state function. The resulting data structure is sparse matrix that shows the effect of removing the node 162 in a given column on the state function value for the node represented in a given row. The global effect of removing a given 163 node is then the column sum. 4 names(dnp)[1:5] 168 5 #> [1] "HSP90AA1" "RPS27A" "UBA52" "CDK1" "CCNB1" matrix describing the graph. seeds is a vector describing the indexes of the provided seeds. gamma is a number describing the 207 discount rate. eps describes the stop condition. When the change in p in a single iteration is less than eps, the algorithm will 208 stop. tmax describes the largest number of iterations that the algorithm will attempt. The ouput is p, a vector containing the 209 affinity score of all nodes in w relative to the provided seeds. Listing 6. psuedocode for crosstalkr implementation of random walk with repeats

226
We computed bootstrapped null distributions using two related methods. In the algorithm implemented by compute_crosstalk,

227
we bootsrapped a null by re-calculating affinity scores (through RWR) associated with a series of randomly selected seeds that 228 shared a similar degree distribution compared to the user-provided seeds. For more details on this method, refer to 11 . In the 229 algorithm implemented by compute_null_dnp, we bootstrapped a null by generating n completely new graphs with preserved 230 degree distribution using the "keeping_degseq" method. We described this method in detail in our previous work 6 .

232
Network Potential (i.e. Gibbs Free Energy) is a node-specific metric that depends on the local context of each node in a graph. 233 We discuss network potential in detail in our previous paper 6 . For a given node V i and node weight (C i ) in a graph G(V, E), 234 network potential (G i ) can be computed as follows: where C j refers to the node weights for all neighbors of V i .

237
Betweenness centrality measures the number of shortest paths that traverse a given node v. Betweenness can be computed as 238 follows: where σ st is the number of shortest paths from node s to nodet and σ st (v) is the number of shortest paths from node s to node t 240 that traverse node v. In silico repression attempts to quantify the importance of a given node v in some graph G(V, E) by computing the a global 243 measure of graph state S before and after the removal of node v from the graph. In crosstalkr, in silico repression is a 6-step 244 process: 245 4. Re-calculation of node score s i v for all nodes v ∈ V . (In practice, we only re-calculate the node score of those nodes 249 affected by the removal of node i) 250 5. Re-calculation of total network state S i = ∑ i =v s iv 251 6. Calculation of ∆S i = S − S i where ∆S i is used to score and rank nodes.

8/11
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 10, 2023. ; https://doi.org/10.1101/2023.03.07.531526 doi: bioRxiv preprint