## Abstract

Expression analysis and variant calling workflows are employed to identify genes that either exhibit a differential behaviour or have a significant functional impact of mutations. This is always followed by pathway analysis which provides greater insights and simplifies explanation of observed phenotype. The current techniques used towards this purpose have some serious limitations. In this paper, we propose a theoretical framework to overcome many limitations of current techniques. Our framework takes into account the networked nature of the data and provides facility to weigh each gene differently and in the process we do away with the need of arbitrary cut-offs. This framework is designed to be modular and provides the researchers with flexibility to plug analytical tools of their choice for every component. We also demonstrate effectiveness of our approach for personalized and cohort analysis of cancer gene expression samples with PageRank as one of the modules in the framework.

## 1 Introduction

Gene expression analysis and structural variants detection tools are used to identify genes that are significantly affected in a given disease condition. Tools ranging from earlier generation microarrays to Next Generation Sequencing like RNAseq, DNAseq and Exomeseq are often used to achieve this task. Microarray and RNAseq gene expression experiments are performed to measure changes in gene expression levels across conditions like normal vs tumor. Statistical analysis of this data usually results in p-value and fold change for each gene. Cutoffs are applied on both p-value (usually less than 0.05) and absolute fold change (usually greater than 2) to declare some of the genes as statistically significant. Structural Variant analysis workflows detect the variants in the given sample and various tools like SIFT [1], PolyPhen2 [2] and MutationAssessor [3] are employed to identify the functional impact of the mutations. Sometimes these are converted to gene level p-values and some genes are shortlisted based on certain thresholds. However, a list of significant genes alone does not provide necessary explanatory power owing to the complex interaction network of these genes. Therefore, these significant genes are further used to detect significantly affected pathways. Pathways are essentially groupings of genes based on their interactions and functions. Each pathway represents a subgraph of the gene interaction network and represents a specific cellular functionality. Techniques such as Over-Representation Analysis [4], Functional Class Scoring [5] and Pathway Topology based analysis are used to identify significantly affected pathways[6].

### 1.1 Need For a New Pathway Analysis Framework

Though a great variety of methodologies are available for pathway analysis, there are serious inconsistencies in these current approaches. Two recent papers by Khatri et al. [6] and by Mitrea et al. [7] provide a detailed review of various analytical approaches along with their shortcomings. The following are some recurring issues -

p-value depends on the nature of the test performed, type of multiple testing correction and more importantly degrees of freedom.

Reliable tests are not available for personalized analysis using a normal tumor sample pair.

The cutoffs are rigid and therefore a lot of information is lost. For example, a gene of p-value 0.05 and fold change 2 qualifies to be statistically significant whereas gene with p-value 0.051 and fold change 4 is not considered for further downstream analysis.

All the significantly expressed genes are treated equally for their pathways analysis irrespective of differences in their fold change values.

Pathways analysis treats and tests each pathway independently. However pathways interact with each other and many genes are part of more than one pathway.

Many of the current pathway databases do not concur on the graph structure of the pathways and the interactions graphs are far from complete (Soh, 2010). This uncertainty is not modeled by current algorithms

To summarize, the current methodology uses ad-hoc cutoffs, uses only part of the information and does not model the interactive nature of the data.

## 2 A Novel Approach

To address the problems mentioned in previous section, we propose a new approach which accommodates information of every single gene in the experiment and models the network of interactions of the genes. Our model is based on the PageRank algorithm [8]. PageRank has a proven mathematical foundation [9] and has been successfully applied to variety of network analysis problems across domains including biology [10].

The model is very general and each component can be modeled in a variety of different ways depending on nature of the task at hand. We illustrate with this with some examples for each component in section 4.

We first define the notation that we use in our model followed by the description of PageRank. We then discuss the model - the components and their interpretation.

### 2.1 Notation

Table 1 shows the notation that we follow in this paper.

Small letters are used to denote scalar quantities. Uppercase letters are used to denote sets. We use bold type faces to denote matrices and vectors. Bold uppercase letters are used to represent a matrix while bold lowercase letters are used to represent vectors. Subscripts are used to denote individual components.

### 2.2 PageRank

Given a directed, weighted graph *G*(*V*, *E*) consisting of the set of edges *E* which represent the interactions between vertices (genes) in the set *V*, the PageRank vector measures the importance of each vertex.

Each element of the adjacency matrix of the graph *G* is given by -

If the weights of the interactions are not available, the matrix *A _{ij}* reduces to a boolean matrix with

*w*= 1 for each interaction.

_{ij}The PageRank vector is computed on the normalized adjacency matrix *M* of the graph, where each entry is divided by the sum of its row (also called the outdegree of the vertex). The properties of *M* are -

0 ≤

*M*≤ 1_{ij}*M*= 0 if gene_{ij}*i*does not interact with gene*j*

The |*V*|-dimensional PageRank vector *r* is computed iteratively as the solution to following equation

PageRank and personalized PageRank are illustrated using a toy 11-vertex network in Figure 1. The vertices are sized according to their PageRank The PageRank vector can be interpreted as the probability of landing at a given node by a random walker who jumps from vertex to vertex at each iteration. The random walker can with a certain probability *α* choose to teleport at random to any vertex in the graph instead of following an edge. The unit vector *ŝ* of the |*V*|-dimensional personalization vector *s* gives the probability of landing at a vertex if the walker teleports. *α* is known as the damping factor, and is a tunable parameter between 0 and 1.

The algorithm to compute the PageRank vector is described in Algorithm 1.

### 2.3 Pathway Analysis Model

Our model uses an available gene-gene interaction network. In this new approach, we treat gene scores such as fold changes or functional impacts as the extent to which the gene is disturbed and determines the personalization vector. Hence, each iteration of the page rank algorithm represents the propagation of disturbance through the gene interaction network. These are the components of our model

*G*(*V*,*E*) = The gene-gene interaction network, consisting of the set of interactions*E*between the set of genes*V*.= |*p**V*|-dimensional vector of gene p-values.= |*f**V*|-dimensional vector of gene fold changes.*ϕ*() = a vector valued function that assigns weights to given vector of genes;*p*,*f**α*= damping factor; 0 ≤*α*≤< 1*r*= rank vector, computed using the PageRank algorithm using the quantities*G*(*V*,*E*), the personalization vector computed on the basis of*ϕ*() and*p*,*f**α*as input.

The interpretations of the components are as follows

*ϕ*determines the personalization vector*s*, which indicates the bias of interaction of network towards various genes. The bias is directly proportional to*s*.*α*= damping factor serves dual purpose (Boldi, 2005). It can be interpreted as our faith in current graph structure and it can also be used to adjust the bias induced by personalization vector. A very small value (close to zero) indicates that we don’t trust the current graph structure and any gene would randomly interact with any other gene in the network irrespective of the edges in the graph. The magnitude of alpha also affects the rate of convergence of the algorithm. Rate of convergence is inversely proportional to magnitude of*α*.*α*can also be used to model uncertainty and incompleteness of gene network. Most networks use value of around 0.85. We would recommend a values smaller than that (in range of 0.7 to 0.75) to compensate for the incompleteness of the interaction graph-structure.*r*= rank vector represents the final extent to which each gene is affected.

Thus, the page rank model takes into account the interconnected nature of the data and it can compensate for the incompleteness of the interaction graph structures.

## 3 An Algorithm for Pathway Analysis

Here, we present an approach for pathway analysis which differs from current enrichment based approaches. We would order the pathways based on how distribution ranks of the genes in a given pathway changes compared to itself and in overall context. For this purpose, we compute two different page ranks. These correspond to the following two scenarios

The term

**1**_{|V|}corresponds to a situation where the network has no bias.The term

*ϕ*() decides the extent to which each gene is disturbed and creates proportionate bias.*p*,*f*For both of the vectors, each entry is multiplied by the outdegree of the corresponding gene in the graph. The resulting vectors

*s*and^{N}*s*are used as the personalization vectors for the normal and diseased condition respectively.^{A}This makes sure that the affected genes with higher outdegrees disturb the graph more. Without this correction, the disturbance cause by a gene with higher degree gets diluted.

Here, we use same interaction matrix *M* and damping factor *α* in both these scenarios. Corresponding to personalization vectors *s ^{N}* and

*s*we get two rank vectors

^{A}*r*and

^{N}*r*respectively. Now we compute the distance of these two vectors for each pathway using distance metric of our choice. The pathways are sorted in descending order of absolute distance to quantify relative affectedness of the pathways. This provides the appropriate prioritization of the affected pathways. We prefer not have any cutoff to remove pathways. In order to compute relative affect-edness, we can compute divergence value of each pathway and sort pathways in descending order of pathways. A p-value can be computed from KL divergences if necessary. The entire procedure is summarised in Algorithm 2.

^{A}## 4 Illustrations for fine tuning of various components

Following are the components which one can fine tune

*ϕ*(): following are some possible definitions of f*p*,*f**ϕ*=**1**–where 1 denotes vector of all ones - Here we ignore fold change and use only p-values*p**ϕ*= –*log*() - This formulation will introduce non-linear exaggeration of p-values while deciding weights.*p**ϕ*= || - This uses absolute value of fold change*f**ϕ*= (1 –) * |*p*| - This uses both p-value and fold change.*f**ϕ*= –*log*() * |*p*- This use both p-value and fold change but non-linearly exaggerates impact of p-value.*f*

In short, one can combine multiple linear and nonlinear combinations of p-vlues and fold change to decide weights of genes in personalization vector.

Divergence - Consider a pathway

*P*with k genes*g*_{1},*g*_{2},*g*, and let be their respective values in rank vector_{k}*r*and be their respective values in rank vector^{N}*r*.^{A}KL divergence - We can compute impact of disease on pathway p as Kullback-Leibler (KL) divergence of disease pathway vector

*r*from normal pathway vector^{A}*r*. KL divergence is used to calculate distance between two probability distributions. In the given context, impact of disease on a pathway can be measured as^{N}Where

*D*(^{KL}*r*||^{N}*r*) is divergence of^{A}*r*from^{A}*r*. KL divergence is a asymmetric measure, in certain scenarios a symmetric measure might be desirable. Jensen-Shannon divergence can be used in these scenarios.^{N}Mean Absolute Deviation (MAD): average of absolute fold changes of rank values of disease to normal.

Interaction adjacency matrix M:

We have chosen to weigh all types of interaction equally. Different interactions (that is graph edges) can be assigned different weights depending on rate of interaction.

We have used unsigned matrix. However, one can model interactions as signed edges. In such scenarios, variations of page rank from social network analysis based on trust and distrust propagation (Guha, 2004) can be used to compute rank.

M can represent heterogeneous graph that includes more cellular entities which can provide much more comprehensive model. In addition to genes, following are the some of the entities one can add

miRNAs here miRNAs can be connected to their targets. Edges could be signed.

promoters connected with corresponding gene. This can be useful to model impact of methylation

M can even represent interactions among transcripts instead of genes. A gene can give rise to multiple transcripts and each transcript can code for different protein. With RNAseq, it is possible to estimate transcript description accurately. A transcript graph would provide finer level of information.

## 5 Materials and Methods

We use Reactome pathway database (Version 2014) to generate gene-gene interaction network. We use publicly available data from TCGA and GEO for our analysis. We demonstrate results of two analysis scenarios.

Personalized Analysis

Cohort Analysis

Damping factor is set at *α* = 0.7. For nodes that do not exist in the expression analysis, a default value of 1 was assumed. We use mean absolute deviation (Equation (3)) of the rank values to calculate pathway level disturbances.

### 5.1 Personalized Analysis

This analysis considers only a single patient’s data. Hence, a normal-tumor breast cancer sample pair (TCGA-BH-A0H7-11A-13R-A089-07, TCGA-BH-A0H7-01A-13R-A056-07) gene expression data is analyzed.

Here, *ϕ*(* p, f*) = |

*f*|. Only the absolute fold change values are used for personalization. No p-value is calculated or used in this analysis. Top twenty pathways from this analysis are presented in Table 2.

### 5.2 Cohort Analysis

This analysis is performed on small cell lung carcinoma samples. Microarray gene expression data from 60 pairs of normal and tumor is analyzed. This data is obtained from GEO( Dataset Record: GDS3837, Data Accession Series: GSE19804). p-value is obtained using a paired t-test.

Here, a personalization function based on both the computed p-values and the fold changes is used - *ϕ*(* p, f*) =

*f** (1 –

*p*). Top twenty pathways from this analysis are presented in table 3

## 6 Related Work

As explained in earlier sections, PageRank attempts to propagate an entity (belief, disturbance, perturbation- depending on the application) across the edges of the graph until steady state is achieved. The convergence is guaranteed if adjacency matrix is non negative. Certain variants model it for signed graphs, however due to lack of convergence they perform some maximum number of iterations and accept the final vector as rank vector. There are some examples of PageRank and its variants being used in NGS and pathway analysis in general. We briefly explain two such examples this section.

SPIA [11] combines both over representation analysis and perturbation analysis to come up with a p-value for a pathway. The perturbation factor for each gene is calculated as sum of signed log-fold change and the sum of perturbation factors of the genes directly upstream of the target gene, normalized by the number of downstream genes. So the perturbation propagation is similar to belief propagation. However this approach has two shortcomings. The perturbation propagation involves negative entries, so algorithm is not guaranteed to converge. Once the fold changes are captured in personalization vector, overrepresentations analysis becomes redundant.

DawnRank [12] attempts to rank mutated genes in a single cancer patient based on its potential to be a driver gene. Here fold change serves as personalization component of PageRank. However, it uses *M* instead of *M ^{T}* in PageRank formulations. That means nodes with higher centrality values get ranked higher. Here node centrality is interpreted as ability to affect other nodes in the graph.

## 7 Future Scope

In this paper, we proposed a new framework for pathway analysis which addresses some of the imminent issues of the state of the art. However, there is huge scope for improvement as follows.

The current pathway databases are far from complete and have high degree of discordance. Though damping factor mitigates some of the uncertainty, a comprehensive pathway database will lead to more accurate analysis

Rate of interaction is not currently available for all the interactions in the network. When rate of interactions will be available for all the interactions, Flux Balance Analysis (FBA) could be employed or page rank algorithm needs to be modified to accommodate rate of interaction.

The interaction network can be enhanced by accommodating miRNA and transcripts