Summary
Transcriptomic analysis is used to capture the molecular state of a cell or sample in many biological and medical applications. In addition to identifying alterations in activity at the level of individual genes, understanding changes in the gene networks that regulate fundamental biological mechanisms is also an important objective of molecular analysis. As a result, databases that describe biological pathways are increasingly relied on to assist with the interpretation of results from large-scale genomics studies. Incorporating information from biological pathways and gene regulatory networks into a genomic data analysis is a popular strategy, and there are many methods that provide this functionality for gene expression data. When developing or comparing such methods, it is important to gain an accurate assessment of their performance, with simulation-based validation studies a popular choice. This necessitates the use of simulated data that correctly accounts for pathway relationships and correlations. Here we present a versatile statistical framework to simulate correlated gene expression data from biological pathways, by sampling from a multivariate normal distribution derived from a graph structure. This procedure has been released as the graphsim R package (https://github.com/TomKellyGenetics/graphsim) and is compatible with any graph structure that can be described using the igraph package.
Statement of Need Provides a flexible framework to simulate biological pathways from a graph structure based on a statistical model of gene expression.
Introduction: inference and modelling of biological networks
Network analysis of molecular biological pathways has the potential to lead to new insights into biology and medical genetics [1, 2]. Since gene expression profiles capture a consistent signature of the regulatory state of a cell [3–5], they can be used to analyse complex molecular states with genome-scale data. However, biological pathways are often analysed in a reductionist paradigm as amorphous sets of genes involved in particular functions, despite the fact that the relationships defined by pathway structure could further inform gene expression analyses. In many cases, the pathway relationships are well-defined, experimentally-validated, and are available in public databases [6]. As a result, network analysis techniques could play an important role in furthering our understanding of biological pathways and aiding in the interpretation of genomics studies.
Gene networks provide insights into how cells are regulated, by mapping regulatory interactions between target genes and transcription factors, enhancers, and sites of epigenetic marks or chromatin structures [1, 7]. Inference of these regulatory interactions for genomics investigations has the potential to radically expand the range of candidate biological pathways to be further explored, or to improve the accuracy of bioinformatics and functional genomic analysis. A number of methods have already been developed to utilise timecourse gene expression data [8, 7] using gene regulatory modules in state-space models and recursive vector autoregressive models [9, 10]. Various approaches to gene regulation and networks at the genome-wide scale have lead to novel biological insights [8, 11]. However, inference of regulatory networks has thus far relied on experimental validation or resampling-based approaches to estimate the likelihood of specific network modules being predicted [12, 13].
There is a need, therefore, for a systematic framework for statistical modelling and simulation of gene expression data derived from hypothetical, inferred or known gene networks. Here we present an R package to achieve this, where samples from a multivariate normal distribution are used to generate normally-distributed log-expression data, with correlations between genes derived from the structure of the underlying pathway or gene regulatory network. This methodology enables simulation of expression profiles that approximate the log-transformed and normalised data from microarray and bulk or single-cell RNA-Seq experiments. This procedure has been released as the graphsim R package to enable the generation of simulated gene expression datasets containing pathway relationships from a known underlying network. These simulated datasets can be used to evaluate various bioinformatics methodologies, including statistical and network inference procedures.
Methodology and software
Here we present a procedure to simulate gene expression data with correlation structure derived from a known graph structure. This procedure assumes that transcriptomic data have been generated and follow a log-normal distribution (i.e., log(Xij) ~ MVN(μ, ∑), where μ and ∑ are the mean vector and variance-covariance matrix respectively, for gene expression data derived from a biological pathway) after appropriate normalisation [14, 15]. Log-normality of gene expression matches the assumptions of the popular limma package, which is often used for the analysis of intensity-based data from gene expression microarray studies and count-based data from RNA-Seq experiments. This approach has also been applied for modelling UMI-based count data from single-cell RNA-Seq experiments in the DESCEND package [16].
In order to simulate transcriptomic data, a pathway is first constructed as a graph structure, using the igraph R package [17],, with the status of the edge relationships defined (i.e, whether they activate or inhibit downstream pathway members). This procedure uses a graph structure such as that presented in Figure 1a. The graph can be defined by an adjacency matrix, A (with elements Aij), where
A matrix, R, with elements Rij, is calculated based on distance (i.e., number of edges contained in the shortest path) between nodes, such that closer nodes are given more weight than more distant nodes, to define inter-node relationships. A geometrically-decreasing (relative) distance weighting is used to achieve this: where dij is the length of the shortest path (i.e., minimum number of edges traversed) between genes (nodes) i and j in graph G. Each more distant node is thus related by compared to the next nearest, as shown in Figure 2b. An arithmetically-decreasing (absolute) distance weighting is also supported in the package which implements this procedure:
Assuming a unit variance for each gene, these values can be used to derive a ∑ matrix: where ρ is the correlation between adjacent nodes. Thus covariances between adjacent nodes are assigned by a correlation parameter (ρ) and the remaining off-diagonal values in the matrix are based on scaling these correlations by the geometrically weighted relationship matrix (or the nearest positive definite matrix for ∑ with negative correlations).
Computing the nearest positive definite matrix is necessary to ensure that the variance-covariance matrix could be inverted when used as a parameter in multivariate normal simulations, particularly when negative correlations are included for inhibitions (as shown below). Matrices that could not be inverted occurred rarely with biologically plausible graph structures but this approach allows for the computation of a plausible correlation matrix when the graph structure given is incomplete or contains loops. When required, the nearest positive definite matrix is computed using the nearPD function of the Matrix R package [18] to perform Higham’s algorithm [19] on variance-covariance matrices. The graphsim package gives a warning when this occurs.
Illustrations
Generating a Graph Structure
The graph structure in Figure 1a was used to simulate correlated gene expression data by sampling from a multivariate normal distribution using the R package [20, 21]. The graph structure visualisation in Figure 1 was specifically developed for (directed) iGraph objects in and is available in the and packages. The plot_directed function enables customisation of plot parameters for each node or edge, and mixed (directed) edge types for indicating activation or inhibition. These inhibition links (which occur frequently in biological pathways) are demonstrated in Figure 1b.
A graph structure can be generated and plotted using the following commands in R: #install packages required (once per machine) install.packages(“igraph”) if(! require(“devtools”)){ install.packages(“devtools”) library(“devtools”) } devtools::install_github(“TomKellyGenetics/graphsim”) #load required packages (once per R instance) library(“igraph”) library(“graphsim”) #generate graph structure graph_edges <-rbind(c(“A”, “C”), c(“B”, “C”), c(“C”, “D”), c(“D”, “E”), c(“D”, “F”), c(“F”, “G”), c(“F”, “I”), c(“H”, “I”)) graph <-graph.edgelist(graph_edges, directed = TRUE) #plot graph structure (Figure 1) plot_directed(graph, state =“activating”, layout = layout.kamada.kawai, cex.node=3, cex.arrow=5, arrow_clip = 0.2) #generate parameters for inhibitions state <-c(1, 1, -1, 1, 1, 1, 1, -1, 1) #plot graph structure with inhibitions (Figure 2) plot_directed(graph, state=state, layout = layout.kamada. kawai, cex.node=3, cex.arrow=5, arrow_clip = 0.2)
Generating a Simulated Expression Dataset
The correlation parameter of ρ = 0.8 is used to demonstrate the inter-correlated datasets using a geometrically-generated relationship matrix (as used for the example in Figure 2c). This ∑ matrix was then used to sample from a multivariate normal distribution such that each gene had a mean of 0, standard deviation 1, and covariance within the range [0, 1] so that the off-diagonal elements of ∑ represent correlations. This procedure generated a simulated (continuous normally-distributed) log-expression profile for each node (Figure 2e) with a corresponding correlation structure (Figure 2d). The simulated correlation structure closely resembled the expected correlation structure (∑ in Figure 2c) even for the relatively modest sample size (N = 100) illustrated in Figure 2. Once a gene expression dataset comprising multiple pathways has been generated (as in Figure 2e), it can then be used to test procedures designed for analysis of empirical gene expression data (such as those generated by microarrays or RNA-Seq) that have been normalised on a log-scale.
The simulated dataset can be generated using the following code: #adjacency matrix adj_mat <-make_adjmatrix_graph(graph) #relationship matrix dist_mat <-make_distance_graph(graph_test4, absolute = FALSE) #sigma matrix directly from graph sigma_mat <-make_sigma_mat_dist_graph(graph, 0.8, absolute = FALSE) #show shortest paths of graph shortest_paths <-shortest.paths(graph) #generate expression data directly from graph expr <-generate_expression(100, graph, cor = 0.8, mean = 0, comm = F, dist = TRUE, absolute = FALSE, state = state) #plot adjacency matrix heatmap.2(make_adjmatrix_graph(graph), scale = “none”, trace = “none”, col = colorpanel(3, “grey75”, “white”, “blue”), colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) #plot relationship matrix heatmap.2(make_distance_graph(graph_test4, absolute = FALSE), scale = “none”, trace = “none\”, col = bluered(50), colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(graph, 0.8, absolute = FALSE), scale = “none”, trace = “none”, col = bluered(50), colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))
expr <-generate_expression(100, graph, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated expression data heatmap.2(expr, scale = “none”, trace = “none”, col = bluered(50), colsep = 1:length(V(graph)), rowsep = 1:length(V(graph))) #plot simulated correlations heatmap.2(cor(t(expr)), scale = “none”, trace = “none”, col = bluered(50), colsep = 1:length(V(graph)), rowsep = 1:length(V(graph)))The simulation procedure (Figure 2) can similarly be used for pathways containing inhibitory links (Figure 3) with several refinements. With the inhibitory links (Figure 3a), distances are calculated in the same manner as before (Figure 3b) with inhibitions accounted for by iteratively multiplying downstream nodes by –1 to form modules with negative correlations between them (Figures 3c and 3d). A multivariate normal distribution with these negative correlations can be sampled to generate simulated data (Figure 3e).
The simulation procedure is also demonstrated here (Figure 4) on a pathway structure for a known biological pathway (from reactome R-HSA-2173789) of TGF-β receptor signaling activates SMADs (Figure 4a) derived from the Reactome database version 52 [6]. Distances are calculated in the same manner as before (Figure 4b) producing blocks of correlated genes (Figures 4c and 4d).
This shows that multivariate normal distribution can be sampled to generate simulated data to represent expression with the complexity of a biological pathway (Figure 4e). Here SMAD7 exhibits negative correlations with the other SMADs consistent with it’s functions as as an “inhibitor SMAD” with competitively inhibits SMAD4.
These simulated datasets could then be used for simulating synthetic lethal partners of a query gene within a graph network. The query gene was assumed to be separate from the graph network pathway and was added to the dataset using the procedure in Section [methods:simulating_SL]. Thus I can simulate known synthetic lethal partner genes within a synthetic lethal partner pathway structure.
Summary and discussion
Biological pathways are of fundamental importance to understanding molecular biology. In order to translate findings from genomics studies into real-world applications such as improved healthcare, the roles of genes must be studied in the context of molecular pathways. Here we present a statistical framework to simulate gene expression from biological pathways, and provide the package in to generate these simulated datasets. This approach is versatile and can be fine-tuned for modelling existing biological pathways or for testing whether constructed pathways can be detected by other means. In particular, methods to infer biological pathways and gene regulatory networks from gene expression data can be tested on simulated datasets using this framework. The package also enables simulation of complex gene expression datasets to test how these pathways impact on statistical analysis of gene expression data using existing methods or novel statistical methods being developed for gene expression data analysis.
Computational details
The results in this paper were obtained using R 3.6.1 with the igraph 1.2.4.1 Matrix 1.2-17, matrixcalc 1.0-3, and mvtnorm 1.0-11 packages. R itself and all dependent packages used are available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/. The graphsim and plot.igraph packages presented can be installed from https://github.com/TomKellyGenetics/graphsim and https://github.com/TomKellyGenetics/plot.igraphrespectively. These functions can also be installed using the igraph.extensions library at https://github.com/TomKellyGenetics/igraph.extensions which includes other plotting functions used. This software is cross-platform and compatible with R installations on Windows, Mac, and Linux operating systems. The package GitHub repository also contains Vignettes with more information and examples on running functions released in the R package. The package (graphsim 0.1.2) meets CRAN submission criteria and will be released.
Acknowledgements
This package was developed as part of a PhD research project funded by the Postgraduate Tassell Scholarship in Cancer Research Scholarship awarded to STK. We thank members of the Laboratory of Professor Satoru Miyano at the University of Tokyo, Institute for Medical Science, Professor Seiya Imoto, Associate Professor Rui Yamaguchi, and Dr Paul Sheridan (Assistant Professor at Hirosaki University,CSO at Tupac Bio) for helpful discussions in this field. We also thank Professor Parry Guilford at the University of Otago, Professor Cristin Print at the University of Auckland, and Dr Erik Arner at the RIKEN Center for Integrative Medical Sciences for their excellent advice during this project.
Footnotes
↵† mik.black{at}otago.ac.nz
https://joss.theoj.org/papers/96016c6a55d7f74bacebd187c6ededd6