Abstract
Gene regulation is an important fundamental biological process. The regulation of gene expression is managed through a variety of methods including epigentic processes (e.g., DNA methylation). Understanding the role of epigenetic changes in gene expression is a fundamental question of molecular biology. Predictions of gene expression values from epigenetic data have tremendous research and clinical potential. Despite active research, studies to date have focused on using statistical models to predict gene expression from methylation data. In contrast, dynamical systems can be used to generate a model to predict gene expression using epigenetic data and a gene regulatory network (GRN) which can also serve as a mechanistic hypothesis. Here we present a novel stochastic dynamical systems model that predicts gene expression levels from methylation data of genes in a given GRN. Software for dataset preparation, model parameter fitting and prediction generation, and reporting are available at https://github.com/kordk/stoch_epi_lib.
Introduction
Gene regulation is an important fundamental biological process. [1] It involves a number of complex sub-processes that are essential for development and adaptation to the environment (e.g., cell differentiation [2] and response to trauma [3]). Understanding gene expression patterns has broad scientific [4] and clinical [5] potential, including providing insight into mechanisms of regulatory control [1] (e.g., gene regulatory networks) and a patient’s response to disease (e.g., HIV infection [6]) or treatment (e.g., chemotherapy-induced neuropathic pain [7]). The regulation of gene expression is managed through a variety of methods, including transcription, post-transcriptional modifications, and epigenetic processes [8]. One epigenetic process, DNA methylation, [9] occurs primarily at the cytosine base of the molecule that is adjacent to guanine (i.e., CpG site). While evidence exists to support a relationship between methylation and gene expression, the patterns of these associations can vary. [10] DNA methylation of promoter and gene body regions can act to regulate gene expression by repressing [11]) or activating [12] transcription. For example, higher gene expression can be associated with both decreased [13] and increased [14] methylation in regulatory regions, and with decreased methylation within the gene. [15] These associations vary with the distance from the promoter, [16] as well as between individuals and across tissues. [17]
Predicting gene expression levels from epigenetic data is an active area of research. Recent studies have developed models to predict gene expression levels with deep convolutional neural networks from genome sequence data [18] and with regression models from methylation data. [19, 20] Earlier studies developed models to predict expression status (e.g., on/off or high/low) with gradient boosting classifiers from histone modification data [21], with machine learning classification methods from methylation data [22], and from methylation and histone data combined. [23] However, these studies exclusively use a statistical approach to predicting gene expression. One limitation of the deep learning approaches is in the interpretation of the results. [24] A limitation of the linear model approaches are their inability to provide information regarding promoter binding events and the regulatory activities of the system. Neither of these approaches can provide a biological model to explain the expression estimates.
To address these limitations, we developed a dynamic interaction network model [25] that depends on epigenetic changes in a gene regulatory network (GRN). Dynamical systems integrate a set of simple interactions (i.e., transcription factor (TF) binding to a promoter region and subsequent gene expression) across time to produce a temporal simulation of a physical process (i.e., gene regulation in a given GRN). Therefore, the predictions of a dynamical systems model (e.g., TF binding and unbinding events, gene expression levels) emerge from a mechanistic understanding of a process rather than the associations between data (e.g., predicting an outcome from a set of predictor variables). A dynamical systems model can predict gene expression using epigenetic data and a GRN by simulating hypothesized mechanisms of transcriptional regulation. Such models provide predictions based directly on these biological hypotheses, and provide easy to interpret mechanistic explanations for their predictions. The dynamical systems approach offers a number of unique characteristics. First, a stochastic dynamical system provides us with a distribution of gene expression estimates, representing the possibilities that may occur within the cell. Next, the mechanistic nature of the approach means that the model can provide a biological explanation of its predictions in the form of a predicted activity level of various gene-gene regulatory interactions. Finally, a dynamical systems approach allows for the prediction of the effects of a change to the network. To our knowledge, there are no studies that have taken a dynamical systems approach to predicting gene expression from methylation data and a GRN.
Given the opportunity presented by dynamical systems approaches and the potential practical utility, we present a novel stochastic dynamical systems model for predicting gene expression levels from epigenetic data for a given GRN, along with a software package for model parameter fitting and prediction generation (available at https://github.com/kordk/stoch_epi_lib).
Methods
Here we use a dynamical systems approach to develop and fit a model to predict gene expression levels and transcription factor binding affinities from methylation data Fig. 1.
Model Equations
We model gene regulation using a piecewise-deterministic Markov process (PDMP) as introduced in Davis 1984 [26] (see also [27, 28]) given by the equations: and where Bi(t) ∈ {0, 1}, is a boolean random variable representing the bound/unbound state of a binding site region of DNA and gi is the transcript amount the genes modeled. Equation (1) is given as the sum of two Poisson jump processes and which take values in ℤ≥0, and are piecewise constant between randomly spaced discrete time points (the binding and unbinding events) [29].
The propensities and are taken to be linear functions of the available transcription factors, which is assumed to be the same as the transcript variables gj. We take the values κij ∈ {0, 1}; these parameters along with the set of ϕji represent the structure of the underlying gene regulatory network.
We include the term to represent the impact of epigenetic modification on the binding propensity of transcription factors. In this term, αi is the measured epigenetic modification to a transcription factor binding site (e.g. percentage of methylated bases). Equation (3) is a sigmoidal function which is either strictly increasing or strictly decreasing depending on the sign of νi.
Finally, we use a linear ODE for the value of the transcripts gj. We take ϕji ∈ {−1, 0, 1} based on the structure of the underlying gene regulatory network. We include baseline transcription γj and decay dj. Because we use a linear ODE in Eq. (2), we can solve exactly between jumps of B.
It is common practice in the study of reaction networks modeled as stochastic jump processes to represent the process using so called “chemical master equation” [29, 30] (i.e. the Kolmogorov forward equation for the jump process), which can be used within optimization methods to learn parameters for the system [31]. The generator for a PDMP can be defined (see Azaïs 2014 [32] for details). We define a density P (Bi, g(t), t) = Pi(g), i = 1, ..., |B| for each possible state Bi of B such that is the probability distribution for the vector g, and each Pi satisfies
Model Parameter Estimation
The parameters κij, ϕji and γj are determined by the structure of the underlying gene regulatory network and the epigenetic parameter αi is assumed measurable. We estimate the parameters λi, , μi, νi and dj. We estimate these parameters using a negative log-likelihood minimization procedure using stochastic gradient descent. Sample paths used to estimate the gradient of the likelihood are generated using one of two modified forms of Gillespies stochastic simulation algorithm (SSA) [33] which handle time-dependent jump propensities by adding an ODE to the system [27, 30] or by rejecting jumps chosen as in the standard SSA [34]. This procedure involves approximating the gradient of the map from parameter set to log-likelihood so that we may use a gradient descent method.
We can compute a log-likelihood for a set of paired epigenetic and transcription samples by time averaging a sample path against a Gaussian kernel. We estimate the likelihood of a sample of transcript data of d genes using a model realization B(t), gθ,α(t) as follows: where h is some smoothing parameter, θ is an n-dimensional vector of model parameters where n is the sum of the sizes of sets of λi, , μi, νi, and α represents the epigenetic parameters αi used by the model. For a dataset D consisting of n sets of matched pairs of transcription and epegenetic data , αl, we define the negative log-likelihood as:
We note that depends on computing gθ,α(t) from a realization of the stochastic process, and so depends on a set ω of random numbers. So that we are minimizing a deterministic map, we choose ω once and use this same random vector to generate every realization needed in computing the estimate . We make this explicit with the notation LD,ω(θ).
In Fig. 2, we give a schematic representation of how is estimated from a set of realizations of the model, each realization corresponding to a single data sample. Details of the gradient estimation are given in supplemental file S1.
Evaluation
Gene Regulatory Network
Gene to gene interactions were defined using the Discriminant Regulon Expression Analysis (DoRothEA) framework. [35] Transcription factor (TF) to target interactions were identified as those with the DoRothEA highest confidence interaction classification and scored as 1 or −1 for upregulating and downregulating, respectively. Binding site to target edges (ϕ) were defined by CpG methylation sites which were associated with changes in transcript expression (eCpG). [36]
Dataset
Matched epigenetic and gene expression data were obtained from whole blood from participants in the Grady Trauma Project (GTP) study (n=243 participants). Methylation data were obtained from the NCBI Gene Expression Omnibus (GEO) (GSE72680) and measured using the HumanMethylation450 BeadChip (Illumina, San Diego, CA). Methylation status was quantified as a beta score. A total of 19,258 eCpG probes were identified. Beta scores for CpG sites within the same region for a gene (i.e., classified as either ‘Promoter’ or ‘TSS’ [36]) were aggregated together as the mean. Gene regions where no DNA methylation data were collected were excluded. A total of 1,885 regions were identified.
Gene expression data were obtained from GEO (GSE58137) measured with the HumanHT-12 expression beadchip V3.0 (Illumina, San Diego, CA). Intensity scores were log2-transformed (mean expression intensity = 189.96, IQR = 49.88 to 106.60). Gene expression probes were first annotated to ENTREZ ID and then annotated to the symbol using the HUGO database. [37]
For evaluation, we identified a set of genes previously identified as deferentially expressed in individuals with PTSD as compared to controls (n=524). [38] Of these, we identified 278 TF to target mappings using the DoRothEA framework. We then used this list of genes to identify additional targets to include beyond initial list. The final set included 252 TF to target relationships comprised of 303 unique target genes. A GRN was built using these 303 genes as input producing a final network with 74 genes with 65 sites (Fig. 3). Of these 74 genes, 29 had sufficient data and regulatory information (i.e., methylation and gene expression data for all individuals, an eCpG binding site, and a TF to gene relationship) for which parameters could be estimated and expression distributions generated.
Cross Validation
Matched gene expression and methylation data from participants measured for expression (n=243) were used for evaluation. This primary dataset was split into training and testing datasets, containing 80%/20% (n=195, n=48 samples, respectively). To avoid the impact of a particular split, we repeated the shuffle process 100 times. [39] For each split of the data, parameter estimation was performed on the training set and equilibrium distributions of the predicted expression levels were generated using the testing set. For every round of cross-validation, the error in prediction was evaluated as the root mean square error (RMSE) [40] between the observed and the estimated expression from our model. To rank methods, the RMSEs (mRMSE) was averaged for each method across the 100 shuffles.
Model Comparison
To evaluate the performance of our gene expression predictions we generated linear regression models using the scikit-learn software package for python [41]. Based on previous studies that developed prediction models for gene expression using methylation data, [19, 20] we generated prediction models using LASSO, Multi-task LASSO, Elastic Net, and Multi-Task Elastic Net. The structural parameters for these models (i.e. penalty parameter and l1-ratio parameter) were determined using scikit-learn’s cross-validation methods with the entire data set. Finally, we fit a null model that is the average of the expression values from the training set. It is the prediction of expression values without any other variables in the model. Models were generated for each of the 100 data train/test shuffles used in our fitted model.
To evaluate the performance of our fitting procedure on gene expression predictions we generated predictions using a randomly generated parameter set a for each of the previously generated splits. Ten random estimates were generated for each shuffle giving 1000 predictions for each gene generated using random parameters. Parameters were estimated for all genes using the procedure detailed in Supplemental File 1.
Results
Across the final models, our fitted parameter model performed the best (Table 1, Fig. 4). Across all 29 genes, our model outperformed the null model as well as the four linear regression models Fig. 5. On average, our model outperformed the best performing linear regression model (i.e., ElasticNet) by a factor of 2. 16 after parameter fitting. The average root mean square errors for each gene across the 100 shuffles is reported in Table 2. We observed the highest performance for AK3 (average RSME = 1.101) and lowest for SCP2 (average RSME = 2.697). In this evaluation, our model biases towards underestimating the expression levels (see supplemental file).
Comparing the model with randomly generated parameters and fitted parameters reveals that our fitting procedure was effective. We see a 4.24-fold improvement in model performance on average after the fitting procedure. In fact, Fig. 4 demonstrates that, with random parameters, our model is unsurprisingly worse than a linear regression, but our fitting procedure returns a model that outperforms linear regression. Examples of the equilibrium distributions generated from the random parameter for the most accurate predicted gene (i.e., AK3) for two individual patients from different shuffles are shown in Fig. 6.
Discussion
In this study, we demonstrate that gene expression levels can be accurately predicted from methylation state of a promoter region and a GRN. Our model successfully uses quantitative data describing epigenetic modification of transcription factor binding sites to generate a probability distribution which describes the possible level of transcript. To our knowledge, this is the first study to develop and evaluate a stochastic dynamical systems model predicting gene expression levels from epigenetic data for a given GRN.
Overall our model outperforms linear regression approaches in the predictions of the model with fitted parameters (e.g., Fig. 7 a & b) and dramatic improvements to prediction relative to a randomly generated set of parameters (e.g., Fig. 7 c & d). We were able to accurately predict gene expression based on the structure of the GRN which allows for the identification of TF and binding sites that are associated with gene expression levels. For example, our model accurately predicted gene expression levels for both AK3 and IRF1, yet the GRN has different numbers of TF for each (i.e., a single TF for AK3 versus multiple TF for IRF1)(Fig. 3). From our initial list of 302 genes for inquiry, our TF to target and binding site reference data produced a gene regulatory network with 74 genes, of which 29 had sufficient regulatory information to be predicted. Although we were unable to evaluate a more complicated GRN from all reference regulatory data due to computational constraints, we expect that model predictions will improve with additional regulatory information. Future work is needed to improve the computational performance of the implementation to support larger and more complicated GRNs.
The estimated fit of the model to training data improved over iterations of the procedure. However, the means and standard deviations from the equilibrium distributions do not converge as quickly as we would like (data not shown). This slow convergence, and the necessity for repeated estimations, mean that computational time is a limiting factor. Future analyses should simulate longer to identify the appropriate cut offs given the data used, and thus improve the fit of the model parameters.
While the use of a stochastic dynamical system offers distinct advantages over more statistically-driven methods, a number of limitations of the our approach warrant discussion. First, our model is based on the assumption that epigenetic modification effects the propensity of the random process of transcription factor binding and unbinding. As seen in other studies, gene expression is a complex mechanisms that involves other epigenetic (e.g., histone modifications and non-coding RNAs) and genetic (e.g., DNA sequence variations) factors and varies across tissues and with age. Next, our model assumes that DNA transcription is a comparatively fast (and so approximated as deterministic) process that depends on transcription factor binding. Finally, our model implicitly assumes that processes of transcription of DNA to RNA and translation from RNA to the functional protein products are immediate. Finally, we limit the scope of our testing to linear production of DNA transcript, depending on transcription factor binding status. Future efforts will be focused on improving the prediction accuracy, improving prediction robustness across training sets, improving computational efficiency, and evaluating across other gene regulatory networks, gene sets, and datasets.
By using a dynamical systems approach, our model generates an estimation of gene expression given DNA methylation based on the mechanistic hypothesis of differential binding affinity of a transcription factor caused by epigenetic modification. Our model provides predictions based directly on the biological hypotheses presented by the GRN thereby providing an easy to identify potential mechanistic hypotheses for their predictions (i.e., the binding of TF to specific sites). In addition to gene expression predictions, the characteristics of the dynamical systems approach offers multiple additional opportunities for future evaluation. First, the dynamical systems approach allows study of complex regulatory networks, including those which contains cycles. The GRN used for evaluation was acyclic. Next, in predicting gene expression our model also predicts gene regulatory activity in the form of the boolean variables Bi(t), which may be interpreted as the bound/unbound state of a regulatory protein at some DNA binding site. Using this information, we expect that our model will provide insight beyond gene expression prediction by identifying specific differential regulatory activity (e.g., which regulatory sites are bound and to what extent). Finally, our model can also be used to predict the effects of changes in methylation states at particular sites on gene expression levels. By perturbing one area of the network (e.g., a binding site), the effects on the rest of the network can be predicted (e.g., differences in regulatory activity due to epigenetic characteristics of tumor versus normal tissues).
In conclusion, we developed a dynamical system model for predicting gene expression using a gene regulatory network and epigenome data. To our knowledge, this is the first study to develop and evaluate a stochastic dynamical systems model predicting gene expression levels from epigenetic data for a given GRN. Using our model, we were able to accurately predict gene expression levels from methylation data and outperformed linear regression models. Future applications of our method will include an evaluation of the additional opportunities offered by the characteristics of a dynamical systems approach including: (1) acyclic GRNs, (2) gene regulatory activity (i.e., binding), and (3) prediction of network perturbations.
Supporting information
S1 File. Further details of the method. Available on Zenodo https://doi.org/10.5281/zenodo.4441111.
S2 Code Repository. Method source code & sample data. Available at GitHub https://github.com/kordk/stoch_epi_lib with demonstration data available from Synapse https://www.synapse.org/#!Synapse:syn22255244/files.
Acknowledgments
This project was initially conceived as an interdisciplinary project as part of the “Short Course in Systems Biology - a foundation for interdisciplinary careers” at the Center for Complex Biological Systems at the University of California Irvine held Jan. 21 - Feb. 8, 2019 in Irvine, CA (NIH GM126365). This work was supported by the National Cancer Institute at the National Institute of Health under Grant CA134900.
Footnotes
↵* brunner.james{at}mayo.edu