## Abstract

One of the most difficult and pressing problems in computational cell biology is the inference of gene regulatory network structure from transcriptomic data. Benchmarking network inference methods on model organism datasets has yielded mixed results, in which the methods sometimes perform reasonably well and other times fail to outperform random guessing. In this paper, we analyze the feasibility of network inference under different noise conditions using stochastic simulations. We show that gene regulatory interactions with extrinsic noise appear to be more amenable to inference than those with only intrinsic noise, especially when the extrinsic noise causes the system to switch between distinct expression states. Furthermore, we analyze the problem of false positives between genes that have no direct interaction but share a common upstream regulator, and explore a strategy for distinguishing between these false positives and true interactions based on noise profiles of mRNA expression levels. Lastly, we derive mathematical formulas for the mRNA noise levels and correlation using moment analysis techniques, and show how these levels change as the mean mRNA expression level changes.

## I. Introduction

In what is known as the “central dogma” of molecular biology, information encoded in DNA is transcribed into strands of messenger RNA (mRNA), which are then translated into proteins, which in turn carry out various functions within the cell. Sometimes the function of a protein from one gene involves regulating the expression of other genes. When a protein increases the expression of another gene, we refer to this as “activation.” When a protein decreases the expression of another gene, we refer to this as “repression.” Intricate networks of these positive and negative regulatory interactions between genes (called gene regulatory networks or GRNs) give rise to much of the complexity of life [17], [27]. Understanding the roles and functionality of GRNs is a pressing problem for cell biologists, especially since malfunctions of GRNs can seriously harm human health, contributing to diseases such as cancer [19].

In this paper, we present models of different noise conditions related to regulatory interactions between genes. In this context, *intrinsic noise* refers to the inherent stochasticity in the processes of transcription, translation, and the degradation of mRNA and proteins, and is especially prominent when these molecules are present at low copy numbers. By contrast, *extrinsic noise* refers to the impact of other factors, such as upstream regulators, external stimuli, or changes in cell state that affect the regulatory interaction [6], [13], [32], [36], [38].

A specific challenge for computational biologists studying GRNs is network inference, that is, the attempt to infer the structure of a GRN from gene expression data [29]. Although modern high-throughput next generation sequencing (NGS) experiments like RNA-seq have led to an abundance of gene expression data, network inference remains quite difficult. Part of the reason is that NGS transcriptomic experiments like RNA-seq involve destroying each cell to sequence its RNA content, so each cell provides only a single time point of data rather than a time series.

Most network inference methods attempt to infer regulatory interactions between genes based on statistical relationships between their expression levels (typically quantified by mRNA abundance). Some examples of these methods include correlation [45], linear or non-linear regression [11], [14], [37], information theory [4], [5], [7], [24], [25], Bayesian techniques [8], [44], and others [1]–[3], [15], [18], [43], [46].

An excellent introductory review of the topic of gene regulatory network inference can be found in Huynh-Thu and Sanguinetti 2018 [16]. Recent theoretical work on models of gene expression and regulation can be found in [22], [39]–[41].

## II. Efficacy of Network Inference Methods

Although the problem of gene regulatory network inference has been widely studied for more than a decade, there are still questions about the efficacy of these methods and whether network inference from transcriptomic data is a feasible goal. A key point of skepticism is that these methods typically assume that mRNA abundance measurements can be used as a reliable proxy for protein abundance. Typically, the protein (not the mRNA) produced by a gene is what regulates the expression of other genes, but it is often mRNA abundance data that we have access to, so most GRN inference methods take mRNA data as an input.

There are some reasons to question the assumption that mRNA abundance data can be used as a reliable proxy for protein abundance. For example, Mahajan et al. 2022 [21] show through theoretical analysis and stochastic simulations that, under conditions of only intrinsic noise, the correlation between mRNA abundance and protein abundance *even for the same gene* becomes quite weak if there is a large difference between the mRNA stability and protein stability. Additionally, Liu et al. 2016 [20] review the literature and report a similar finding: the correlation between mRNA levels and protein levels can be weak in some scenarios, and knowledge of mRNA transcript abundance alone is not always sufficient to predict protein abundance levels.

So how well do these network inference methods actually work? There have been several attempts to test the efficacy of network inference methods by benchmarking them on data from model organisms, such as *E. coli*, *S. cerevisiae*, and mice [23], [26], [28]. In these benchmarking studies, the underlying structure of the gene regulatory network is already known from experimental investigation, so the predictions of network inference methods can be checked against the correct answers. The most famous of these benchmarking attempts is Marbach et al. 2012 [23]. The results of this benchmarking study were mixed. When tested on a *S. cerevisiae* dataset, network predictions failed to substantially outperform the accuracy that would be expected by random guessing. However, when tested on *E. coli* data, the network predictions performed substantially better than random guessing.

So, our current understanding of the efficacy of gene regulatory network inference is quite murky and uncertain. It seems that network inference from transcriptomic data cannot be considered entirely feasible or unfeasible. Rather, it seems to be feasible under some conditions and unfeasible under other conditions. In this paper, we attempt to shed light on this topic, investigating through stochastic simulations which noise conditions may be more or less amenable to network inference from transcriptomic data.

## III. Standard Activation Model, no Extrinsic Noise

In this section, we briefly review and replicate results from Mahajan et al. 2022 [21], which studied the feasibility of network inference from mRNA abundance data under conditions of only intrinsic noise. We consider a system with two genes: Gene 1 and Gene 2. Gene 1 is transcribed into mRNA1, which is then translated into the Protein. The Protein then activates the transcription of Gene 2 into mRNA2. We refer to this as the Activation scenario. A diagram of this scenario is shown in Figure 1.

We define the integer-valued random processes *M*_{1}(*t*), *P*(*t*), and *M*_{2}(*t*) to track the counts of mRNA1, the Protein, and mRNA2 respectively. For the sake of simplicity, we will refer to these processes as *M*_{1}, *P*, and *M*_{2} from now on.

The stochastic model is described in Table I. The model consists of six events that occur probabilistically with rates given in the third column. When an event occurs, the counts for the variables are updated according to the reset map in the second column. Descriptions of the parameters, as well as the values we used in our simulations, are listed in Table II. With this setup, we can simulate the model using Gillespie’s stochastic simulation algorithm (SSA) [9].

We model the activation of *M*_{2} production by *P* with a Hill function of *P*, where *P* is the level of the Protein, *n* is the Hill coefficient (which determines how linear or nonlinear the activation is), *c* is a constant parameter that affects the saturation dynamics of the Hill function, and *k*_{2} is the maximum production rate. As *P* increases, the Hill fraction saturates and approaches 1, so the entire production term approaches *k*_{2}.
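For concreteness, one common Hill-type parameterization that matches this description is sketched below. The placement of *c* in the expression is an assumption made here for illustration; the rate actually used in the model is the one listed in Table I.

```latex
% Assumed form: k_2 times a fraction that rises from 0 toward 1 as P grows.
\[
  k_2 \, \frac{P^{n}}{c^{n} + P^{n}}
\]
```

In this assumed form, *c* is the Protein level at which production reaches half of *k*_{2}, and a larger *n* makes the transition from low to high production sharper.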

Part of the analysis in [21] involved calculating the correlation between the mRNA levels under different assumptions about the relative stability of the mRNA and protein. In our case, we are interested in the correlation between *M*_{1} and *M*_{2} under different assumptions about the relative stability of mRNA1 and the Protein. The stability of mRNA1 is the reciprocal of its degradation rate, and the stability of the Protein is likewise the reciprocal of its degradation rate. So, the ratio of Protein stability to mRNA1 stability can be expressed as the mRNA1 degradation rate divided by the Protein degradation rate. We refer to this stability ratio as *τ*.

A key finding in [21] was that in this model with only intrinsic noise, correlation between the mRNA levels is quite weak, and gets weaker as the ratio of stability between the protein and mRNA increases. Figure 2 shows our replication of this result: the correlation between *M*_{1} and *M*_{2} is quite weak, and drops to nearly 0 as the ratio of stability between the Protein and mRNA1 increases.
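A simulation of this kind can be sketched in a few lines of Python. The following is a minimal illustration of Gillespie's SSA for a six-event Activation model, not the code used for Figure 2. The parameter names (`k1`, `gm1`, `kp`, `gp`, `k2`, `gm2`), the values in the usage note, and the specific Hill form are assumptions, since Tables I and II are not reproduced here.

```python
import numpy as np

def hill(p, c, n):
    """Saturating Hill fraction in [0, 1); this specific form is an assumption."""
    pn = float(p) ** n
    return pn / (c ** n + pn)

def simulate_activation(k1, gm1, kp, gp, k2, gm2, c, n, t_end, dt, seed=0):
    """Gillespie SSA for mRNA1 -> Protein -> (Hill activation) -> mRNA2.

    Returns an array of shape (num_samples, 3) holding (M1, P, M2)
    sampled on a regular time grid with spacing dt.
    """
    rng = np.random.default_rng(seed)
    # State order: M1, P, M2. Row j is the count change caused by event j.
    changes = np.array([
        [1, 0, 0], [-1, 0, 0],   # mRNA1 production / degradation
        [0, 1, 0], [0, -1, 0],   # Protein production / degradation
        [0, 0, 1], [0, 0, -1],   # mRNA2 production / degradation
    ])
    state = np.zeros(3, dtype=int)
    sample_times = np.arange(0.0, t_end, dt)
    samples = np.empty((len(sample_times), 3), dtype=int)
    t, i = 0.0, 0
    while i < len(sample_times):
        m1, p, m2 = state
        rates = np.array([
            k1, gm1 * m1,
            kp * m1, gp * p,
            k2 * hill(p, c, n), gm2 * m2,
        ], dtype=float)
        total = rates.sum()
        # Waiting time to the next event is exponential with rate `total`.
        t_next = t + rng.exponential(1.0 / total)
        # The state is constant until t_next, so record it on the grid.
        while i < len(sample_times) and sample_times[i] < t_next:
            samples[i] = state
            i += 1
        # Pick which event fires, with probability proportional to its rate.
        event = rng.choice(6, p=rates / total)
        state = state + changes[event]
        t = t_next
    return samples
```

After discarding an initial burn-in, `np.corrcoef(samples[:, 0], samples[:, 2])[0, 1]` gives an estimate of the correlation between *M*_{1} and *M*_{2}. In this sketch the stability ratio *τ* corresponds to `gm1 / gp`, so sweeping `gp` while holding `gm1` fixed varies *τ*.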

This result seems to give a bleak outlook for the challenge of network inference. If there is weak or zero correlation between mRNA abundance for genes that regulate each other, how can we hope to infer gene regulatory network structure from transcriptomic data? However, as noted in the previous section, attempts to benchmark network inference methods on real data have yielded mixed results. In some cases, the network inference methods have performed reasonably well, and in other cases they have failed to outperform random guessing. In the next section, we will modify the model in a way that could explain this mixed feasibility of network inference.

## IV. Cascade Model With Extrinsic Noise

In the previous section, we analyzed a model that included only intrinsic noise in the processes of transcription and translation. However, a more realistic model of the biological system might include extrinsic noise, which could come from environmental stimuli, changes to the internal cell state, or regulation from another upstream gene. In this section, we introduce a new component to the model, which we refer to as the Extrinsic Factor, to represent the extrinsic noise source. We track the level of the Extrinsic Factor with the integer-valued random process *Z*(*t*). From now on we will refer to *Z*(*t*) as simply *Z* for the sake of simplicity.

In this model, Gene 1 is transcribed into mRNA1 (with counts tracked by *M*_{1}), which is then translated into the Protein (with counts tracked by *P*), which then activates the transcription of Gene 2 into mRNA2 (with counts tracked by *M*_{2}). However, unlike the model in the previous section, in this model the production of mRNA1 is activated by the Extrinsic Factor. We purposely define the Extrinsic Factor in vague biological terms, so that it can be thought of as an upstream transcription factor, external stimulus, or any other source of extrinsic noise affecting the transcription of mRNA1. We refer to this as the Cascade scenario. A diagram of this scenario is shown in Figure 3.

This stochastic model is described in Table III. Unlike the other variables, which update with increases and decreases of 1, the production of the Extrinsic Factor occurs with a burst of size *β*, which is drawn from a geometric distribution. With this setup, increasing the mean burst size while holding the mean of *Z* constant increases the noise level of *Z*. In this section, we will report results for mean burst sizes ranging from 1 to 20. In all of these cases, when the mean burst size is changed, the parameter *k*_{z} is updated so that the mean of *Z* over time is held constant. We model the degradation of the Extrinsic Factor with the parameter *γ*_{z}. In this model, the production of mRNA1 is activated by the Extrinsic Factor via a Hill function, with a maximum production limit of *k*_{1}. The Hill function parameters *c* and *n* are the same for both the activation of mRNA1 transcription and the activation of mRNA2 transcription. Other than these changes, we model the degradation of mRNA1, the production and degradation of the Protein, and the production and degradation of mRNA2 the same as in the previous section.
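The effect of burst size on the noise in *Z* can be illustrated with a short simulation of the Extrinsic Factor alone. This is a hedged sketch rather than the model in Table III: it assumes bursts arrive at rate `k_z`, that burst sizes are geometric on {1, 2, ...}, and that each molecule degrades at rate `gamma_z`, so the mean of *Z* is `k_z * mean_burst / gamma_z`. The function and argument names are illustrative.

```python
import numpy as np

def simulate_bursty_z(mean_z, mean_burst, gamma_z, t_end, dt, seed=0):
    """SSA for bursty production and first-order degradation of Z.

    k_z is rescaled so that k_z * mean_burst / gamma_z equals mean_z,
    which holds the mean of Z fixed while the burst size changes.
    """
    rng = np.random.default_rng(seed)
    k_z = mean_z * gamma_z / mean_burst
    sample_times = np.arange(0.0, t_end, dt)
    samples = np.empty(len(sample_times), dtype=int)
    z, t, i = 0, 0.0, 0
    while i < len(sample_times):
        total = k_z + gamma_z * z
        t_next = t + rng.exponential(1.0 / total)
        # Record the current count at every grid time before the next event.
        while i < len(sample_times) and sample_times[i] < t_next:
            samples[i] = z
            i += 1
        if rng.random() * total < k_z:
            # Burst: geometric on {1, 2, ...} with mean `mean_burst` (assumed support).
            z += rng.geometric(1.0 / mean_burst)
        else:
            z -= 1  # degradation of a single molecule
        t = t_next
    return samples

def cv(x):
    """Coefficient of variation: standard deviation divided by mean."""
    x = np.asarray(x, dtype=float)
    return x.std() / x.mean()
```

Comparing `cv(simulate_bursty_z(50, 1, 1.0, 500, 1.0)[50:])` with the same call using `mean_burst=20` shows the larger burst size producing a noticeably noisier *Z* at the same mean, which is the knob varied across Figures 4 to 6.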

Figures 4, 5, and 6 show three scenarios in which the mean of *Z* is held constant, but the mean burst size is varied. In Figure 4, the mean burst size is only 1, and in this simulation the correlation between *M*_{1} and *M*_{2} is 0.109. In Figure 5, the mean burst size is 10, and in this simulation the correlation between *M*_{1} and *M*_{2} is 0.334. In Figure 6, the mean burst size is 20, and in this simulation the correlation between *M*_{1} and *M*_{2} is 0.469.

Figure 7 shows the general relationship between the mean burst size and the correlation between *M*_{1} and *M*_{2}, and confirms what we could see visually in Figures 4, 5, and 6: higher mean burst size leads to higher levels of correlation between *M*_{1} and *M*_{2}. Figure 8 shows this correlation for different mean burst sizes and different Protein/mRNA1 stability ratios (*τ*). Here, correlation levels for the Cascade scenario are compared to the Activation scenario results from Figure 2. Under conditions of extrinsic noise with large bursts of Extrinsic Factor production, the correlation between *M*_{1} and *M*_{2} persists at higher stability ratios than in the scenario with no extrinsic noise, although it still becomes slightly weaker as *τ* increases.

In this section, we purposely defined the Extrinsic Factor in abstract terms without a definite biological meaning. However, it is interesting to consider possible biological implications of these results. Note that as we increase the burst size of Extrinsic Factor production, the model begins to resemble a system with distinct transcriptional states, rather than stochastic fluctuations around a single steady state, as in the Activation scenario with only intrinsic noise. For example, in Figure 6, the system could be thought of as representing the switching between two expression states (an ON state and an OFF state in this case). This phenomenon of transient switching between distinct gene expression states is thought to play a role in many biological systems, including drug resistance in cancer [30], [31], so it is interesting to note that GRN inference from mRNA abundance data may be more feasible under these conditions than under a single steady state condition.

## V. Distinguishing Between Cascade and Coregulation

Another key finding from Mahajan et al. 2022 [21] related to the difficulty of distinguishing between scenarios in which one gene regulates another and scenarios in which both genes are regulated by a common upstream regulator. Both of these scenarios can yield a correlation between the mRNA levels, so there is a possibility of a false positive network inference error in the latter scenario. In this section, we analyze a situation in which rather than Gene 1 regulating Gene 2, instead Gene 1 and Gene 2 are both regulated by the Extrinsic Factor. We will attempt to distinguish between this scenario and the previous scenarios in which Gene 1 directly regulated Gene 2.

In this new model, we continue using the Extrinsic Factor (tracked by *Z*) to model extrinsic noise, which can be thought of as an upstream regulator in this case. However, instead of the Extrinsic Factor activating Gene 1, which then activates Gene 2, in this model the Extrinsic Factor activates both Gene 1 and Gene 2 directly, with no direct regulation between Gene 1 and Gene 2. We refer to this as the Coregulation scenario. A diagram of this scenario is shown in Figure 9.

The stochastic model is described in Table V. We model the production and degradation of the Extrinsic Factor the same as in the last section, with production occurring in bursts of size *β*, drawn from a geometric distribution. As in the last section, the transcription of both mRNA1 and mRNA2 is modeled with Hill functions, and the Hill function parameters *c* and *n* are the same for both. However, unlike in the previous section, the transcription of mRNA2 is now activated by the Extrinsic Factor, not by the Protein. The Protein is left out of this model since we no longer need to track its abundance to model the transcription of mRNA2.

Figure 10 shows the correlation between *M*_{1} and *M*_{2} for different burst sizes in the Coregulation scenario, compared to the correlation in the Cascade scenario that we simulated in the previous section. As the figure shows, it is very difficult to distinguish between the Cascade scenario and the Coregulation scenario based on only the correlation between the two mRNA levels.

However, it may be possible to distinguish between these scenarios based on the noise profiles of *M*_{1} and *M*_{2}. Figure 11 shows the ratio of the *CV* of *M*_{1} to the *CV* of *M*_{2} in both scenarios, for burst sizes ranging from 2 to 20. *CV* here is the coefficient of variation, or the standard deviation of the sample divided by the mean of the sample. It appears that we can distinguish between the scenarios using this noise ratio, even though both scenarios have similar levels of correlation. In the Coregulation scenario, *M*_{1} and *M*_{2} have similar noise, so their *CV* ratio is close to 1. However, in the Cascade scenario, *M*_{1} has lower noise than *M*_{2}, leading to a lower *CV* ratio.
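Both summary statistics used here are simple to compute from sampled mRNA counts. The helper below is an illustrative sketch; the function name and the returned dictionary keys are assumptions, and no decision threshold for the *CV* ratio is implied, since none is specified in the text.

```python
import numpy as np

def noise_profile(m1, m2):
    """Return the Pearson correlation and the CV ratio CV(M1) / CV(M2)."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    # Coefficient of variation: sample standard deviation over sample mean.
    cv1 = m1.std(ddof=1) / m1.mean()
    cv2 = m2.std(ddof=1) / m2.mean()
    corr = np.corrcoef(m1, m2)[0, 1]
    return {"corr": corr, "cv_ratio": cv1 / cv2}
```

Because the *CV* is unaffected by rescaling a variable, two genes whose expression levels differ only by a constant factor yield a ratio of 1, which is the behavior described for the Coregulation scenario.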

## VI. Further Analysis

In the previous sections, we used stochastic simulations of different scenarios to study the correlation between *M*_{1} and *M*_{2} under different noise conditions. While stochastic simulations are a valuable tool for analysis, it can also be helpful to have a mathematical framework for analysis that does not rely on simulations. In this section, we analyze simplified linear models of the three previously described scenarios, and derive formulas for the coefficients of variation of *M*_{1} and *M*_{2}, and the correlation between *M*_{1} and *M*_{2} for each scenario.

### Activation

We begin with the first scenario in which Gene 1 regulates Gene 2, with no extrinsic noise. In our previous model for this scenario, we used a nonlinear Hill function to describe the activation of Gene 2 by Gene 1. However, in this section, we will make the simplifying assumption of a linear regulatory relationship between Gene 1 and Gene 2, in order to make the model more amenable to mathematical analysis. We also make the simplifying assumption that mRNA1, Protein, and mRNA2 all have the same degradation rate, which we call *γ*. After making these assumptions, the stochastic model for this scenario is described in Table VI.

Our eventual goal is to derive a formula for the correlation between *M*_{1} and *M*_{2} in this model, as well as formulas for the coefficients of variation for both *M*_{1} and *M*_{2}. In order to do this, we start by deriving the first and second order steady state moments for all of the variables. For the rest of this section, we will use angle brackets to signify expected value. For example, 〈*M*_{1}〉 will denote the expected value of *M*_{1} (also called the first order moment of *M*_{1}), and 〈*M*_{1}^{2}〉 will denote the expected value of *M*_{1}^{2} (also called the second order moment of *M*_{1}).

With this simplified linear Activation model, we can use standard moment analysis techniques [10], [12], [33]–[35], [42] to solve for the first and second order steady state moments of *M*_{1} and *M*_{2}:
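For a linear model such as this one, the steady state means and covariances can also be obtained numerically, which is a convenient cross-check on closed-form expressions. The sketch below is one way to do this and is not necessarily the derivation route taken in the cited references. It assumes mRNA1 is produced at constant rate `k1`, the Protein at rate `kp * M1`, mRNA2 at rate `k2 * P`, and all three species degrade at rate `gamma`; these parameter names are hypothetical because Table VI is not reproduced here.

```python
import numpy as np

def linear_activation_moments(k1, kp, k2, gamma):
    """Steady-state mean vector and covariance matrix for (M1, P, M2)."""
    # Mean dynamics: d<x>/dt = A <x> + b
    A = np.array([[-gamma, 0.0, 0.0],
                  [kp, -gamma, 0.0],
                  [0.0, k2, -gamma]])
    b = np.array([k1, 0.0, 0.0])
    mu = np.linalg.solve(A, -b)
    # Each species changes by +1 or -1 per event. At steady state its
    # production flux equals its degradation flux gamma * mu, so the
    # noise-input matrix is diagonal with entries 2 * gamma * mu.
    D = np.diag(2.0 * gamma * mu)
    # Covariance solves the Lyapunov equation A S + S A^T + D = 0,
    # written here as a linear system on the row-major flattening of S.
    I = np.eye(3)
    L = np.kron(A, I) + np.kron(I, A)
    sigma = np.linalg.solve(L, -D.flatten()).reshape(3, 3)
    return mu, sigma

def cv_and_correlation(mu, sigma):
    """CVs of M1 and M2 and their correlation from the moments."""
    cv_m1 = np.sqrt(sigma[0, 0]) / mu[0]
    cv_m2 = np.sqrt(sigma[2, 2]) / mu[2]
    corr = sigma[0, 2] / np.sqrt(sigma[0, 0] * sigma[2, 2])
    return cv_m1, cv_m2, corr
```

The second order moments follow from the covariance as 〈*M*_{1}^{2}〉 = `sigma[0, 0] + mu[0] ** 2`, and similarly for *M*_{2}. The same construction extends to the Cascade and Coregulation models by enlarging `A`, `b`, and `D` to include *Z*.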

The coefficient of variation for a random variable can be written in terms of its first and second order moments.

For example, the coefficient of variation for *M*_{1} can be written as:
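Using the definition of the coefficient of variation as the standard deviation divided by the mean, and the angle-bracket notation introduced above, this relationship can be sketched as follows. The label on the left-hand side is a notational assumption.

```latex
% CV of M_1 from its first and second order moments:
% variance = <M_1^2> - <M_1>^2, and CV = sqrt(variance) / mean.
\[
  CV_{M_1} \;=\; \frac{\sqrt{\langle M_1^{2} \rangle - \langle M_1 \rangle^{2}}}{\langle M_1 \rangle}
\]
```

The same expression with *M*_{2} in place of *M*_{1} gives the coefficient of variation for *M*_{2}.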

Since we already have expressions for the first and second order moments of *M*_{1} and *M*_{2}, we can write the coefficients of variation for these variables in terms of the model parameters:

We note that, based on Equation 7, the variation in *M*_{1} is Poissonian, since its variance equals its mean. We can also write the correlation between *M*_{1} and *M*_{2} in terms of the first and second order moments:
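For reference, the standard Pearson correlation expressed through moments is sketched below; it uses the cross moment 〈*M*_{1}*M*_{2}〉 alongside the first and second order moments of each variable. The symbol on the left-hand side is a notational assumption.

```latex
% Pearson correlation: covariance divided by the product of standard deviations.
\[
  \rho_{M_1 M_2} \;=\;
  \frac{\langle M_1 M_2 \rangle - \langle M_1 \rangle \langle M_2 \rangle}
       {\sqrt{\langle M_1^{2} \rangle - \langle M_1 \rangle^{2}}\;
        \sqrt{\langle M_2^{2} \rangle - \langle M_2 \rangle^{2}}}
\]
```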

We can write this in terms of the model parameters:

### Cascade

We use the same approach to derive formulas for these measures in the Cascade model, in which the Extrinsic Factor activates Gene 1, which then activates Gene 2. Again, we make the simplifying assumptions of linear activation rather than nonlinear activation via a Hill function, and that the mRNA1, Protein, and mRNA2 all have the same degradation rate, which we call *γ*. Also, we no longer model Extrinsic Factor production as a burst, so the *Z* production count update is now *Z → Z* + 1, not *Z → Z* + *β*. After making these assumptions, the stochastic model for the Cascade scenario is shown in Table VII.

We can use the same moment analysis techniques described in the previous section to write formulas for the coefficients of variation and correlation in the Cascade scenario. The coefficient of variation for *M*_{1} is:

The expressions for the coefficient of variation for *M*_{2} and the correlation between *M*_{1} and *M*_{2} are considerably larger.

### Coregulation

Once again, we use the same approach for the Coregulation scenario, in which the Extrinsic Factor activates both Gene 1 and Gene 2. We again make the simplifying assumptions that this is linear activation, rather than activation via a Hill function, and that the mRNA1, Protein, and mRNA2 all have the same degradation rate, which we call *γ*. Again, we no longer model *Z* production as a burst, so the *Z* production count update is now *Z → Z* + 1, not *Z → Z* + *β*. After making these assumptions, the stochastic model for the Coregulation scenario is shown in Table VIII.

We use the same moment analysis techniques as in the previous sections to write expressions for the coefficients of variation and correlation for *M*_{1} and *M*_{2}:

### Analytical Results

For this part of the analysis, we set the parameters so that 〈*M*_{1}〉 = 〈*M*_{2}〉, and so that this mean mRNA level is the same across all three scenarios. We then vary the mean mRNA level, and observe how the correlation and coefficients of variation change. In Figure 12, we show that as the mean mRNA level increases, the correlation between *M*_{1} and *M*_{2} increases in the Cascade and Coregulation scenarios, but decreases in the Activation scenario. In Figure 13, we show that as the mean mRNA expression level increases, the ratio of *M*_{1} noise to *M*_{2} noise holds steady at 1 in the Coregulation scenario, drops only very slightly before stabilizing in the Cascade scenario, and drops off quite steeply in the Activation scenario.

These results have some interesting biological implications. They seem to suggest that the difference in feasibility of network inference between the Activation and Cascade scenarios is more pronounced in high copy number situations than in low copy number situations, and that the Coregulation scenario also poses more of a false positive threat in high copy number situations. They also suggest that the differences in noise profiles between the direct regulation scenarios (Activation and Cascade) and the Coregulation scenario persist in high copy number situations.

## VII. Discussion

In this paper, we have analyzed the feasibility of network inference under different noise conditions through stochastic simulations, considering both intrinsic and extrinsic noise. We began by replicating a key result from Mahajan et al. 2022 [21] which suggests that under conditions of only intrinsic noise, the correlation between mRNA abundance levels for two genes in an activation relationship is quite weak, and gets weaker as the ratio of protein stability to mRNA stability increases. Under these conditions of only intrinsic noise, network inference from transcriptomic data would be very difficult.

Next, we investigated a scenario in which an extrinsic noise source activates the expression of a gene, which then activates the expression of another gene. Under these conditions, we found that the correlation between mRNA abundance levels for the two genes gets stronger as the extrinsic noise begins to resemble a state variable. We also found that the correlation persists (although it becomes weaker) even as the ratio of protein stability to mRNA stability increases. A biological takeaway from this result is that if a cell has distinct transient expression states, resulting from external factors or internal regulatory network dynamics, then under those conditions the task of network inference from transcriptomic data seems more tractable than under conditions of intrinsic noise only. This result is notable because transient state-switching between different gene expression states is thought to play a role in various biological phenomena, including drug resistance in cancer [30], [31].

We then considered a scenario in which two genes are coregulated by a common upstream gene. Under these conditions, we still observe a correlation between the mRNA abundance levels of the two genes, potentially leading to a false positive error in the task of network inference. However, simulation results suggest that even though this scenario yields similar levels of correlation to the Cascade scenario, we may be able to distinguish between the two scenarios using the noise profiles of the mRNA levels.

Finally, we provided a mathematical framework for further analysis of simplified linear models of each of the noise scenarios. We used moment analysis techniques to derive expressions for the coefficients of variation of the mRNA levels, as well as the correlation between them, and explored how these measures change with changes in mean mRNA levels. This allows us to make predictions about the difference between the feasibility of network inference in low copy number and high copy number situations, for each of the three noise scenarios.

Future work will include further theoretical analysis of these models. Additionally, goals for future work include testing the predictions made in this paper on real biological datasets, using data from single cell RNA-seq experiments, as well as further benchmarking of network inference methods on data from model organisms and synthetic gene regulatory circuits.

## VIII. Competing Interests

No competing interest is declared.

## IX. Author Contributions Statement

M.S. and A.S. contributed to the analysis, writing, and editing of this paper.

## X. Acknowledgments

This work is supported by grants from the Army Research Office (W911NF1910243) and the National Science Foundation (ECCS-1711548).