Summary
We propose a hierarchical Bayesian approach to infer the RNA synthesis, processing, and degradation rates from sequencing data. We parametrise kinetic rates with novel functional forms and estimate the parameters through a Dirichlet process defined at a low level of hierarchy. Despite the complexity of this approach, we manage to perform inference, clusterisation and model selection simultaneously. We apply our method to investigate transcriptional and post-transcriptional responses of murine fibroblasts to the activation of proto-oncogene MYC. We uncover a widespread choral regulation of the three rates, which was not previously observed in this biological system.
1. Introduction
RNA is one of the most important actors in the context of cellular biology and it is involved, directly or indirectly, in any process that occurs inside a cell. This molecule is a cornerstone of the information-flow which subsists from DNA to proteins, due to both its role as a template for protein assembly and because of the involvement of non-coding RNAs in the regulation of gene expression (e.g. modulation of transcript stability, protein synthesis and protein localisation) (Marchese et al., 2017; Slack and Chinnaiyan, 2019; Vandevenne et al., 2019). A cell constantly regulates the expression levels of thousands of genes, i.e. the number of associated transcripts, in order to preserve its homoeostasis and adapt to the environment. This regulation is mediated by the choral action of several biological processes that affect the life cycle of the RNA molecules produced by these genes.
The RNA life-cycle in eukaryotic cells can be simplified the following three steps: the synthesis of premature RNA molecules in the nucleus, their processing into mature transcripts (which includes exporting the cytosol), and mature RNA cytoplasmic degradation. The characterisation of these mechanisms, which the cell exploits to modulate the amount of specific transcripts, according to internal and external stimuli, can provide exceptional insights into the biology of these responses. A RNA life-cycle investigation requires the experimental quantification of the gene expression levels. The state-of-the-art approach to perform this task is Next Generation RNA sequencing (RNA-Seq). In a typical experiment, this technique simultaneously provides the average expression level per cell for thousands of genes, at a low cost and with a limited experimental effort (Goodwin et al., 2016). For these reasons, a remarkable number of public RNA-Seq datasets are currently available, and easily accessible, through such open repositories as the Gene Expression Omnibus project (Edgar et al., 2002).
A large amount of literature is available on the analysis of RNA-Seq data. In some papers, mixture models are used on the observed data to identify differences in gene expression levels, see, for example, “RNA-Seq by expectation-maximization” (Li and Dewey, 2011), “Cufflinks” (Trapnell et al., 2013), “Casper” (Rossell et al., 2014), the new approach developed by Papastamoulis and Rattray (2018), and the works of de Souto et al. (2008), Oyelade et al. (2016), or Saelens et al. (2018). Some of these tools have also been used for the identification of genes which are differentially expressed under multiple experimental conditions, this being one of the most common practices in the field (Trapnell et al., 2013; Papastamoulis and Rattray, 2018).
However, the mere quantification of expression levels is not enough to acquire a full picture of the RNA life cycle, and the study of this datum alone could lead to misleading conclusions. Indeed, a cell can regulate gene expression through different fundamental processes. For instance, an expressed gene is usually assumed to be actively transcribed in the biological condition under analysis, but this is not alway the case for very stable transcripts that remain in the cell long after their synthesis. Moreover, an increase in the expression level of a gene between two conditions is usually interpreted as an intensification of its transcription, although the same observation could be due to modulation of the transcripts stability.
The mathematical modelling of the RNA life cycle can help to deconvolve the experimental data and characterise the different stages of theRNA metabolism. This can be formalised in terms of a network of chemical reactions (see for example de Pretis et al., 2015; Feinberg, 2019; Anderson and Kurtz, 2015), and can be modelled either deterministically or stochastically. The former approach is usually preferred, since it is compatible with standard RNA-Seq datasets originating from cell populations and because it leads to a system of linear ordinary differential equations (ODEs) whose time-dependent coefficients, the so-called kinetic rates (KRs), can be interpreted as the instantaneous rates at which the fundamental synthesis, processing and degradation mechanisms occur.
In the last few years, several tools have been proposed to infer KRs from experimental data, and of these, DRiLL (Rabani et al., 2014) and INSPEcT (de Pretis et al., 2015; Furlan et al., 2020) provide a characterisation of all the crucial steps of the RNA life cycle from sequencing data. These tools are based on a least-squared estimation, and each gene is assumed to be independent of the others.
Motivated by a real data application, we here propose a novel approach to the inference of RNA life-cycle KRs from gene expression levels, cast in a Bayesian framework, and characterised by the application of mixture models defined using the Dirichlet process (DP) (Ferguson, 1973) on rate parameters. KRs are not directly observable, and the data-level of the mixture models are therefor latent quantities. This introduces difficulties in the model estimation, and we need to introduce a new parametrisation of the KR functions and to impose suitable identifiability constraints to overcome such difficulties. At the same time we tailor an MCMC algorithm for the mixture models, based only on Gibbs steps (Casella and George, 1992), to avoid an increase in computational burden. This is made possible thanks to the adoption of a suitably modified likelihood function. Unlike other proposals, we estimate likelihood parameters, data standard deviations, KR functions, and latent clusterisations in a single Bayesian model, thereby allowing a coherent evaluation of the uncertainty. Moreover, by providing a clusterisation without the need of any post-processing, our proposal is able to gather genes in homogeneous groups and to extract and exploit the shared information. The inclusion of this clustering step in the inference procedure results in the estimation of parameters, at the gene level, even though if the number of experimental observations is limited; a standard situation in biology. Our new method is particularly suitable for investigating the common biological scenario, in which the cell synergically regulates the expression of groups of genes to respond to a modification of its environmental conditions (Allocco et al., 2004).
We apply the model to our motivating example, where data are collected to study the activation of the proto-oncogene MYC in murine fibroblasts (de Pretis et al., 2017). Moreover, we demonstrate the inferential gain provided by our approach, which results in the detection of smaller, but significant, modulations of post-transcriptional rates. From an interpretative point of view, the classification identifies groups of genes modulated by MYC activation, both directly and indirectly, and characterised by specific features, such as the steady-state value of the rate or the timing of its modulation in response to the stimulus. Since MYC is a transcription factor, the synthesis rate is the most informative layer of regulation and provides clusters of genes involved in basic cellular processes, cancer-related processes, and in the RNA metabolism. Nevertheless, we manage also to identify pervasive modulations of post-transcriptional rates, most likely due to either secondary regulations or the adaptation of the entire RNA life-cycle kinetics in response to MYC induced transcriptional stress.
The paper is organised as follows. We start by describing the experiment performed by de Pretis and colleagues to study MYC activation, and the resulting dataset (Section 2). We then present the mathematical model we use to describe the RNA life cycle (Section 3) and the function we developed to parametrise the KRs (Section 4); we also discuss the solutions to some identifiability issues. We proceed by formalising the latent clustering models and their practical application to study MYC activation (Section 5). The final section of the paper (Section 6) regards a comparison of our novel Bayesian approach with other methods, and a discussion of our results. We conclude with a critical summary of our work and some perspectives (Section 7). The online supplemental material (SM), available on the web page of the journal, contains additional figures that may be useful to discuss the results but which are not essential for the comprehension of the paper.
2. Data description
Our dataset, taken from de Pretis et al. (2015), is organised as illustrated in Figure 1 A. It provides expression levels (in Reads Per Kilobase Million, RPKM) of premature, mature, and nascent RNA for more than 10.000 genes, at 11 time-points, and for three replicas of the experiment. The experiment is designed to follow the activation of the transcription factor MYC in a murine fibroblast cell-line (3T9) over time. This transcription factor plays a crucial role in the genesis and progression of tumours, and it is involved to a great extent in the regulation of such basal cellular processes as differentiation, growth and proliferation (Dang, 2012; Chen et al., 2018).
The experiment starts with a population of cells, which is divided into multiple samples, in a stationary biological environment. Each sample is treated to induce MYC activation and, after a different span of time, it is sequenced to quantify RNA expression levels. MYC activation is achieved through the expression of an artificial chimaera (Littlewood et al., 1995). This protein is natively inactive, i.e. it is unable to perform any function, but it can be rapidly activated by adding the 4-hydroxytamoxifen (OHT) hormone to the cell culture medium. The authors performed standard (ribo-depleted) RNA-Seq, following MYC activation, through 11 time-points from an OHT treatment: 0h, 1/6h, 2/6h, 1/2h, 1h, 3/2h, 2h, 4h, 8h, 12h, 16h (Figure 1 A). Each experiment, which was performed on independent samples, was replicated three times, and gave expression levels of premature and mature RNA (Figure 1 B).
The same experimental design was used to quantify nascent RNA through 4sU-Seq (Figure 1 C). In this case, an exogenous nucleotide (4-thiouridine or 4sU) is provided to the cells before sequencing for a fixed span of time (labelling time). 4sU is incorporated in the transcripts produced from that moment for the entire labelling time (nascent RNA) and is later exploited to physically separate them from the other RNA molecules (pre-existing RNA). This portion of the transcriptome can be sequenced through standard RNA-Seq (Dölken et al., 2008).
de Pretis et al. (2015) focused on a set of 4909 transcriptional units, classified as MYC targets through a chromatin immunoprecipitation sequencing experiment, and altered in their kinetics. We decided to restrict our study to the same group of genes so that our results could be compared with the aforementioned authors’ result, which are obtained with the INSPEcT tool. However, it was not possible to analyse 12 transcriptional units, because they had negative expression levels. At the end, we retrieved a dataset of premature, mature and nascent RNA expression levels for 4897 genes in 3 replicates and for 11 time points.
Figure 1 D reports two examples from the dataset: the first one represents a typical transcriptional regulation, as can be seen from the adherence of the three profiles, while the second one is probably associated with a more complex post-transcriptional scenario.
3. The likelihood
For each gene g ∈ {1, …, G}, time-point t ∈ {t1, …, tT} and replica h ∈ {1, …, H}. Let Yg,h(t) = (Y1,g,h(t), Y2,g,h(t), Y3,g,h(t))′ denote the measured expression levels of premature, mature and nascent RNA.
We assume that the observations Yg,h(t) are noisy versions of the true unobserved values denoted with xg(t) = (pg(t), mg(t), ng(t))′. Since Yg,h(t) needs to have positive components, we model it as follows: where N>0(μ, Σ) indicates a (truncated) normal distribution with mean vector μ, covariance matrix Σ, and with the components restricted to ℝ+. In our case, the covariance matrix is diagonal, with diagonal elements collected in the vector τ g(t) ∈ (ℝ+)3. The vector ρh(t) = (1, 1, ρh(t)) is a scaling factor that is required to normalise the nascent RNA libraries to the pre-existing RNA counterparts (Miller et al., 2011; Rabani et al., 2011, 2014; de Pretis et al., 2015)
As shown in previous works, see, for example, Pavelka et al. (2004) and Subramaniam and Hsiao (2012), xg(t) affects both the mean and the variance of Yg,h(t). The effect of xg(t) on the variance of Yg,h(t) is modelled through a linear relation between the logarithms of their components:
The subject of the next subsection is the mathematical model of the latent gene expression levels xg(t).
3.1. A mathematical model of the RNA life cycle
At the current state-of-the-art the life cycle of RNA molecules, in eukaryotic cells (e.g. mammals and plants), is divided into three sub-processes (Figure 1 B). The first is the synthesis of premature RNA from DNA. This portion of the transcriptome is located inside the nucleus and it is not ready to perform its original task (e.g. protein translation). Premature RNA requires structural modifications and/or exporting to the cytosol. These steps constitute the second stage of the RNA life cycle, which is named processing. The product of premature RNA processing is mature RNA, which can eventually be degraded by the cell that concludes the RNA life cycle. The process may be described by the following network of chemical reactions where Pg and Mg denote premature and mature RNA for gene g, respectively. The empty-set symbols are used to emphasise that premature RNA is synthesised from DNA without consuming resources, and mature RNA is subject to degradation. The symbols k1,g(t), k2,g(t) and k3,g(t) are the KRs of the synthesis, processing and degradation respectively; they are both time and gene dependent. A system of ODEs that translates the reaction network (2) in mathematical terms is the following
Indeed, the effect of the processing of premature into mature RNA at rate k2,g(t) is to decrease pg(t) and correspondingly increase mg(t). The degradation of mature RNA decreases mg(t) at rate k3,g(t), while the synthesis increases pg(t) at rate k1,g(t).
It is well known that, for the model described so far, it is very difficult to identify all three KRs. Anotherr variable, the so-called nascent RNA, is usually included to ameliorate the identifiability,(Dölken et al., 2008; Miller et al., 2011; Rabani et al., 2011, 2014; de Pretis et al., 2015). Nascent RNA is the amount of total RNA (both premature and mature) synthesised by the cell in a short span of time and it can be quantified by 4sU-Seq (Figure 1 C). Nascent RNA is, by definition and a ccording to the experimental set-up, absent at the beginning of the experiment. It is produced during the brief labelling time (tL), according to the same dynamics as the pre-existing counterpart. However, the effect of degradation, in such a short time, can be neglected. The expression level of the premature and mature nascent RNA is therefor ruled by the following equations
The sum is the nascent RNA level. By summing the two previous equations, one obtains that the amount ng(t) of nascent RNA only varies according to the effect of the synthesis rate:
Since the time window for which nascent RNA evolves is short (tL), this rate can be considered approximately constant and equation (4) can be integrated to obtain: which is the third equation that is added to model (3). Equation (5) facilitates the estimate of the synthesis rate.
4. KRs parametrisation
In several biological experiments, a cell culture is perturbed and gene expression levels are repeatedly measured after the perturbation in order to understand which genes are involved in the response. By adopting the model described above, it is possible to shed light on the fundamental mechanisms that a cell uses to regulate gene expression levels by modulating the of synthesis, processing, and degradation rates.
The typical shapes that we expect the rates to take on in response to a certain perturbation of the environment are generally assumed to be constant (some rates are not altered), monotonic (both increasing and decreasing), and peak-like functions. These shapes account for both permanent and transient modulations of the KRs and have already been applied successfully to describe transcriptional and post-transcriptional responses in several biological systems (see, for example Chechik and Koller, 2009; Rabani et al., 2011, 2014; de Pretis et al., 2015).
The first novelty of our proposal is that we introduce a unique parametric family of functions which, for different values of the parameters, can cover all such characteristic shapes. Others approaches use different functional forms and, for each gene, select the best one with external criteria, e.g. the log-likelihood ratio test.
Let ϕ(·| μ, σ2) be a Gaussian density with mean μ ∈ ℝ and variance σ2 ∈ ℝ+. We define the family of functions f (t| μ, σ2, κ−∞, κμ, κ+∞) to which all the KRs belong in the following way: where κ−∞, κμ and κ+∞ all belong to ℝ+. The function f in equation (6), is obtained by applying different scalings and vertical translations of a Gaussian density to its right and left halves, with respect to the mean value μ, taking care to preserve continuity at time-point t = μ (Figure 2). It is easy to see that:
Examples of the forms that can be obtained with (6), by changing its parameters, are shown in Table 1. As we can be seen, all the standard shapes (constant, increasing/decreasing, peak-like) are possible.
For easiness of interpretation, we split and rename the parameters as follows. First, we single out κ−∞ and we rename it β to simplify the notation. Unlike the other parameters, which are related to the response, β is the baseline level, i.e. steady-state, and it is analysed separately. Secondly, we introduce the logarithmic ratios
These quantities are called log-fold-changes in computational biology and are usually used to measure modulations with respect to the baseline level f (t| μ, σ2, κ−∞, κμ, κ+∞). We define
The parameters μ, σ2, η1, η2 are all related to the characterisation of the response to perturbations. In particular μ and σ2 characterise the temporal location and duration of the response, while η1 and η2 determine the typical shape, as highlighted in Table 1. We collect these four parameters into a single vector that we denote with θ, to obtain a more compact notation.
The family of functions (7) can now be re-parametrised as f (t | β, θ) with
Identification constraints
Although the family of functions (7) is well defined for all real values of μ and positive values of σ2, identifiability and interpretability issues can arise if some conditions are not met. For example, if μ is smaller than the first observed time t1, and σ2 is small, the function f in the interval [t1, tT], with any arbitrary choice of η1 and η2, is indistinguishable from a constant one, which should instead be given by η1 = η2 = 0 (Table 1). For this reason, identifiability constraints are needed.
A main requirement is that the value of the function (6) at time-points t1 and tT should be “close” to κ−∞ and κ∞, respectively, which means that the most relevant part of the function graph lies within the observed time window. For peak-like shapes (η1η2 > 0, cf. Table 1) we ask that which implies two conditions:
Notice that, one of the log-fold-changes vanishes for monotonic shapes, however, we can still derive an identifiability condition by requiring that (9) holds, e.g. when η1 = 0, we have a c-up or c-down shape, and κμ = κ−∞. Since the function is constant from t1 to μ, the first equation of (9) holds if, and only if, t1 ≤ μ. The conditions we obtain for c-up and c-down shapes (η1 = 0) are therefor
Similarly, for up-c and down-c shapes (η2 = 0), we obtain
We should take into consideration that monotonic up-up and down-down are intermediate states between, respectively, a c-up and an up-c, and a c-down and a down-c, respectively. The limits on μ should be close to (11), if η1 < η2, and close to (12), if the opposite is true. Therefore, we define the constraints on μ as: where and ξT = 1 − ξ1
The subset of the parameter space where the identifiability constraints hold is denoted with , and is defined by the condition β > 0 and one of equations (10), (11), (12) or (13), depending on the signs of η 1 and η 2, which are instead real numbers that are free from any constraint. Although identifiability is only granted in , we prefer not to force the parameter to belong to this set for easiness of implementation, but, as we explain in the next section, we introduce an approximated likelihood that gives very little support to parameter values outside .
5. The latent clustering models
As mentioned in Section 4, βj,g is interpreted as the baseline level of the rate kj,g(t), while θj,g = (μj,g, σj,g, η1,j,g, η2,j,g) characterises the modulation of the rate in response to a perturbation of the environment. It is biologically reasonable to allow groups of genes that have a similar baseline level to respond differently to a perturbation. For this reason, we introduce mixture models, based on the DP, at the bottom level of the model hierarchy, for the parameters βj,g and θj,g, for each j ∈ {1, 2, 3}.
Unlike most of the DP applications, where the mixture is at the first level of the hierarchy i.e. on the observed data level, we introduce mixture models over the parameters of the non-observable KRs which define the time-varying coefficients of an ODE system. For this reason, it is necessary to ensure that the sampling of the DP is easy, with as many Gibbs updates as possible for the mixture parameters.
Moreover, we also want to ensure that the model can discriminate between the possible shapes of kj,g(t). It is not trivial to define a distribution over the domain , i.e. the space where the parameters (βj,g, θ j,g) are identifiable. What we propose here is to let (βj,g, θj,g) be defined over the whole ℝ5 but, each time they are outside , the posterior distribution must have a very small density, which means that parameters outside are almost never sampled in the MCMC algorithm. We do this by changing the data likelihood (equation (1)) with the following one: which gives an almost null posterior support to all the parameter values outside .
A second issue that has to be solved is that if the marginal prior distribution of the log-fold-changes η1,g,j and η2,g,j is continuous, then the posterior probability that at least one of the two is exactly equal to 0 vanishes, which means that we are not able to estimate a constant kj,g(t) (or c-up, c-down, up-c, and down-c shape). One possible solution is to use a distribution that is continuous over (−∞, 0) (0, ∞) and has a point mass at 0. We can do this by introducing the new variables and , related to η 1,g,j and η 2,g,j through the following:
If we assume a continuous distribution for , as a result of (14), the distribution over η1,g,j is continuous over (−∞, 0) (0,∞) and has a point mass at 0 equal to the cumulative distribution of between −ξ and ξ. A similar result holds for η2,g,j.
We can now work with the parameters βj,g and , which are all defined over ℝ, and we obtain continuous distributions. We then define 6 mixture models, based on Gaussian densities, 3 of them over β1,g, β2,g and β3,g and the others over and . In other words, the models are DP Gaussian mixtures (Neal, 2000) formalised as: and
In models (15) and (16), and are the discrete random variables that represent the labels which identify the component of the mixture to which the parameters belong. These variables are assumed to come from a discrete distribution, whose probabilities follow a Dirichlet Processes defined by the GEM (or stick-breaking) distribution (Gnedin et al., 2001). Given that the allocation variable, λj,g and are normally distributed with parameters that have standard priors. All random quantities in model (15) can easily be updated in the MCMC algorithm using only Gibbs steps, thereby facilitating the implementation of the model.
6. Real data application
Before the discussion of the results obtained from the motivating dataset, we first show how our model performs with respect to some competitive approaches.
6.1. Multiple inference method comparison
We compare the performance of our model (M1) with a simplified version of our proposal, which assumes for all j and g (i.e. no mixture models, a single distribution for each parameter - M2), and the frequentist approach implemented in INSPEcT (M3). Comparisons are conducted in terms of predictive power and interpretation.
The model is implemented in R/C++ and uses OpenMP (OpenMP Architecture Review Board, 2008) for parallel computing. Our method is estimated using 32 cores, 2500000 iterations, burnin 2425000, thin 30, therefor with 2500 posterior samples, with ξ = 10; the computations take 20 days. We set M= 0, V= 100I, M0 = 0, V0 = 100, Ψ= I, ν = 5 and ν0 = 2. We use a N (0, 100) for βj,0, βj,1 as prior distribution, while each SFh(t) has a G(1, 1). Since the number of latent mixtures in the DP depends on parameter , we view it as a random parameter with G(1, 1) prior. We use the same priors and iterations for M2, while, for M3, we retrieved the modelled KRs for the 4897 genes of interest from the INSPEcT object, released as supplementary material by de Pretis et al. (2017).
First, we want to assess a goodness-of-fit for the three models for each Yj, to evaluate wheter our proposal can be considered better than the others. For this purpose, we use the continuous ranked probability score (CRPS) (Gneiting and Raftery, 2007), since this index is suitable for comparing the predictive ability of models based on different data distributional assumptions. We compute this index for each modelling approach and experimental condition (i.e. RNA species, replicate, and time-point) and, in Table 2, we report the fraction of the times the first model has a lower CRPS index than the second one, for any pair-wise comparison. The results show that M1 is better able to recapitulate the data.
We extend the analysis by computing pairwise Pearson’s correlations between the CRPS indexes estimated for premature, mature and nascent RNA, for each model. The correlations are between 0.34 and 0.62, and interestingly, they are always higher for M1 and M2, compared to M3 (0.52, 0.52 and 0.38 on average, respectively, Figure 28-SM). It follows that the Bayesian approaches tend to fit all the RNA species profiles with similar goodness while the frequentist one, generally, recapitulates the profile of one experimental quantity better than the others. This is not desirable in the current application scenario, since any RNA species is potentially equally informative about the underlying regulations of the RNA metabolism.
The fraction of rates modeled as constant, monotonic, or peak-like functions are reported In Table 3, for each inference method. In the case of modulation, we also distinguish between increasing (+) or decreasing (−) kinetics (for peak-like responses, η1 > 0 and η2 > 0 or η1 < 0 and η2 < 0, respectively). MYC is a transcription factor, and we can therefore expect a primary response at the synthesis level. This is in particular the case for M1 and M3, which have less than 2% of the genes constant in synthesis. The percentage is higher for M2 (10%). Nevertheless, k1 is still by far the most variable rate.
Focusing on the genes modulated in synthesis, we can observe that all the inference methods predict a higher fraction of repressed genes (monotonic- or peak-) than the induced counterpart (0.49−0.63 versus 0.36−0.42). However, the complexity of the resulting models is different, with 49% of the modulations in synthesis described by M3 as peak-like functions against 20% and 26% for M1 and M2, respectively.
The higher complexity of the synthesis rate profiles predicted by M3, is accompanied by a lower fraction of post-transcriptional modulations. M3 in fact predicts higher percentages of constant k2 and k3 (71% and 80%, respectively) than the Bayesian approaches (between 46% and 70%). This means that M1 and M2 tend to explain the expression more as a choral action of the three kinetic rates than M3 (Figure 3 and 29-SM).
Despite the prevalent role of synthesis in shaping the MYC response, indirect and less intense post-transcriptional regulations are expected due to, for example, the known feed-backs that link the three layers of the RNA life cycle (Sun et al., 2012; Eser et al., 2014; McManus et al., 2015).
In light of the higher goodness of fit of M1 than M3, we can conclude that INSPEcT fails to capture the more complex regulatory scenarios inferred by our novel Bayesian method.
6.2. MYC response analysis
In this section, we analyse the model inferred by means of our approach in response to MYC activation in more detail. For simplicity, we compute the MAP clusterisation for each gene and variable, and we only discuss clusters that have at least 150 associated genes. A heatmap and a set of boxplots that recapitulate the temporal behaviour of the synthesis rate of log-fold-changes, compared to the initial time-point, are reported in Figures 4, for each cluster, ordered by the number of elements. We also display the μ versus σ2 and η1 versus η2 plots. Both of these graphs provide valuable information about the shape of the responses in the cluster of interest (Table 1). Figures 6 and 7 are analogues of processing and degradation rates, respectively.
In order to support the discussion, we perform enrichment analyses, based on Gene Ontology (GO) annotations (Ashburner et al., 2000; Dessimoz and Škunca, 2017; The Gene Ontology Consortium, 2019), for each rate. For the sake of simplicity, the results of these analyses are mainly shown in the supplemental material. However, they are summarized in the main text, while Figure 5, which is reported as an example of a full output, refers to a specific case which has been selected both because of its interest and its graphical clarity. The GO enrichment analysis exploits an ontology which can be defined as a set of terms which, in our case, are pertinent to biological processes (e.g. “regulation of mRNA stability”, “cellular response to stress” etc.), associated by means of relations (grey edges in Figure 5 - e.g. “is a”, “regulates” etc.). Each term is also matched with a set of relevant genes by means of a curated annotation, which is constantly updated according to the literature, in order to reflect the knowledge of the scientific community on the biological domain. Given a set of genes, it is possible to search for those terms that are over-represented in the group of interest, compared to a background (a number of associated genes, that is proportional to the node size in Figure 5), that is a larger set which potentially accounts for all the annotated transcriptional units. A hypergeometric test is usually performed for any possible term to statistically test the enrichments, and a threshold is then imposed on the corrected p-values (node colour in Figure 5), or q-values, selecting the most significant results. These terms (labels in Figure 5) provide a broad overview of the biological processes involved with the selected genes. We perform these analyses using the Bioconductor R-package clusterProfiler (Yu et al., 2012).
Synthesis rate
The graphical description of the clusters associated with the synthesis rate is shown in Figure 4.
Cluster 1 accounts for 1293 genes characterised by peak-type functions (η1 and η2 < 0) with the minimum between 1 and 3 hours and a small variance, which indicates quick responses. This behaviour can be explained as a side effect of MYC activation, which causes the polarisation of the resources required for the proficient transcription of the induced target genes (discussed in the following clusters). This response is compensated in the recovery of the transcriptional activity of these genes which then results, for a subset of such genes, in a light induction (η1 > η2). This indicates that these transcriptional units are required by the cell for its homoeostasis.
The enrichment analysis points to such as basic processes as DNA metabolism and cell-cycle regulation, which are in line with the expectations, because they are known to be perturbed in an MYC-dependent manner (Dang, 2012; Chen et al., 2018); see Figure 1-SM.
Cluster 2 accounts for 1103 genes, characterised by a monotonic decrease of the synthesis rate (η1 < 0 and η2 > 0). The temporal response is more heterogeneous with μ and σ2 spanning large domains. This cluster contains the genes that are involved in cell growth and development, cell adhesion, migration, and apoptotic signalling regulation (Figure 2-SM). All these terms are interesting clues that point to the role played by MYC in cancer biology (Dang, 2012; Chen et al., 2018).
Cluster 3, which accounts for 904 genes, is composed of monotonic increasing responses (η1 > 0 and η2 < 0) concentrated in the first 5 hours of the time-course. This behaviour indicates a potential direct MYC regulation. These genes are involved in coding and non-coding RNA metabolisms (Figure 5). They affect RNA localisation, e.g. exporting from the nucleus, RNA splicing and non-coding RNA processing, and also RNA stability, e.g. the RNA catabolic process. Moreover, these genes are related to RNA modification, an emerging dynamic regulatory layer of the transcript metabolism (Roundtree et al., 2017). For example, N 6 methyladenosine is a methylated nucleotide (methylation, RNA methylation, macromolecule methylation) which is pervasive in the transcriptome of various species, e.g. mice, with a well established role in the control of transcript stability (Wang et al., 2014) and translation (Wang et al., 2015). Increasing evidence also links this RNA modification to synthesis and splicing (see for example Louloupi et al., 2018; Furlan et al., 2019).
The analysis of these terms provides a coherent picture that relates MYC activation to several regulation layers of the transcript metabolism and translation (e.g., the cellular amino acid metabolic process, regulation of translation). This evidence supports the post-transcriptional rate modulations predicted by our approach and, partially, by INSPEcT.
Cluster 4 accounts for 392 monotonic + and − (η1 ≈ 0 and η2 < 0 or η1 < 0 and η2 ≈ 0, respectively) late responding genes (μ generally larger than 4 and up to 14). These transcriptional regulations are probably secondary responses and are clustered together due to their temporal features.
Clusters 5 and 6 are composed of 299 and 246 monotonically induced genes (η1 ≈ 0) in response to MYC activation. These two clusters were split, due to their temporal responses, which are faster and more homogeneous in Cluster 5 than in Cluster 6. The latter is partially related to non-coding RNA processing (see Figure 4-SM).
Finally, Cluster 7 is composed of 182 genes characterised by a weak peak-like induction (η1 and η2 > 0), that is earlier in the time-course and quicker than those belonging to Cluster 1, as can be seen from the values of μ and σ2 respectively.
Our model also provides a clustering of genes according to their steady-state values of the synthesis rate (β1). The model identifies 7 c lusters, e numerated w ith a progression of letters, with more than 150 genes. The Gene Ontology enrichment analysis returned more confused and less significant t erms t han t hose d iscussed a bove. C luster B i s partially related to cell-cycle regulation, while Cluster E and Cluster F are related to RNA metabolism (Figures 15-SM, 18-SM and 19-SM). Interestingly, a remarkable percentage (≈30%) of the genes in Cluster E and Cluster F is also part of Cluster 3 (Figure 26-SM). Noticeably, these are also the clusters that are characterised by the fastest kinetics, which is a required condition, even though not sufficient, to quickly regulate the expression level of a gene, and is a clue of the fundamental regulatory role played by these transcriptional units (Figures 27-SM).
Processing rate
A graphical description of the clusters associated with the processing rate is given in Figure 6. Cluster 1 is composed of 3537 elements, which respond to a great extend within 3 hours from the stimulus with a moderate and sharp monotonic increase of k2. Because of the timing of the response and the extension of this cluster, this behaviour could be due to the general feedback mechanisms which link RNA synthesis and processing.
On the other hand, Cluster 2 is composed of 1212 genes characterised by small and late, both positive and negative, monotonic responses. These are probably secondary responses mediated by the remarkable number of genes involved in the RNA processing regulation perturbed by MYC activation. The weakness of these modulations is puzzling but, since they take place in the most coarse-grained part of the time-course, our approach may only detect a reflex of the real biological regulation, that could occur between the experimental observations. It would be interesting to further investigate this population of transcripts through an experiment, with an ad-hoc temporal design. However, this is beyond the scope of this manuscript.
The Gene Ontology enrichment analysis of Cluster 2 does not provide any significant terms, while several biological processes, already mentioned while discussing the synthesis rate response, can be found for Cluster 1. However, the enrichments are less significant (Figures 5-SM and 6-SM) and they disappear when the 4897 differentially expressed genes are used as the background instead of all the annotated ones (Figures 17-SM and 18-SM).
The same is true for all of the 5 clusters with more than 150 genes which our method returns for the processing steady-state rate.
Degradation rate
The modulations of the degradation rate are divided into three clusters with more than 150 genes each, see Figure 7. Cluster 1 is composed of 3094 elements characterised by very quick peak-type responses and with the minimum between 1 and 2 hours. As we have seen for k2, this behaviour may be due to the coupling with the synthesis rate. Cluster 2 and Cluster 3 account for 1279 and 438 genes, respectively, characterised by late, both positive and negative, monotonic responses, which are probably secondary. The situation is similar to the one previously described for late processing rate modulations especially for Cluster 2, and could point to indirect regulations of the stability of the transcripts under-sampled due to the temporal design of the experiment. The Gene Ontology enrichment analysis results are analogues to those we discussed for the processing rate (Figures 7-SM and 8-SM), and also to those of the 5 steady-state clusters.
7. Final remarks
Motivated by a real data application, we here propose a Bayesian approach to the analysis of RNA expression levels. In our proposal, the experimental data are hypothesised to be noisy observations of a true process, which is a solution to a system of ODEs. We assume that the ODEs depend on KRs, which are time- and gene-dependent. The KRs are the main object of inference, since they characterise the RNA life cycle and provide important insights into the analysis of gene expression levels. The temporal evolution of KRs is encoded in a single family of functions, defined by only 5 parameters that can be easily interpreted from the biological perspective (i.e. initial value, relative log-fold-changes, and temporal location and duration of the response). The parameters are divided into two groups, according to their role in defining either the initial value of the KR or its temporal modulation. A mixture model, based on the DP, is defined for both of them, and for each KR. This allows us to find sets of genes with similar KRs shapes or steady-state values to guide the inference. This approach is conceptually based on the well-established co-regulation of genes, which a cell often exploits to coordinate of the expression level of multiple transcripts required to operate a specific task. Therefore, the idea of including a clustering step in the inference process is not only biologically robust, but also provides a valuable piece of information.
Some identification problems have arisen, owing to the complex structure of our model. However, we have managed to solve them using latent variables and identification constraints in such a way that the MCMC algorithm can still only use Gibbs updates for the mixture parameters.
The results obtained with the new method are biologically relevant. The enrichment analysis of the clusters results in meaningful sets of terms in the context of MYC biology, and which are in conceptual agreement with the shape of the responses. This is particularly true for the synthesis rate, which is the most informative regulatory layer in this specific biological system. However, our m ethod manages to identify a remarkable fraction of genes as post-transcriptionally regulated, thus pointing to weak responses that are compatible with the adjustments of processing and degradation rates in response to the transcriptional perturbations and indirect secondary responses concentrated in the last portion of the time-course. Our method tends to describe the regulation of RNA metabolism more as a choral action of the three kinetic rates than the frequentist approach. The inclusion of Dirichlet-based clustering in the inference improves the goodness of fit.
A limitation of our method is the independence of the synthesis, processing and degradation rate clusters. In principle, this could be overcome by defining a mixture model on the parameters that shape the response of multiple rates. However, this inference would take place in a much larger space, which would be difficult to handle. We preferred to acquire a full picture of each single response for our analysis, but we anticipate that our framework could also be adapted in this way.
We conclude by stressing that this proposal represents an inference framework for chemical reaction network coefficients which could be used to improve the methods currently available in computational biology to study dynamic phenomena in large omics datasets.
8. Implementation
The source code that implements the methodology is available at https://github.com/GianlucaMastrantonio/Multiple_latent_clusterisation_model_for_the_inference_of_RNA_life_cycle_kinetic_rates.
Acknowledgements
The work of the first two authors has partially been developed under the MIUR grant Di-partimenti di Eccellenza 2018 - 2022 (E11G18000350001), conferred to the Dipartimento di Scienze Matematiche - DISMA, Politecnico di Torino. The computational resources were provided by the Dipartimento di Scienze Matematiche - DISMA, Politecnico di Torino. The authors would like to thank Mattia Pelizzola (CGS@SEMM - IIT) for his critical comments on the manuscript.
Footnotes
Wrong references to tables fixed