Bayesian model-based method for clustering gene expression time series with multiple replicates

In this study, we introduce a Bayesian model-based method for clustering transcriptomics time series data with multiple replicates. This technique is based on sampling Gaussian processes (GPs) within an infinite mixture model from a Dirichlet process (DP). Our method uses multiple GP models to accommodate multiple differently behaving experimental replicates within each cluster. We call it multiple models Dirichlet process Gaussian process (MMDPGP). We compare our method with state-of-the-art model-based clustering approaches for handling gene expression time series with multiple replicates. We present a case study where all methods are applied to cluster RNA-Seq time series of Clostridium botulinum with three different experimental replicates. The results obtained from the gene enrichment analysis show that the number of significantly enriched sets of genes is larger in the clusters produced by MMDPGP. To demonstrate the accuracy of our method we use it to cluster synthetically generated data sets. The clusters produced by our method on the synthetic data had a significantly higher purity score compared to the state-of-the-art approaches. By modelling each replicate with a separate GP, our method can use the natural variability between experimental replicates to learn more about the underlying biology.


Introduction
Clustering genes based on their transcription dynamics can yield invaluable information about the roles of these genes in different biological processes and their complex interaction network. The transcription dynamics is captured by gene expression time series, which can be collected in multiple replicates to take into consideration the biological variation and measurement errors [1].
Traditional clustering methods such as k-means, hierarchical clustering or self-organizing maps rely on correlation or Euclidean distance, which may not account for dependencies in gene expression levels across time points [2]. Alternatively, model-based clustering methods assume that the data are generated from a mixture of underlying probability distributions, each representing a different cluster [3]. However, previously published model-based clustering methods for gene expression time series have paid little to no attention to the variability between experimental replicates.
In this work, we present a Bayesian model-based clustering technique for multi-replicate transcriptomics time series, which uses a Dirichlet process (DP) [4] for generating clusters and multiple Gaussian process (GP) [5] models to represent the different replicates inside each cluster. We call our method the multiple models Dirichlet process Gaussian process (MMDPGP) mixture model. This approach incorporates a measure of uncertainty into the cluster assignments. It also provides the added benefit of automatically identifying an optimal number of clusters based on different optimality criteria, instead of arbitrarily setting it to a fixed value as in the case of k-means clustering, for instance.
We compare MMDPGP with two similar model-based approaches proposed by Hensman et al. [6] and McDowell et al. [2]. The approach presented by Hensman et al. is called mixture of hierarchical Gaussian processes (MOHGP); it models the replicates using a single GP model in each cluster and samples the clusters from a DP. The software implementation of MOHGP indicated in the original Hensman et al. paper [6] was based on a variational inference technique. In order to provide a better comparison, we implemented it using a Gibbs sampler written in Python, similarly to [2].
On the other hand, McDowell et al. handle multiple experimental replicates by using the average of the replicates at each time point as the data in their GP model. They call their method Dirichlet process Gaussian process (DPGP). Unlike these alternative methods, MMDPGP provides a finer and more detailed modelling of each replicate, which could help uncover clusters of genes that always behave similarly to each other, but together may behave differently across different replicates because of natural variability.
We use the methods for the purpose of clustering recently collected RNA-Seq time series. The RNA-Seq data were obtained by collecting RNA samples at different time points from growth experiments of three replicate cultures of Clostridium botulinum strain ATCC 19397.

Related Methods
Clustering gene expression patterns according to their similarity has been the subject of many studies [2, 6-9]. Rasmussen et al. [7] proposed a Dirichlet process mixture (DPM) model as an alternative to the bootstrap approach for modeling uncertainty in gene expression clustering. This method calculates the likelihood of two genes being in the same cluster as a measure of similarity. It does this by considering the primary factors of model uncertainty, such as the number of clusters and their location. They used a Gibbs sampler to implement their model and applied it to cluster high-dimensional non-time series gene expression data.
Medvedovic et al. [8] developed different clustering procedures based on Bayesian mixture models. Their method was designed specifically for microarray gene expression data with experimental replicates. They model the distribution of their data as a Bayesian mixture model and consider both finite and infinite Bayesian mixture models in their modeling approach. However, they do not present any application of their method to time series data.
Cooke et al. [9] introduced an extension to the Bayesian hierarchical clustering (BHC) algorithm [10] for use with time series data, and tested it on microarray time series. BHC uses a one-pass, bottom-up greedy approach which assigns each data point to a separate cluster and iteratively merges cluster pairs, similarly to traditional agglomerative clustering. Their algorithm utilizes GP regression to model the structure of the data and can accommodate multiple replicate measurements of genes to enable a better discrimination of gene expression profiles.
Hensman et al. [6] introduced a clustering method which they called mixture of hierarchical Gaussian processes (MOHGP), based on GPs and a DP. Their method can cluster time series with multiple replicates, but they consider only one way of modeling replicates, which consists of clusters that are parameterized by a single GP. They provide an efficient implementation of MOHGP based on a variational inference procedure.
McDowell et al. [2] presented a method for clustering gene expression time series by combining a DP with a GP, which they called simply DPGP. Their method is specialised in clustering gene expression time series consisting of a single replicate. If multiple replicate measurements are provided, it simply calculates their average and uses that in the rest of the analysis, thus not considering the variability between replicates.
In this work we focus on non-parametric clustering methods for time series which can handle the information provided by different replicates. Here we propose MMDPGP, a method for clustering gene expression time series which leverages the information provided by the variability between different experimental replicates.

Methods

Gaussian Processes
Gaussian processes are non-parametric methods used for building probabilistic models of functional data [5]. In many previous studies, GPs have been widely recognized and adopted for modeling transcriptomics data [1, 2, 11-16]. A GP is completely characterized by a mean function m(t) and a covariance (or kernel) function k(t, t′). It defines a probability distribution over functions f(t), which we denote as

f(t) ∼ GP(m(t), k(t, t′)).    (1)

Choosing the kernel is an important step in the process of modeling with GPs, as it determines our assumptions about the function that we want to model. The most widely used kernel function for modelling with GPs is the exponentiated quadratic function [5], which is defined as follows:

k(t, t′) = σ_f^2 exp(−(t − t′)^2 / (2 l^2)).    (2)

The exponentiated quadratic kernel function is characterized by two parameters: the length scale l and the variance σ_f^2. The length scale determines the oscillatory behavior of the function, specifically influencing the extent of fluctuations. Generally, the predictive capability of the model is limited to a range not exceeding l units from the observed data, indicating constraints on the model's extrapolation potential beyond this threshold. The variance determines the average distance of the function away from the mean. We will also consider a third parameter σ_n^2, the Gaussian noise, which represents the observation noise of the data in the GP model. In the rest of the analysis, when we write k(t, t′) the exponentiated quadratic kernel is implied unless specified otherwise. Our method uses GPs as a parametrization for each cluster of genes (i.e., all genes within the same cluster share the same GP profiles).
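The kernel and its parameters can be sketched in a few lines of code. The sketch below is purely illustrative (the function name `eq_kernel` and all parameter values are our own choices, not from the paper's software); it builds the exponentiated quadratic covariance matrix and draws one noisy GP prior sample from it.

```python
# Minimal sketch of the exponentiated quadratic kernel and a GP prior sample.
import numpy as np

def eq_kernel(t, t_prime, length_scale=1.0, sigma_f=1.0):
    """k(t, t') = sigma_f^2 * exp(-(t - t')^2 / (2 * l^2))."""
    d = np.subtract.outer(t, t_prime)
    return sigma_f**2 * np.exp(-d**2 / (2.0 * length_scale**2))

t = np.linspace(0.0, 10.0, 50)
K = eq_kernel(t, t, length_scale=2.0, sigma_f=1.0)

# One GP prior sample with zero mean and observation noise sigma_n^2 on the diagonal:
sigma_n = 0.2
rng = np.random.default_rng(0)
f = rng.multivariate_normal(np.zeros_like(t), K + sigma_n**2 * np.eye(len(t)))
```

Increasing `length_scale` yields smoother samples, while `sigma_f` scales their amplitude, matching the parameter roles described above.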

Infinite mixtures of Gaussian processes
Bayesian nonparametric (BNP) techniques differ from traditional methods in how they address model complexity. In standard cluster analysis methods like k-means, it is required to evaluate a range of models, each with a different parameter value (e.g., k = 1, . . ., N clusters), and the optimal model is then selected based on various metrics for model comparison. Traditional methods are characterized by a finite number of parameters. In contrast, BNP methods are grounded in a statistical framework that can accommodate a potentially infinite number of parameters. This characteristic makes them flexible and removes the burden of choosing the number of parameters which could best model the system.
The Dirichlet process (DP) [4] is an example of a BNP technique which we use in our clustering method. A DP is a nonparametric probability distribution over probability distributions, and it is used as a flexible prior for unsupervised learning tasks like clustering and density modeling [17]. We denote a probability distribution sampled from a DP as

G ∼ DP(α, H),    (3)

where α is a positive real number called the concentration parameter, and H is a probability distribution called the base measure.
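A draw G ∼ DP(α, H) can be approximated with the standard stick-breaking construction. The sketch below is illustrative only: the truncation level and the standard-normal base measure H are our own choices for the example.

```python
# Truncated stick-breaking sketch of a draw G ~ DP(alpha, H).
import numpy as np

def stick_breaking(alpha, base_measure, truncation, rng):
    """Return (weights, atoms) of a truncated draw G ~ DP(alpha, H)."""
    betas = rng.beta(1.0, alpha, size=truncation)          # stick proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights = betas * remaining                             # mixture weights
    atoms = base_measure(truncation)                        # draws from H
    return weights, atoms

rng = np.random.default_rng(1)
weights, atoms = stick_breaking(alpha=1.0,
                                base_measure=lambda n: rng.standard_normal(n),
                                truncation=1000, rng=rng)
```

Small α concentrates most of the weight on a few atoms (few large clusters), while large α spreads the mass more evenly, mirroring the role of the concentration parameter described below.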
In the context of clustering, a DP is used to generate priors for a Dirichlet process mixture model (DPMM), which is a mixture model that accounts for a theoretically infinite number of mixture components [18]. Several authors have extended the concept of DPMM to include GP models, calling them infinite mixtures of Gaussian process experts, infinite mixtures of global Gaussian processes, or Dirichlet process Gaussian process models [2, 6, 19-22].
In this paper we use a Dirichlet process Gaussian process model to cluster time course transcriptomics data using multiple replicates. Below we define the data and the model used for the clustering.
Let g_i = [g_ij]_{j=1}^T, where g_ij represents the gene expression value of gene i at time t_j. We denote by f_i(t) the latent function that models the time series [(t_j, g_ij)]_{j=1}^T representing the gene expression values of gene i at every time point t_j. Then, we define the following generative model for our data:

G ∼ DP(α, H),    (4a)
θ_κ ∼ G,    (4b)
f_i(t) ∼ GP(m_κ(t), k_κ(t, t′)),    (4c)

where G is a sample from DP(α, H) as defined in Eq (3) (small values of α generate fewer clusters with bigger sizes, while large values generate more clusters with smaller sizes), and θ_κ = {m_κ(t), l_κ, σ_{f,κ}, σ_{n,κ}} is a set of GP latent parameters for cluster κ which is drawn from the distribution G. Finally, we set f_i(t) to be a sample drawn from the GP with mean function m_κ(t), exponentiated quadratic kernel parameterized by l_κ, σ_{f,κ}, and observation noise σ_{n,κ}. It is possible to express the conditional distribution of a randomly sampled cluster parameter θ_κ dependent on all the previously sampled parameters θ_\κ by integrating over G and obtain

θ_κ | θ_\κ ∼ (1 / (K − 1 + α)) Σ_{j≠κ} δ(θ_j) + (α / (K − 1 + α)) H,    (5)

where δ is the Kronecker delta function, K is the number of sampled parameters, and θ_\κ is the set of all sampled parameters except for θ_κ. This suggests a practical solution to the problem of sampling from a DP, because drawing a concrete distribution G is impossible in practice.
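The conditional in Eq (5) is the Pólya-urn / Chinese-restaurant-process scheme: an existing parameter is reused with probability proportional to how often it has been drawn, and a fresh draw from H appears with probability proportional to α. A minimal sketch of sequential sampling under this scheme (hypothetical function name, labels standing in for the θ values):

```python
# Chinese-restaurant-process sketch of the Polya-urn conditional in Eq (5).
import numpy as np

def crp_assignments(n, alpha, rng):
    """Sequentially sample cluster labels for n items under a DP(alpha, H) prior."""
    labels = [0]                                  # first item starts cluster 0
    for i in range(1, n):
        counts = np.bincount(labels)              # occupancy of existing clusters
        # Existing cluster kappa: prob ~ count; new cluster: prob ~ alpha.
        probs = np.append(counts, alpha) / (i + alpha)
        labels.append(rng.choice(len(probs), p=probs))
    return np.array(labels)

rng = np.random.default_rng(2)
labels = crp_assignments(200, alpha=1.0, rng=rng)
```

With α = 1 the number of occupied clusters grows only logarithmically with the number of items, which is why a concrete G never needs to be drawn.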

Model selection for fitting multiple replicates
Gene expression time series can be collected in multiple replicates to take into consideration the biological variation and measurement errors [1]. However, the information provided by multiple replicates is often disregarded by clustering methods, or the replicates are used merely for estimating an average value which is later used in the downstream analysis, as is the case in [2].
In this paper we construct and analyze three different approaches which take into consideration all replicate measurements of genes in the process of clustering genes by using a DP and GPs. The first method, MOHGP, was proposed by Hensman et al. [6]. The second method, DPGP, was proposed by McDowell et al. [2]. The third is our proposed approach, which we call multiple models DPGP (MMDPGP). Conceptually, they differ in the form of Eq (4c) in the generative model defined by Eq (4a)-(4c). Below we explain the details of each approach.

MOHGP (Hensman et al.)
In the case of MOHGP, for each cluster κ, Hensman et al. [6] define a regression problem over functions f_i : t → f_i(t), which are generated by the following model:

f_i(t) ∼ GP(m_κ(t), k(t, t′)).    (6)

We denote by g_i^(r) = [g_ij^(r)]_{j=1}^T the vector representing the gene expression values of gene i at every time point t_j of replicate r. The model assumes that these are noisy observations of f_i(t). If we denote by μ_i = [f_i(t_j)]_{j=1}^T the vector representing the values of f_i(t) at each time point t_j, then the conditional distribution for g_i^(r) can be written as

g_i^(r) | μ_i ∼ N(μ_i, σ_n^2 I),    (7)

where σ_n^2 represents the noise variance between all replicates across all genes inside the cluster.
To make inference easier, f_i(t) is marginalized out and the model is expressed by conditioning on m_κ(t). In the marginalised model, the covariance between two observations g_ij^(r) and g_ij′^(r′) of the same gene is

cov(g_ij^(r), g_ij′^(r′)) = k(t_j, t_j′) + σ_n^2 δ_jj′ δ_rr′,    (8)

where δ is the Kronecker delta.
Let K_HEκ be the covariance matrix defined by the kernel function in Eq (8) over the RT stacked observation points. If we denote by μ_κ = [m_κ(t_j)]_{j=1}^T, the stacked vector of observed values g_i = [g_i^(1), …, g_i^(R)] can then be sampled from the following multivariate normal distribution:

g_i ∼ N(μ_κ, K_HEκ),

where μ_κ is repeated once for each replicate. The value of μ_κ is determined as an approximation by setting it to the average of the observations in the cluster,

μ_κ = (1 / (N_κ R)) Σ_{i=1}^{N_κ} Σ_{r=1}^R g_i^(r),

where N_κ is the number of genes in the cluster. The generative model of MOHGP can be written as:

G ∼ DP(α, H),    θ_κ ∼ G,    g_i ∼ N(μ_κ, K_HEκ).

The computational complexity for fitting the GP of a single cluster in this model is therefore O(N_κ(RT)^3) in time and O(N_κ(RT)^2) in space, where N_κ, T and R represent the number of genes in the cluster, the number of time points for which we have measurements, and the number of replicates.
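The structure of K_HEκ can be illustrated concretely: because all replicates of a gene share the same latent f_i(t), the kernel block k(t_j, t_j′) is repeated across every pair of replicate blocks, giving one dense RT × RT matrix. The sketch below (our own names and parameter values, not the authors' code) builds this matrix and evaluates the O((RT)^3) marginal likelihood for one simulated gene.

```python
# Sketch of the MOHGP-style stacked covariance and marginal likelihood for one gene.
import numpy as np
from scipy.stats import multivariate_normal

def eq_kernel(t, length_scale, sigma_f):
    d = np.subtract.outer(t, t)
    return sigma_f**2 * np.exp(-d**2 / (2.0 * length_scale**2))

T, R = 10, 3
t = np.linspace(0.0, 12.0, T)
K_block = eq_kernel(t, length_scale=3.0, sigma_f=1.0)

# Shared latent f_i(t) => the kernel block repeats across all replicate pairs,
# with sigma_n^2 only on the diagonal (Eq (8)):
K_HE = np.tile(K_block, (R, R)) + 0.2**2 * np.eye(R * T)

mu = np.zeros(R * T)                       # stand-in for the cluster mean mu_kappa
rng = np.random.default_rng(3)
g = rng.multivariate_normal(mu, K_HE)      # one stacked observation g_i
loglik = multivariate_normal(mu, K_HE).logpdf(g)   # O((RT)^3) factorisation cost
```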

DPGP (McDowell et al.)
Similarly to the previous approach, DPGP assumes that the latent mean function of each observed gene expression time series is sampled from a cluster-specific GP as in Eq (6). However, the model is limited to gene expression time series consisting of a single replicate. When multiple replicates are provided, DPGP collapses them into a single time series by computing their mean at each observed time point, ḡ_i = (1/R) Σ_{r=1}^R g_i^(r). Following the derivations of MOHGP from the previous section, but in the case of a single replicate, the generative model for DPGP can be written as

G ∼ DP(α, H),    θ_κ ∼ G,    ḡ_i ∼ N(μ_κ, K_κ),

where K_κ is the T × T covariance matrix of the cluster. The computational complexity for fitting the GP of a single cluster for DPGP is O(N_κ T^3) in time and O(N_κ T^2) in space.

MMDPGP
In the case of MMDPGP we take another approach by assuming that the set of functions f_i : t → f_i(t) are generated from replicate-specific GPs within each cluster κ, and we denote this association with a superscript r (i.e., f_i^(r)(t)). The generative model for f_i^(r)(t) will then be

f_i^(r)(t) ∼ GP(m_κ^(r)(t), k^(r)(t, t′)).

If we denote by μ_i^(r) = [f_i^(r)(t_j)]_{j=1}^T the vector of evaluations of f_i^(r)(t) at the time points t_j, the vector g_i^(r) representing the gene expression values of gene i at every time point t_j of replicate r can be sampled from the following replicate-specific conditional distribution:

g_i^(r) | μ_i^(r) ∼ N(μ_i^(r), (σ_n^(r))^2 I),

where (σ_n^(r))^2 is the observation noise between genes of the same replicate in the cluster. In order to perform inference on the model we can marginalize out f_i^(r)(t) and express the model conditioned on m_κ^(r)(t). In the marginalized model the expression values of the same gene from different replicate measurements have zero covariance, as they come from different GPs, so we can reduce the computations by building smaller covariance matrices with the observations from each replicate separately:

cov(g_ij^(r), g_ij′^(r)) = k^(r)(t_j, t_j′) + (σ_n^(r))^2 δ_jj′.    (12)

If we denote by K_MMκ^(r) the covariance matrix in Eq (12) for genes in replicate r, we can sample g_i^(r) as

g_i^(r) ∼ N(μ_κ^(r), K_MMκ^(r)).

We can approximate the value of μ_κ^(r) as the average μ_κ^(r) = (1/N_κ) Σ_{i=1}^{N_κ} g_i^(r), where N_κ is the number of genes in cluster κ. The generative model of MMDPGP will then be

G ∼ DP(α, H),    θ_κ ∼ G,    g_i^(r) ∼ N(μ_κ^(r), K_MMκ^(r)).

The computational complexity for fitting the GPs of a single cluster in the case of MMDPGP is O(N_κ R T^3) in time and O(N_κ R T^2) in space, which is significantly lower than MOHGP but larger than DPGP.
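The computational saving comes from the block-diagonal structure: since replicates are independent given their replicate-specific GPs, the likelihood factorises into R terms, each involving only a T × T matrix. A sketch of this factorised likelihood for one simulated gene (all names and values are our own, not the authors' implementation):

```python
# Sketch of the MMDPGP factorised likelihood: R terms with T x T matrices,
# O(R * T^3) instead of the O((RT)^3) cost of one stacked matrix.
import numpy as np
from scipy.stats import multivariate_normal

def eq_kernel(t, length_scale, sigma_f):
    d = np.subtract.outer(t, t)
    return sigma_f**2 * np.exp(-d**2 / (2.0 * length_scale**2))

T, R = 10, 3
t = np.linspace(0.0, 12.0, T)
rng = np.random.default_rng(4)

loglik = 0.0
for r in range(R):
    # Each replicate r has its own kernel parameters and noise (fixed here for brevity):
    K_r = eq_kernel(t, length_scale=3.0, sigma_f=1.0) + 0.2**2 * np.eye(T)
    mu_r = np.zeros(T)                         # stand-in for mu_kappa^(r)
    g_r = rng.multivariate_normal(mu_r, K_r)   # replicate-r observations of gene i
    loglik += multivariate_normal(mu_r, K_r).logpdf(g_r)  # independent factor
```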

Implementation
In this section, we focus exclusively on describing our implementations of MMDPGP and MOHGP (Hensman et al.). In the case of DPGP (McDowell et al.), we utilized their software, which is extensively detailed in [2].
In the Gaussian process Dirichlet process model (Eq (4a)-(4c)) the DP generates the distribution parameters θ_κ = {m_κ^(r)(t), l_κ^(r), σ_{f,κ}^(r), σ_{n,κ}^(r)}_{r=1}^R for the cluster-specific GP(s). For m_κ^(r)(t) we denote by μ_κ^(r) the vector of its evaluations at the observed time points. We set conjugate priors for the variance and the observation noise as suggested in [2], which fit the characteristics of our data. We set bounding constraints on the length scale l^(r), ∆t_min ≤ l^(r) ≤ ∆t_max, where ∆t_min = min(t_i − t_{i−1}) and ∆t_max = max(t_i − t_j), as described in [23], where t_i, t_j are time points from the gene expression time series. Finally, we sample the means of the replicates μ_i^(r) following Eq (6) and Eq (12) for MOHGP and MMDPGP accordingly. We initialise the parameters of each cluster with default values (e.g., (l_0, σ_f0, σ_n0) = (1, 1, 0.2)) to determine the initial covariance matrices K_MM0 and K_HE0, and for μ_0 we sample a multivariate normal with zero mean and covariance matrix K_HE0 or K_MM0 for each method respectively.
For the purpose of our clustering method we have used an implementation [2] of Neal's Gibbs sampling "Algorithm 8" [24], which yields the posterior probability for assigning a gene expression time series to a particular cluster. If we denote by c_i the cluster of gene i and by c_\i the set of cluster assignments of all genes except gene i, the algorithm determines the probability of assigning gene i to any cluster κ conditioned on all other cluster assignments as

P(c_i = κ | c_\i, g_i, α) ∝ P(c_i = κ | c_\i, α) L(g_i | θ_κ).

P(c_i = κ | c_\i, α) is a prior probability which does not depend on any characteristics of the time series for gene i but is based solely on the information about cluster κ:

P(c_i = κ | c_\i, α) = n_{κ,\i} / (N − 1 + α) for an existing cluster κ, and (α/m) / (N − 1 + α) for each of the m auxiliary (new) clusters,

where n_{κ,\i} is the number of genes assigned to cluster κ excluding gene i, m is the number of new clusters at each iteration, which we set to 4, α is the clustering concentration parameter of the DP, which we set to 1 as it has been found convenient in other gene clustering applications [25], and N is the total number of clustered genes. However, while testing different values of α and m we noticed that their impact on the number of clusters generated by MMDPGP is insignificant, due to the good fit of the genes in the existing clusters.
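The prior part of the assignment probability is simple to compute. The sketch below (hypothetical function name, toy occupancy counts) evaluates it for a small example with three existing clusters and m = 4 auxiliary clusters; the resulting vector sums to one by construction.

```python
# Sketch of the "Algorithm 8" assignment prior P(c_i = kappa | c_\i, alpha).
import numpy as np

def assignment_prior(counts, alpha, m, n_total):
    """counts: occupancy of existing clusters, excluding gene i."""
    existing = np.asarray(counts, dtype=float)          # n_{kappa,\i} terms
    new = np.full(m, alpha / m)                         # m auxiliary clusters
    return np.concatenate([existing, new]) / (n_total - 1 + alpha)

# Example: 3 existing clusters with 40, 30 and 29 genes, alpha = 1, m = 4, N = 100.
probs = assignment_prior([40, 30, 29], alpha=1.0, m=4, n_total=100)
```

In the full sampler these prior weights are multiplied by the likelihood L(g_i | θ_κ) of the gene's replicates under each cluster's GP(s) and renormalised before drawing the new assignment.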
The likelihood L(g_i | θ_κ) expresses how well the observations [g_i^(r)]_{r=1}^R for all replicate measurements of gene i are generated by the GP(s) which parameterize cluster κ, and is given by Eq (19) and Eq (20) for MOHGP and MMDPGP respectively. Additionally, μ_κ and μ_κ^(r) are calculated as averages, as indicated in the MOHGP and MMDPGP sections.
In order to select the optimal clusters for the genes we use a posterior similarity matrix, which gives the proportion of samples generated by the Gibbs sampler in which every pair of genes (i, j) is found in the same cluster, as described in [2].
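A posterior similarity matrix is straightforward to assemble from the sampler's label traces. The sketch below (our own function name, toy label samples) averages pairwise co-assignment indicators over the Gibbs samples.

```python
# Sketch of a posterior similarity matrix from Gibbs-sampled cluster labels.
import numpy as np

def posterior_similarity(label_samples):
    """label_samples: (n_samples, n_genes) array of cluster labels."""
    samples = np.asarray(label_samples)
    n = samples.shape[1]
    psm = np.zeros((n, n))
    for labels in samples:
        # 1 where genes i and j share a label in this sample, 0 otherwise.
        psm += (labels[:, None] == labels[None, :])
    return psm / samples.shape[0]

# Toy trace: 3 Gibbs samples of labels for 4 genes.
samples = np.array([[0, 0, 1, 1],
                    [0, 0, 0, 1],
                    [2, 2, 1, 1]])
psm = posterior_similarity(samples)
```

Genes that always co-cluster get similarity 1; the final partition can then be chosen as the sampled clustering closest to this matrix, as in [2].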
Genes belonging to the same cluster are potentially involved in the same processes, in what is called guilt by association [32]. The gene expression data, which consist of values in TPM (i.e., transcripts per million) units, were normalised using a spike-in-based normalisation. Genes with low expression values were then filtered out, as low values are not strongly reliable and can introduce a potential source of noise in the analysis. A detailed description of the data preprocessing, including data preparation, normalisation and filtering, is included in S1 Appendix.
Before fitting the transcriptomics time courses to our model we artificially pulled the last two time points of each time series closer to the rest; that is, the 15h and 24h time points were reassigned to 13.5h and 15h respectively. This is because the model assumes that the transcription dynamics of the genes happen with the same intensity throughout the time course, which is not a realistic assumption in our case (i.e., in the beginning of the life cycle [2h-12h] we expect rapid non-monotonic changes, while at the end the changes are very small and monotonic). Alternatively, the problem could be addressed by using a non-stationary kernel similarly to [33], but given that the change in transcription dynamics is expected only at the later time points, we implemented this simpler approach.

Clustering of the genes using RNA-Seq time series
We applied the multi-replicate clustering methods MMDPGP, MOHGP [6] and DPGP [2] on the RNA-Seq data to group the genes according to their transcription dynamics. Even though we used the same values for initializing the parameters of the methods, we obtained very different results. MMDPGP produced 26 different clusters, MOHGP returned 116 clusters where most of the clusters have a small number of genes, while DPGP yielded only 13 clusters. The reason is that in the case of MOHGP, during each iteration of the Gibbs sampler, the likelihood of each gene belonging to a new cluster is higher. This is because there is only one GP for each cluster, which captures the variation between all replicates; given a non-negligible difference between replicates, this often results in a poor fit for the gene under consideration, and the gene is assigned to a new cluster. In the case of MMDPGP, however, every cluster has a separate GP for each replicate and the biological replicates are compared on how well they fit the replicate-specific GP at each cluster, thus resulting in a better fit at the cluster whose genes express similarly in the same replicate. DPGP, on the other hand, ignores the variation between replicates and performs clustering using the replicates' means. When taking the mean of the replicates, the replicate-specific nuances of each gene vanish, resulting in a larger number of similar time series and hence a smaller number of clusters. Fig 1 shows the GP models for three clusters generated by the methods. We can see how, in the case of MMDPGP, the GPs of different replicates differ: even though the replicates are similar, they have their own specific characteristics depending on the natural variation between the different replicates. When exploited, this natural variation can be a useful source of additional knowledge during the clustering procedure.

Biological significance analysis
To understand the biological relevance of our results, we carried out a gene enrichment analysis using COG annotations assigned to each gene (see "Data pre-processing" subsection). We chose this method because the standard gene ontology (GO) enrichment analysis was not available for C. botulinum, which is not annotated in the GO database.

Leave-one-out predictive likelihood
To estimate the quality of the fit of the genes in their corresponding clusters, we performed a leave-one-out predictive likelihood analysis as described in [34]. This is a standard measure for performing model selection with GPs. To carry it out, we refit the data of each cluster to its respective GP model by removing one time point from each gene/operon at a time. At each iteration, we calculated the log-likelihood of the model predicting the observed gene expression values at the time point that was removed. We did this for all time points (i.e., 13 time points) and divided the results by the total number of single observations, to express them in natural-logarithm units per observation (nat/obs.). This normalisation allows for a fair comparison, given that in the case of DPGP the number of observations is 3 times smaller, because only the mean of the 3 replicates is used as observation data. Fig 3 shows the results of the leave-one-out predictive likelihood analysis. We can see that MOHGP performs slightly better in this analysis, having a higher predictive likelihood for the time points overall compared to MMDPGP. This result can be explained by the fact that in the case of MOHGP most of the clusters have a very small number of genes, which results in a better modeling of the data, but at the expense of producing less biologically meaningful clusters, as indicated by the gene enrichment analysis.
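The procedure above can be sketched for a single time series: hold out one observation at a time, fit the GP posterior on the rest, and score the held-out value under the posterior predictive, normalising to nat/obs. All names and parameter values below are our illustrative choices, not the paper's code.

```python
# Sketch of a leave-one-out GP predictive log-likelihood in nat/obs.
import numpy as np
from scipy.stats import norm

def eq_kernel(a, b, length_scale=2.0, sigma_f=1.0):
    d = np.subtract.outer(a, b)
    return sigma_f**2 * np.exp(-d**2 / (2.0 * length_scale**2))

def loo_predictive(t, y, sigma_n=0.2):
    total = 0.0
    for j in range(len(t)):
        t_tr, y_tr = np.delete(t, j), np.delete(y, j)
        K = eq_kernel(t_tr, t_tr) + sigma_n**2 * np.eye(len(t_tr))
        k_star = eq_kernel(np.array([t[j]]), t_tr)[0]
        mean = k_star @ np.linalg.solve(K, y_tr)          # GP posterior mean
        var = (eq_kernel(np.array([t[j]]), np.array([t[j]]))[0, 0]
               + sigma_n**2 - k_star @ np.linalg.solve(K, k_star))
        total += norm(mean, np.sqrt(var)).logpdf(y[j])    # score the held-out value
    return total / len(t)   # natural-log units per observation

t = np.linspace(0.0, 12.0, 13)
rng = np.random.default_rng(5)
y = np.sin(t / 2.0) + 0.1 * rng.standard_normal(13)
score = loo_predictive(t, y)
```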

Performance on synthetic data with different amounts of noise
We tested the classification accuracy of the methods by generating time series data from 6 different clusters. Each cluster was represented by a GP with a different mean function and the same kernel parameters (l, σ_f) and observation noise σ_n^2. We sampled 10 different time series from each GP with mean m(t) ∈ {sin(x/2), −sin(x/2), sin(x/4), −sin(x/4), cos(x/2), −cos(x/2)} and introduced 2 additional replicates. For the first 5 clusters the replicates were generated by adding random noise with standard deviation σ. The additional replicates for the last cluster were obtained by horizontally shifting the initial samples to the right by ∆ and 2∆ respectively (e.g., suggesting genes which are expressed earlier or later in different replicates).
We simulated the process multiple times for different values of the GP model variance σ_f^2 ∈ {0, 0.…}, the noise σ ∈ {0, 0.2, 0.4, 0.5}, and the length of the phase shift ∆ ∈ {0, 0.3, 0.5}. The length scale and model noise were set to default values (i.e., (l, σ_n^2) = (2, 1)). Each simulation resulted in clusters' data with different attributes, determined by the triple (σ_f^2, σ, ∆).

Fig 3. Leave-one-out predictive likelihood of the methods (MMDPGP, MOHGP [6] and DPGP [2]) on the left-out time points for each observation. Each point represents the sum of log-likelihoods of the observed gene expression values at the removed time point over all the posterior cluster models, divided by the number of observations. The legend also shows the sums over all the removed time points for each model.
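The synthetic-data setup can be sketched as follows. This is our simplification of the description above (sampling directly around the mean functions rather than from full GP draws; all names are hypothetical): ten series per cluster, two extra replicates made either by additive noise or, for the last cluster, by phase shifts of ∆ and 2∆.

```python
# Sketch of the synthetic-data generation: six mean functions, two extra replicates.
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0.0, 10.0, 20)
means = [lambda x: np.sin(x / 2), lambda x: -np.sin(x / 2),
         lambda x: np.sin(x / 4), lambda x: -np.sin(x / 4),
         lambda x: np.cos(x / 2), lambda x: -np.cos(x / 2)]

sigma, delta = 0.2, 0.3
data = {}  # (cluster, replicate) -> array of 10 time series
for c, m in enumerate(means):
    base = m(t) + 0.1 * rng.standard_normal((10, len(t)))
    if c < 5:
        # Replicates generated by adding noise with standard deviation sigma.
        reps = [base,
                base + sigma * rng.standard_normal(base.shape),
                base + sigma * rng.standard_normal(base.shape)]
    else:
        # Last cluster: replicates phase-shifted right by delta and 2*delta.
        reps = [base,
                m(t - delta) + 0.1 * rng.standard_normal(base.shape),
                m(t - 2 * delta) + 0.1 * rng.standard_normal(base.shape)]
    for r, rep in enumerate(reps):
        data[(c, r)] = rep
```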
Then, we ran the clustering methods on each cohort of generated data. In order to evaluate the results obtained by the methods we used a measure of purity for the clusters as defined in [35]. Purity assesses clustering quality by evaluating how well it avoids mixing different types of objects within the same cluster. This is a desired property in our context if we want to generate hypotheses about the functionality of unknown genes based on known genes in the same cluster.
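Purity assigns each predicted cluster the size of its majority true class and normalises the sum by the total number of objects. A minimal sketch (our own function name; the exact definition follows [35]):

```python
# Sketch of the cluster purity score: sum of majority-class counts over N.
import numpy as np

def purity(true_labels, pred_labels):
    true = np.asarray(true_labels)
    pred = np.asarray(pred_labels)
    total = 0
    for k in np.unique(pred):
        members = true[pred == k]
        total += np.bincount(members).max()   # size of the majority class in cluster k
    return total / len(true)

# Perfectly separated clusters give purity 1.0; mixing lowers it.
score = purity([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2])
```

Note that purity is invariant to relabelling of the predicted clusters, but it rewards very fine partitions (many singleton clusters score 1.0), so it is most informative when compared across methods producing similar numbers of clusters.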

Discussion
Dirichlet processes (DPs) and Gaussian processes (GPs) can be combined and used to cluster functional data without making strong assumptions, such as fixing the number of clusters in advance. In this paper we proposed a model-based method for the purpose of clustering gene expression time series with multiple replicates. Our method uses a different GP to represent each replicate inside every cluster; for this reason we call it multiple models Dirichlet process Gaussian process (MMDPGP). We compared our method with two similar approaches: MOHGP, proposed by Hensman et al. [6], and DPGP, proposed by McDowell et al. [2]. We described the three methods and their generative model assumptions in detail. Python implementations were also provided for MMDPGP and MOHGP. One clear advantage of MMDPGP over MOHGP is its lower computational complexity, which makes it useful for handling larger amounts of data (e.g., data consisting of dense time series of thousands of genes, with many replicates).
While DPGP has a lower computational complexity compared to MMDPGP, it ignores the variation between replicates, which can be a useful source of information for clustering the genes in a biologically meaningful way.
We applied the clustering methods on a novel RNA-Seq time series data set consisting of three experimental replicates. We performed a COG enrichment analysis on the clusters produced by each method and showed that in the case of MMDPGP a higher number of significant enrichments (i.e., p < 0.05) are detected. Additionally, for higher levels of significance (i.e., p < 0.01), not only is the number of significantly enriched classes higher in the case of MMDPGP, but all the significant enrichments detected by the other methods were also detected by MMDPGP.
The leave-one-out predictive likelihood analysis showed that MOHGP performed better than MMDPGP, but that is because the clusters returned by MOHGP are very small, which can cause the GP of each cluster to overfit, thus yielding a higher predictive likelihood overall.
Finally, to demonstrate the accuracy of MMDPGP we generated synthetic data from six different clusters by introducing different levels of noise, and tested the performance of each method by measuring the cluster purity of the results. MMDPGP outperformed the other methods by scoring a significantly higher cluster purity, as shown in Fig 4. We believe that MMDPGP can be useful for clustering gene expression time series data with multiple replicates because it effectively captures the information provided by experimental replicates and uses it to provide a meaningful discrimination between gene expression profiles. By modeling each replicate with a separate GP it provides more details for each individual replicate, which can be useful for clustering in cases when replicates differ, for instance because of different experimental conditions.

Fig 1. Comparison between 3 clusters generated by MMDPGP (left column plots), MOHGP [6] (middle column plots), and DPGP [2] (right column plots). The three replicates in the MOHGP cluster and the DPGP cluster are modeled by one GP (top middle column plot and top right column plot). The three replicates in the MMDPGP cluster are modeled by three different GPs (top left plot). The replicate-specific trajectories of the genes in each cluster are plotted separately on top of their respective GP. In the case of MOHGP and DPGP they are plotted on top of the same GP model which characterizes the respective cluster; in the case of the MMDPGP cluster the trajectories are plotted on top of their respective replicate-specific GP. The x-axis is drawn to indicate the real time of the RNA-Seq measurements.
Fig 4 shows the purity values of the clusters obtained by running the MMDPGP, MOHGP and DPGP methods on different cohorts of synthetic data. As can be seen, MMDPGP outperforms the other methods on all cohorts of synthetic data in terms of the cluster purity measure. This means that MMDPGP is less likely to put time series originally generated by different clusters into the same one.