## Abstract

Characterizing the tumor microenvironment is crucial for improving responsiveness to immunotherapy and developing new therapeutic strategies. The fraction of different cell-types in the tumor microenvironment can be estimated from transcriptomic profiling of bulk tumor data via deconvolution algorithms. One class of such algorithms, known as reference-based, relies on a reference signature containing gene expression data for various cell-types. The limitation of these methods is that such a signature is derived from the gene expression of pure cell-types, which might not be consistent with the transcriptomic profiles of cells in solid tumors. On the other hand, reference-free methods usually require only a set of cell-specific markers to perform deconvolution; however, once the different components have been estimated from the data, their labeling can be problematic. To overcome these limitations, we propose BayesDeBulk, a new reference-free Bayesian method for bulk deconvolution based on gene expression data. Given a list of markers expressed in each cell-type (cell-specific markers), a repulsive prior is placed on the mean of gene expression in different cell-types to ensure that cell-specific markers are upregulated in a particular component. Contrary to existing reference-free methods, the labeling of different components is decided a priori through a repulsive prior. Furthermore, the advantage over reference-based algorithms is that the cell fractions and the gene expression of different cells are estimated from the data simultaneously. Given its flexibility, BayesDeBulk can be used to perform bulk deconvolution beyond transcriptomic data, based on other data types such as proteomic profiles or the integration of both transcriptomic and proteomic profiles.

## 1 Introduction

Solid tumors are composed of a variety of cell-types including immune and stromal cells. Quantifying the proportion of different cell-types in the tumor microenvironment is crucial for capturing patient heterogeneity and developing better therapeutic targets for precision medicine. In the last decade, different algorithms have been proposed for the estimation of the tumor microenvironment from bulk data. Some algorithms, known as reference-based, require gene expression of purified cells as prior information [2, 8, 16]. However, problems might arise when the gene expression of different cell-types in solid tumors is not consistent with this prior knowledge. In addition, this prior information might not be appropriate when performing the deconvolution based on other data types such as proteomic profiles.

Some algorithms have tried to overcome this lack of flexibility by proposing a semi-reference approach [3, 10, 14]. Recently, Tai et al. [14] proposed a semi-reference Bayesian method which jointly models gene expression from purified cells and from bulk data through a hierarchical model. This model is more flexible than reference-based methods since gene expression in different cell-types is inferred from the data; however, it relies on the assumption that the mean expression of a particular cell-type is the same in the reference data and in bulk data. Aran et al. (2017) [1] proposed a flexible tool for bulk deconvolution based on transcriptomic data, which requires only a list of markers expressed in each cell-type as prior information. However, this algorithm does not provide an estimate of the gene expression of different cell-types from the data. These estimates are particularly useful for performing differential expression analyses between tumor and adjacent normal tissues while accounting for immune and stromal infiltration.

To this end, there are many reference-free algorithms which can infer both cell-type proportions and gene expression in different cell-types [5–7, 13]. These algorithms represent a more flexible alternative to reference-based algorithms; however, once the components have been estimated, their interpretation and labeling might be problematic. Recently, Tang et al. [15] proposed an algorithm based on non-negative matrix factorization. This method recovers the identifiability and the labeling of different components using a penalized regression, in which markers expected to be less expressed in a particular cell-type are shrunk towards zero. For this purpose, for each cell-type, markers are divided into three categories: not expressed, expressed and highly expressed. However, stratifying markers into such categories might not be easy to achieve in practice.

To overcome these limitations, we propose BayesDeBulk, a new flexible Bayesian method for bulk deconvolution. Bayesian inference is very appealing in this framework since prior information for different cell-types can be flexibly incorporated through the prior. Given a list of markers expressed in a particular cell-type (cell-specific markers), a repulsive prior is placed on the mean of gene expression in different cell-types to ensure that cell-specific markers are upregulated in a particular component. Repulsive classes of priors were introduced by Petralia et al. [9] and have recently been extended to different applications [11, 12, 17, 18]. Contrary to existing reference-free methods, the labeling of different components is specified a priori through a repulsive prior. The cell fraction parameter is instead modeled through a spike-and-slab prior [4] in order to induce sparsity and identify cells which are not present in the tumor tissue. Contrary to reference-based algorithms, our framework estimates different cell-type fractions and the mean of gene expression in different cell-types from the data simultaneously. Given its flexibility, BayesDeBulk can be used to perform the deconvolution based on other data types such as methylation data and proteomic profiles or the integration of multi-omic data. The performance of our model is evaluated using extensive synthetic data and real data examples.

## 2 Method

### 2.1 Bulk Deconvolution

Since the expression of bulk tumor data is the average across different cells in the tumor microenvironment, the expression of gene $j$ for patient $i$, i.e., $y_{i,j}$, can be modeled as a Gaussian distribution whose mean is the weighted average of the expression of gene $j$ in different cell-types. Mathematically, $y_{i,j}$ is modeled as

$$y_{i,j} \sim N\left(\sum_{k=1}^{K} \pi_{i,k}\,\mu_{k,j},\; \sigma_j\right)$$

with $K$ being the total number of cell-types, $\pi_{i,k}$ being the fraction of the $k$-th cell-type for sample $i$, $\mu_{k,j}$ being the expression of gene $j$ for the $k$-th cell-type and $\sigma_j$ the variance of the $j$-th gene. Reference-based models would consider $\mu_{k,j}$ as fixed with measurements derived from existing pure cell transcriptomic data [2, 8, 16]; while reference-free models would estimate mean parameters $\{\mu_{k,j}\}$ from the bulk data. A Bayesian model would specify prior information for all parameters in the model; with conjugate priors being Gaussian distributions for $\{\pi_{i,k}\}$ and $\{\mu_{k,j}\}$ and inverse-gamma distributions for $\{\sigma_j\}$. However, this model would not be identifiable without further constraints on the parameter space. To overcome this problem we propose a Bayesian model where identifiability is recovered via a repulsive prior specified on the mean parameters [9].
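The Gaussian mixing model of this section can be illustrated with a minimal Python sketch that draws bulk expression values given fractions, cell-type means and gene variances. The helper names `simulate_bulk` and `expected_bulk`, the toy dimensions and the numeric values are illustrative assumptions, and `sigma[j]` is treated as a variance, as in the text.

```python
import math
import random


def expected_bulk(pi, mu):
    """Noise-free bulk profile: the fraction-weighted average of cell-type means."""
    K, p = len(mu), len(mu[0])
    return [[sum(row[k] * mu[k][j] for k in range(K)) for j in range(p)] for row in pi]


def simulate_bulk(pi, mu, sigma, rng):
    """Draw y[i][j] ~ N(sum_k pi[i][k] * mu[k][j], sigma[j]).

    pi    : n x K list of cell-type fractions (one row per sample)
    mu    : K x p list of cell-type mean expression (one row per cell-type)
    sigma : length-p list of gene-specific variances (assumed to be variances)
    """
    means = expected_bulk(pi, mu)
    return [
        [rng.gauss(m, math.sqrt(sigma[j])) for j, m in enumerate(row)]
        for row in means
    ]


# Toy example with hypothetical values: 2 samples, 2 cell-types, 3 genes.
pi = [[0.7, 0.3], [0.2, 0.8]]
mu = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
sigma = [0.01, 0.01, 0.01]
y = simulate_bulk(pi, mu, sigma, random.Random(0))
```

With only `y` observed, many pairs of fractions and means reproduce the same weighted averages, which is the identifiability problem that the constraints on the mean parameters are meant to resolve.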

### 2.2 Bayesian model based on repulsive prior

Let us assume that for each cell-type $k$, there is a set $I_k$ of genes whose expression is upregulated in the $k$-th cell-type compared to all others. We will use a flexible repulsive prior [9] in order to ensure that genes in set $I_k$ will have a “larger” mean in the $k$-th cell-type compared to other cell-types. Let $\boldsymbol{\mu}_k$ be a $p$-dimensional vector containing the gene expression of $p$ genes in the $k$-th cell-type. Then, $(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K)$ is jointly modeled through a multivariate prior that includes a repulsive function $h(\cdot)$ with parameters $\tau > 0$ and $\eta > 0$. This function is an extension of the repulsive function introduced by Petralia et al. [9]; it approaches zero as the distance between mean parameters goes to zero, and when the upregulation of genes belonging to set $I_s$ in the $s$-th cell-type is not satisfied. According to this function, genes contained in set $I_k$ will have a greater mean value in the $k$-th component than in all other components. It is important to note that only genes contained in the marker sets $I_1, \ldots, I_K$ will be assigned a repulsive prior; other genes will have a standard normal prior. This is sufficient to recover identifiability of the model and will substantially reduce the computational burden. Prior knowledge on markers upregulated in each cell-type can be leveraged from existing databases and single cell RNA data. Instead of requiring a set of markers to be upregulated in one cell-type compared to all other cell-types, the user might specify this requirement for each pair of cell-types. For instance, assume that $I_{s>k}$ is the set of genes upregulated in the $s$-th cell-type compared to the $k$-th cell-type. In this case, the repulsive prior can be easily modified to incorporate this pairwise information.
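The qualitative behaviour required of the repulsive function can be illustrated with a small Python sketch. The exponential form below is an assumption chosen only because it has the two stated properties (it tends to zero as a marker's mean approaches that of another cell-type, and it is zero when the marker is not upregulated); it is not claimed to be the exact function used by BayesDeBulk. The names `pair_term` and `repulsive_h` and the default values of `tau` and `eta` are hypothetical.

```python
import math


def pair_term(diff, tau, eta):
    """Illustrative term exp(-tau * diff**(-eta)) for diff > 0, and 0 otherwise."""
    if diff <= 0.0:
        return 0.0  # upregulation constraint violated
    log_pow = -eta * math.log(diff)  # log of diff**(-eta)
    if log_pow > 700.0:
        return 0.0  # diff is so small that the term underflows to zero
    return math.exp(-tau * math.exp(log_pow))


def repulsive_h(mu, markers, tau=1.0, eta=2.0):
    """Product of pair terms over marker genes.

    mu      : K x p list of cell-type means
    markers : list of K sets; markers[k] holds the gene indices in I_k
    """
    value = 1.0
    for k, genes in enumerate(markers):
        for j in genes:
            for s in range(len(mu)):
                if s != k:
                    value *= pair_term(mu[k][j] - mu[s][j], tau, eta)
    return value


markers = [{0}, {1}]
separated = repulsive_h([[3.0, 0.0], [0.0, 3.0]], markers)   # markers clearly upregulated
violated = repulsive_h([[0.0, 3.0], [3.0, 0.0]], markers)    # markers downregulated
nearly_equal = repulsive_h([[0.01, 0.0], [0.0, 3.0]], markers)
```

Under this assumed form, a pairwise specification such as $I_{s>k}$ would simply restrict the innermost loop to the listed pairs of cell-types.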

To facilitate computation, we will not require the cell fractions $(\pi_{i,1}, \ldots, \pi_{i,K})$ to sum to 1. However, we will require these parameters to be defined on the unit interval [0, 1]. As prior specification, we will use a spike-and-slab prior [4] defined on the unit interval, i.e.,

$$\pi_{i,k} \sim w_k\, N_{[0,1]}(0, 0.0001) + (1 - w_k)\, N_{[0,1]}(0, \gamma_k)$$

with $w_k \sim \mathrm{Beta}(1, 1)$ and $\gamma_k \sim \text{Inverse-Gamma}(a_\gamma, b_\gamma)$. The spike component concentrates its mass at values close to zero, shrinking small effects to zero and therefore inducing sparsity in the cell fraction estimates. The percentage of zero values (i.e., $w_k$) will vary across different cell-types. We expect that some cell-types will be more abundant (i.e., different from zero) than others in a particular tissue. For instance, T cells are more likely to be present in kidney or lung tissues than in brain tissues. For the variance components $\{\sigma_j\}$, standard inverse-gamma priors will be utilized. Figure 1 provides a summary of the proposed model.
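A minimal Python sketch of drawing one cell fraction from the spike-and-slab prior on the unit interval is given below. It assumes that the second argument of $N_{[0,1]}(0, \cdot)$ is a variance, and it uses simple rejection sampling, which is adequate only when the slab variance is moderate; the function names are hypothetical.

```python
import math
import random


def truncated_normal_01(variance, rng):
    """Draw from N(0, variance) truncated to [0, 1] by rejection sampling."""
    sd = math.sqrt(variance)
    while True:
        # By symmetry, |N(0, v)| restricted to [0, 1] has the target distribution.
        x = abs(rng.gauss(0.0, sd))
        if x <= 1.0:
            return x


def draw_fraction(w_k, gamma_k, rng, spike_variance=1e-4):
    """Return (pi, z): z = 1 if drawn from the spike, 0 if drawn from the slab."""
    if rng.random() < w_k:
        return truncated_normal_01(spike_variance, rng), 1
    return truncated_normal_01(gamma_k, rng), 0


rng = random.Random(1)
draws = [draw_fraction(w_k=0.5, gamma_k=0.25, rng=rng) for _ in range(2000)]
```

Draws from the spike are concentrated within a few hundredths of zero, whereas slab draws spread over the whole unit interval, which is how the prior separates absent from present cell-types.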

### 2.3 Full conditionals and posterior computation

Following Petralia et al. [9], a latent variable $\rho$ will be introduced to facilitate the sampling from the repulsive prior. This latent variable will be jointly modeled with $\boldsymbol{\mu}$ through a multivariate density.

A set of additional latent variables $\{Z_{i,k}\}$ will be introduced in order to facilitate the sampling from the spike-and-slab prior placed on $\{\pi_{i,k}\}$. In particular, $Z_{i,k}$ will be equal to 1 if $\pi_{i,k}$ is sampled from the “spike” component, i.e., $\pi_{i,k} \sim N_{[0,1]}(0, 0.0001)$, and equal to 0 if $\pi_{i,k}$ is sampled from the “slab” component, i.e., $\pi_{i,k} \sim N_{[0,1]}(0, \gamma_k)$. Let $1(A)$ be an indicator function equal to 1 if $A$ is satisfied and 0 otherwise. The Gibbs sampler can be summarized in the following steps.

Step 1 Sample the mean parameter $\mu_{k,j}$ from a truncated normal distribution defined in terms of $M_{i,j} = y_{i,j} - \sum_{s \neq k} \mu_{s,j}\pi_{i,s}$ and $S_{k,j}$, the latter being the intersection across all constraints involving $\mu_{k,j}$. This set is defined in section 1 of the supplementary material.

Step 2 Sample $Z_{i,k} = \ell$, with $\ell \in \{0, 1\}$, from its full conditional distribution.

Step 3 Sample $\pi_{i,k}$ from a truncated univariate normal distribution defined in terms of $T_{i,k,j} = y_{i,j} - \sum_{s \neq k} \mu_{s,j}\pi_{i,s}$ and $\eta_k$, with $\eta_k = \gamma_k$ if $\ell = 0$ and $\eta_k = 0.0001$ if $\ell = 1$.

Step 4 Sample $w_k$ from $\mathrm{Beta}\left(1 + \sum_i 1(Z_{i,k} = 1),\; 1 + \sum_i 1(Z_{i,k} = 0)\right)$.

Step 5 Sample $\gamma_k$ from its full conditional distribution.

Step 6 Sample $\sigma_j$ from its full conditional distribution.

Step 7 Sample $\rho$ from a uniform distribution.

Detailed information on how the full conditionals were derived is contained in section 1 of the supplementary material.
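Two of the steps above can be sketched in Python from the prior specification alone. The update of $w_k$ follows the Beta distribution stated in Step 4. The update of $Z_{i,k}$ is a sketch under the assumption that the indicator is drawn given the current value of $\pi_{i,k}$, with probabilities proportional to the weighted spike and slab densities; the exact full conditional is given in the supplementary material and may differ. The function names are hypothetical, and 0.0001 is treated as a variance.

```python
import math
import random


def truncnorm01_density(x, variance):
    """Density at x in [0, 1] of N(0, variance) truncated to [0, 1]."""
    sd = math.sqrt(variance)
    # Mass of N(0, variance) on [0, 1]: Phi(1 / sd) - Phi(0).
    mass = 0.5 * math.erf(1.0 / (sd * math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x / variance) / (sd * math.sqrt(2.0 * math.pi))
    return pdf / mass


def sample_z(pi_ik, w_k, gamma_k, rng, spike_variance=1e-4):
    """Assumed update: P(Z = 1) proportional to w_k * spike density at pi_ik.

    Assumes 0 < w_k < 1 so that the two weights cannot both be zero.
    """
    spike = w_k * truncnorm01_density(pi_ik, spike_variance)
    slab = (1.0 - w_k) * truncnorm01_density(pi_ik, gamma_k)
    return 1 if rng.random() < spike / (spike + slab) else 0


def sample_w(z_column, rng):
    """Step 4: draw w_k from Beta(1 + #{Z_ik = 1}, 1 + #{Z_ik = 0})."""
    ones = sum(z_column)
    return rng.betavariate(1 + ones, 1 + len(z_column) - ones)


rng = random.Random(0)
z_large = sample_z(0.5, w_k=0.5, gamma_k=0.25, rng=rng)   # far from zero: slab
z_zero = [sample_z(0.0, w_k=0.5, gamma_k=0.25, rng=rng) for _ in range(1000)]
w_new = sample_w([1] * 50, rng)
```

A fraction well away from zero is assigned to the slab with near certainty, while a fraction at zero is assigned to the spike most of the time, which in turn pushes $w_k$ upward for cell-types that are rarely present.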

## 3 Synthetic Data

### 3.1 Data Generation

The performance of BayesDeBulk in estimating cell-type fractions and the gene expression in different cells was evaluated based on extensive synthetic data. Let $p$ be the total number of genes, $n$ the total number of samples and $K$ the number of cell-types. Let $I_k$ be the set containing 20 cell-specific markers for the $k$-th cell-type, which were randomly sampled from the full list of genes. The mean of cell-specific markers for a particular cell-type $k$, i.e., $\mu_{k,j}$ with $j \in I_k$, was drawn from a Gaussian distribution with mean uniformly sampled from the range [1, 3] and standard deviation 0.5; while the mean of other markers, i.e., $\mu_{k,j}$ with $j \notin I_k$, was drawn from a Gaussian distribution centered on zero with standard deviation 0.5. The fraction of different cell-types, i.e., $(\pi_{1,i}, \ldots, \pi_{K,i})$, was randomly generated from a Dirichlet distribution with parameter 1. Given these parameters, mixed data for the $i$-th sample was generated as follows:

$$\mathbf{y}_i = \sum_{k=1}^{K} \pi_{k,i}\,\mathbf{V}_{k,i} + \boldsymbol{\epsilon}_i$$

with $\boldsymbol{\epsilon}_i \sim N(0, \nu I)$ and $\mathbf{V}_{k,i} \sim N(\boldsymbol{\mu}_k, \sigma I)$.
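The data-generating procedure of this section can be sketched in Python as follows. Several details are assumptions made for illustration: the mixed sample is taken to be the fraction-weighted sum of the sample-specific cell-type profiles $\mathbf{V}_{k,i}$ plus the noise $\boldsymbol{\epsilon}_i$; $\nu$ and $\sigma$ are treated as variances; the marker sets are sampled independently for each cell-type; and the uniform centre in [1, 3] is drawn separately for each marker gene. The function name `generate_synthetic` is hypothetical.

```python
import math
import random


def generate_synthetic(n, p, K, nu, sigma, rng, n_markers=20):
    """Return (markers, mu, fractions, bulk) for one synthetic dataset."""
    genes = list(range(p))
    markers = [set(rng.sample(genes, n_markers)) for _ in range(K)]

    # Cell-type means: markers centred on a value in [1, 3], other genes on zero.
    mu = []
    for k in range(K):
        row = []
        for j in range(p):
            centre = rng.uniform(1.0, 3.0) if j in markers[k] else 0.0
            row.append(rng.gauss(centre, 0.5))
        mu.append(row)

    fractions, bulk = [], []
    for _ in range(n):
        # Dirichlet(1, ..., 1) draw via normalised Gamma(1, 1) variates.
        g = [rng.gammavariate(1.0, 1.0) for _ in range(K)]
        total = sum(g)
        pi_i = [v / total for v in g]

        y_i = []
        for j in range(p):
            # V_{k,i,j} ~ N(mu_{k,j}, sigma), mixed with weights pi_i, plus noise.
            mixed = sum(
                pi_i[k] * rng.gauss(mu[k][j], math.sqrt(sigma)) for k in range(K)
            )
            y_i.append(mixed + rng.gauss(0.0, math.sqrt(nu)))
        fractions.append(pi_i)
        bulk.append(y_i)
    return markers, mu, fractions, bulk


markers, mu, fractions, bulk = generate_synthetic(
    n=5, p=50, K=3, nu=0.1, sigma=0.1, rng=random.Random(0)
)
```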

### 3.2 Results

BayesDeBulk was compared with Cibersort [8] based on different simulation scenarios with varying numbers of cells and genes, i.e., (*K, p, n*) = (10, 200, 100) and (*K, p, n*) = (20, 400, 100), and variance levels *ν* and *σ*. For each synthetic scenario, 10 replicate datasets were generated and the performance of the two models was evaluated based on two metrics: Pearson’s correlation and mean squared error (MSE) between estimated fractions and true fractions. For each replicate, BayesDeBulk was estimated considering 10000 Markov Chain Monte Carlo (MCMC) iterations, with the estimated fractions being the mean across iterations after discarding a burn-in of 1000. BayesDeBulk was implemented (i) assuming that all cell-specific markers are known a priori (BayesDeBulk 100) and (ii) assuming that only 50% of cell-specific markers are known a priori (BayesDeBulk 50). This second scenario is more representative of real world applications, where only a proportion of cell-specific markers is usually known. Contrary to BayesDeBulk, Cibersort requires as input a signature matrix containing the mean of different markers for different cell-types. In order to make a fair comparison, a perturbed version of the original signature matrix was considered as input in Cibersort-based inference. Specifically, the original signature matrix was perturbed following two approaches: (i) preserving the upregulation of key cell-specific markers (Cibersort 100), and (ii) preserving the upregulation of only 50% of markers. The scatterplot between true and perturbed signature matrices can be found in the supplementary material (section 2.1, Supplementary Figure 1).
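The evaluation described above (posterior mean after burn-in, then Pearson's correlation and MSE against the true fractions) can be sketched in Python as follows. The function names and the toy numbers are illustrative assumptions and do not come from the study.

```python
import math


def posterior_mean(draws, burn_in):
    """Average MCMC draws after discarding the first burn_in iterations.

    draws : list of iterations, each a list of K estimated fractions
    """
    kept = draws[burn_in:]
    return [sum(d[k] for d in kept) / len(kept) for k in range(len(kept[0]))]


def pearson(x, y):
    """Pearson's correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Toy chain with 3 iterations and a burn-in of 1 (hypothetical numbers).
chain = [[0.9, 0.9, 0.9], [0.1, 0.3, 0.6], [0.3, 0.5, 0.8]]
estimate = posterior_mean(chain, burn_in=1)
truth = [0.1, 0.3, 0.6]
r = pearson(estimate, truth)
err = mse(estimate, truth)
```

The two metrics are complementary: correlation is insensitive to a constant shift or rescaling of the estimates, whereas MSE penalizes it, which matters here because the estimated fractions are not constrained to sum to 1.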

As shown in Figure 2, BayesDeBulk resulted in a higher Pearson’s correlation for different synthetic data scenarios. In particular, a median correlation above 0.90 was observed for BayesDeBulk for all simulation scenarios involving *K* = 10 components; while Cibersort resulted in a median correlation lower than 0.70 for higher noise levels. As expected, the performance of both models decreased as more components were incorporated into the model. Overall, we observe that Cibersort is more sensitive to the prior knowledge incorporated in the model; in fact its performance substantially decreases when only 50% of the markers are known a priori. This is due to the lack of flexibility of Cibersort, which requires as input a signature matrix containing the mean levels of different markers for different cells. Indeed, the advantage of our proposed Bayesian framework is the estimation of the expression of different markers for different cell-types. Section 2.2 of supplementary material shows the performance of BayesDeBulk in estimating the mean of gene expression for different components. As expected, higher noise levels result in lower performance in terms of both correlation and MSE (Supplementary Figures 2, 3). The median Pearson’s correlation between estimated and true values across replicates was above 0.80 for the simulations involving 10 cell types; including when only 50% of cell-specific markers are known a priori. Although the median correlation decreases substantially when the number of components increases to *K* = 20, it remains above 0.50 for different simulation scenarios.

## 4 Validation based on flow cytometry

In this section, the performance of BayesDeBulk is compared with Cibersort [8] and xCell [1] based on transcriptomic data from peripheral blood mononuclear cells from 20 adults who received influenza immunization [8]. For inference, BayesDeBulk considered the same set of cell-types used in Cibersort; however, both signatures from Cibersort and xCell were considered as prior information. Detailed information on how cell-type specific markers were identified based on both signatures can be found in Section 3 of the supplementary material. The BayesDeBulk model was estimated considering 3000 MCMC iterations, with the estimated fractions derived as the mean across iterations after discarding a burn-in of 1000. Figure 3 shows the Pearson’s correlation between flow-cytometry estimates and estimates derived via different algorithms. As illustrated, BayesDeBulk outperformed both Cibersort and xCell in the estimation of gamma delta T cells and monocytes. In addition, BayesDeBulk performed better than xCell in the estimation of NK cells and CD8 T cells. For 5 out of 7 cell-types, BayesDeBulk resulted in a correlation higher than 0.5, compared to 6 out of 7 for Cibersort and 4 out of 7 for xCell. xCell resulted in estimates equal to zero for gamma delta T cells and NK cells.

## 5 Conclusion

We introduce BayesDeBulk, a new Bayesian method for the deconvolution of bulk tumor data. BayesDeBulk allows the simultaneous estimation of both cell fractions and gene expression for different cell-types. To perform the deconvolution, BayesDeBulk requires a set of genes expressed in each cell-type, which can be obtained from existing transcriptomic profiles of pure cells. Bulk RNA data is modeled via a Gaussian distribution with mean being the weighted average of expression in different cell-types. Given a list of markers expressed in a particular cell-type, a repulsive prior is placed on the mean of gene expression in different cell-types to ensure that cell-specific markers are upregulated in a particular component. This prior specification facilitates the identification and the labeling of the components contained in the mean parameter, which is a common problem of reference-free methods.

Contrary to reference-based methods, our framework estimates different cell-type fractions and the mean of gene expression in different cell-types from the data, simultaneously. Reference-based algorithms often rely on the assumption that the transcriptomic profiling of different immune/stromal cells in solid tumor is similar to that of the reference data derived from pure cells. Violation of this assumption might lead to poor performance in the estimation of cell fractions. On the other hand, BayesDeBulk does not need to rely on such an assumption since it estimates the transcriptomic profiling of different cells directly from the data.

In addition, the estimation of transcriptomic profiles for different cells is very important for performing differential expression analyses between adjacent normal and tumor tissues while accounting for tumor purity. For example, one problem that researchers encounter when performing differential expression analyses between tumor and adjacent normal tissues is that some immune genes might be detected as differentially expressed simply because of the higher immune infiltration in the tumor. BayesDeBulk can be used to estimate the transcriptomic profile of tumor cells by adding an extra component. The estimated profile of tumor cells might then be used to identify genes differentially expressed between tumor cells specifically and adjacent normal tissues.

Given its flexibility, BayesDeBulk can be utilized to characterize the tumor microenvironment based on other data types such as methylation or proteomic profiling. In addition, the algorithm can be easily utilized for a multi-omic based deconvolution. In this case, each data type can be modeled via a BayesDeBulk model, with different data-specific models sharing the same set of cell fraction parameters. This multi-omic framework would allow the estimation of cell fractions based on multi-omic data as well as multi-omic measurements of different markers across different cell-types.