Reconstructing Spatial Transcriptomics at the Single-cell Resolution with BayesDeep

Spatially resolved transcriptomics (SRT) techniques have revolutionized the characterization of molecular profiles while preserving spatial and morphological context. However, most next-generation sequencing-based SRT techniques are limited to measuring gene expression in a confined array of spots, capturing only a fraction of the spatial domain. Typically, these spots encompass gene expression from a few to hundreds of cells, underscoring a critical need for more detailed, single-cell resolution SRT data to enhance our understanding of biological functions within the tissue context. Addressing this challenge, we introduce BayesDeep, a novel Bayesian hierarchical model that leverages cellular morphological data from histology images, commonly paired with SRT data, to reconstruct SRT data at the single-cell resolution. BayesDeep effectively model count data from SRT studies via a negative binomial regression model. This model incorporates explanatory variables such as cell types and nuclei-shape information for each cell extracted from the paired histology image. A feature selection scheme is integrated to examine the association between the morphological and molecular profiles, thereby improving the model robustness. We applied BayesDeep to two real SRT datasets, successfully demonstrating its capability to reconstruct SRT data at the single-cell resolution. This advancement not only yields new biological insights but also significantly enhances various downstream analyses, such as pseudotime and cell-cell communication.


S1. Simulated Data Generative Model
In our simulation study, we aimed to validate the accuracy of our model.We selected N = 500 adjacent spots from the human breast cancer 10x Visium data (see Figure 2A in the manuscript) with known spot locations T Y .The cell locations T X and the covariate matrix X, including cell types and nuclei shape features, were obtained from the nuclei identification results through HD-staining on the paired histology image.As detailed in Methods Section of the manuscript, in real SRT data, typically only a subset of covariates correlate with gene expression for most genes.To replicate this scenario, we generated coefficients with high sparsity.Specifically, for cell-type-related explanatory variables, we assumed all corresponding coefficients to be non-zero and normally distributed as β lj ∼ N(1, 1).In contrast, for nuclei-shape-related explanatory variables, we produced highly sparse coefficients as follows: Consequently, the coefficient matrix B exhibited 90% sparsity.We then generated singlecell-resolution expression levels in line with our model assumption θ j = exp(Xβ j ).Spotresolution relative expression levels for gene j (i.e., λ j ) were calculated as λ j = (Gθ j )/(G1), where G is derived from spot and cell locations (i.e., T Y and T X ), and the denominator G1 represents the number of cells covered by each spot.To simulate observed spot-resolution gene expression counts, we followed the process outlined in Sun et al. (2020) and Jiang et al. (2022).Specifically, we generated the size factor s i ∼ Unif(0.5, 1.5) for i = 1, . . ., N and dispersion parameter φ j ∼ Exp(0.1) for j = 1, . . ., P , and subsequently, each read count y ij ∼ NB(s i λ ij , φ j ).This data generation process was repeated for P = 300 genes.Finally, with the generated Y and the observed X and G, we examined our model's ability in reconstructing the single-cell-resolution molecular profile Θ.

S2. Details of the MCMC Algorithms
The model parameter space (φ, B, Γ) consists of 1) the NB dispersion parameters φ that accounts for the over-dispersion commonly observed in gene expression count data, 2) the coefficient matrix B that quantifies the relationship between gene expression as measured in SRT data and the morphological features extracted from the paired histology image, and 3) the selection matrix Γ that indicates the significant association in the coefficient matrix B. The full likelihood of the model is as follows: (1) With the prior specifications detailed in Methods Section of the manuscript, we give the full posterior distribution as follows: MCMC algorithm was conducted for the posterior sampling.Specifically, we performed the following steps sequentially at each MCMC iteration after a random initialization.
Note that the proposal density ratio equals to 1 for this random walk Metropolis update on Joint update of covariate coefficient B and the selection indicator Γ: We updated each β lj and γ lj for l = 1, . . ., L and j = 1, . . ., P , sequentially by using an independent Metropolis-Hasting algorithm.A between-model step was implemented first to jointly update β lj and γ lj .We used the add-delete algorithm, where we changed the binary value of γ lj to 1 − γ lj each time.We repeated this step multiple times to ensure validation of the feature selection.

Update of covariate coefficient B:
We performed a within-model step to update each β lj that corresponds to γ lj = 1 to improve MCMC mixing.We first proposed a new β * lj from the normal distribution N(β lj , (σ β /2) 2 ) and then accepted the proposed value with probability min(1, r β ), where .
Note that the proposal density ratio equals to 1 for this random walk Metropolis update on β lj .

S3. Implementation Details of Competing Methods
We adapted the TESLA workflow described in their online tutorial (https://github.com/jianhuupenn/TESLA/blob/main/tutorial/tutorial.md).Specifically, we normalized the gene expression measurements into log scale.For the histology image, we detected tissue contour using cv2.All parameters were kept at the defaults or as the same as in the tutorial.
We then imputed the gene expression on all pixels within the contour.The predicted gene expression at each cell was set to be the imputed gene expression on the pixel where the cell is located.
Gaussian process (GP) was implemented with GaussianProcessRegressor from sklearn.gaussian process module in Python.Before fitting the GP model, the observed read counts y j for gene j were normalized to ỹj = log(y j /s + 1), where s = (s 1 , . . ., s N ) was the estimated size factor of spots and was computed as s i ∝ P j=1 y ij .The covariance function of the GP was specified using the radial basis function (RBF) kernel, with the length scale of the kernel set to be the radius of spots.Additionally, to account for noise, a WhiteKernel component was incorporated into the kernel, allowing for the estimation of the global noise level from the data.The GP model was then trained utilizing the location information at spot resolution as features and the normalized gene expression as response variable.For predicting gene expression at the single-cell resolution, we input the location information of all cells and employed the mean of the output predictive distribution as the predicted gene expression value.
The spot location with each row (t Y i1 , t Y i2 ) being the x and y coordinates of spot i X = [x ml ] M ×L x ml ∈ R The covariate matrix with each entry x ml representing a measurement for explanatory variable l observed for cell m T The cell location with each row (t X i1 , t X i2 ) being the x and The variance parameter of the slab component in the spike-and-slab prior for each β lj πγ (0, 1) The parameter in the Bernoulli prior for each γ lj

Figure S1 :
Figure S1: Geometric representations of a barcoded spot (in red) and its expanded region (in green) on the spatial transcriptomics (ST) and the improved 10x Visium platforms.For ST, the spot diameter is 55µm with a center-to-center distance of 100µm between two adjacent spots.For 10x Visium, these measures are 100µm and 200µm, respectively.The percentage of area covered by ST barcoded spots can be approximated by the ratio of the red circle area to the green square, calculated as π×( 100µm 2 ) 2

Figure S2 :
Figure S2: Overview of model validation for the simulated data: Evaluation results in terms of the root mean square error (RMSE) between the actual and estimated covariate coefficient β lj 's for BayesDeep with and without regularization, respectively.

Figure S3 :
Figure S3: Overview of model validation for the human prostate cancer 10x Visium data, including the validation settings and evaluation results in terms of the Pearson correlation coefficients ρ between the actual and predicted gene expression counts (y ij 's vs. ŷij 's) for BayesDeep, TESLA, and Gaussian Process (GP), respectively.

Figure S4 :
Figure S4: Downstream analysis on the human breast cancer 10x Visium data: A. BayesDeep-generated single-cell-resolution expression of the breast cancer stem cell markers CD44 depicted on the UMAP.B. BayesDeep-generated single-cell-resolution expression of the breast cancer stem cell markers CD24 depicted on the UMAP.C. Pseudotime analysis on the UMAP.

Figure S5 :
Figure S5: Downstream analysis on the human prostate cancer 10x Visium data: A. BayesDeep-generated single-cell-resolution expression of the prostate cancer stem cell markers ITGA6 depicted on the UMAP.B. BayesDeep-generated single-cell-resolution expression of the prostate cancer stem cell markers ALCAM depicted on the UMAP.C. Pseudotime analysis on the UMAP.
spot-cell spatial relationship matrix with g im = 1 indicating that cell m locates within spot i and g im = 0 otherwiseParameters φ = [φ j ] P ×1 φ j ∈ R + The gene-specific negative binomial dispersion parameter Λ = [λ ij ] N ×P λ ij ∈ R +The relative gene expression at the spot resolution with each entry λ ij being the relative gene expression for genej at spot i Θ = [θ mj ] M ×P θ mj ∈ R + The relativegene expression at the single-cell resolution with each entry θ mj being the relative gene expression for gene j at cell m B = [β lj ] L×P β lj ∈ R The covariate coefficient matrix with each entry β lj signifying the association between explanatory variable l and gene j Γ = [γ lj ] L×P γ lj ∈ {0, 1} The selection matrix with each entry γ lj = 1 indicating the corresponding covariate coefficient β lj = 0 while γ lj = 0 indicating β lj = 0 Hyper-a φ {0} ∪ R + The shape parameter in the gamma prior for each φ j parameters b φ {0} ∪ R + The rate parameter in the gamma prior for each φ j σ 2 β R +

Table S2 :
A list of key notations of the proposed BayesDeep.The gene expression measurement table with ench entry y ij being the read count for gene j observed at spot i