SPACE: Spatially variable gene clustering adjusting for cell type effect for improved spatial domain detection

Recent advances in spatial transcriptomics have significantly deepened our understanding of biology. A primary focus has been identifying spatially variable genes (SVGs) which are crucial for downstream tasks like spatial domain detection. Traditional methods often use all or a set number of top SVGs for this purpose. However, in diverse datasets with many SVGs, this approach may not ensure accurate results. Instead, grouping SVGs by expression patterns and using all SVG groups in downstream analysis can improve accuracy. Furthermore, classifying SVGs in this manner is akin to identifying cell type marker genes, offering valuable biological insights. The challenge lies in accurately categorizing SVGs into relevant clusters, aggravated by the absence of prior knowledge regarding the number and spectrum of spatial gene patterns. Addressing this challenge, we propose SPACE, SPatially variable gene clustering Adjusting for Cell type Effect, a framework that classifies SVGs based on their spatial patterns by adjusting for confounding effects caused by shared cell types, to improve spatial domain detection. This method does not require prior knowledge of gene cluster numbers, spatial patterns, or cell type information. Our comprehensive simulations and real data analyses demonstrate that SPACE is an efficient and promising tool for spatial transcriptomics analysis.

dataset.We begin with a filtered gene expression count matrix and a location matrix where spots are annotated.The tissue region contains N spots where gene expression measurements are measured for each of the m genes.
Step 1: Randomly select m 1 genes from the original dataset to be converted into new SVGs in the synthetic dataset.The remaining m − m 1 genes will serve as noise genes with no spatial pattern.
Step 2: For each of the m 1 genes in the original dataset, sort and arrange the gene expression count values according to the reference domain structure.For the remaining genes, sort and arrange the count values randomly in the tissue region. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .S3 The t-SNE plots [1] illustrate the separation of domain-specific SVGs by SPACE across 10 synthetic datasets. . . . . . . . . . . . . . . . . . . . .S4 The performance comparison between A)nnSVG [2] and B)SPACE is conducted for the simulation setup outlined in section 3.1.1 of the main manuscript.The gene groups are: Independent: Uncorrelated Gene group for genes without any spatial pattern(g1-g10), Correlated: Correlated Gene group for genes without any spatial pattern(g11-g20).Pattern 1-3: Correlated Gene group for genes with spatial pattern 1-3 (g21-g30,g31-g40,g41-g50 The Gaussian process (GP) regression model which models the normalized gene expression y for a given gene using the following multivariate normal model: where the covariance term is decomposed into a spatial and a non-spatial part, where δI represents the non-spatial part and σ 2 s K is the spatial covariance matrix, whose (i, j) th element in the kernel matrix K denotes the spatial similarity between the ith and jth spot calculated based on the corresponding coordinates s i and s j .The choice of the kernel function plays a very important role in detecting the spatial correlation present in the gene expressions.X N ×k represents the covariate matrix, while β k×1 denotes the array of corresponding coefficients.
As we mentioned in the main text, testing if a gene is a SVG is equivalent to testing The null hypothesis H 0 : σ 2 s =0 can be tested using the variance-component score test which is the locally most powerful test [3].The variance-covariance score statistic is: where β is the MLEs under the null model.Under the null hypothesis, the score statistic Q follows a mixture of chi-square distributions [4], which can be closely approximated with the computationally efficient Davies' method [5].More details about the test is provided in the next subsection.
In this manuscript, we utilize part of the code provided along with the SKAT paper [4] which uses the same score test for the purpose of rare-variant association testing in genetic data.
Kernel functions A kernel function is defined as a function K : X × X → R, where the kernel matrix K = (k i,i ′ ) n i,i ′ =1 is symmetric and positive semidefinite with k i,i ′ = k(s i , s i ′ ).In this setting, k(s i , s i ′ ) is a measure of similarity between the ith and the i ′ th subject.There are a variety of kernel functions to choose from, and the most simple one is the Linear kernel.The other useful kernels are Polynomial kernel, the Gaussian kernel and the cosine kernel.The functional forms of these kernels are summarized below: , where c,d are the free parameters.

Choices of kernel functions:
We must define the kernel function in order to proceed with the hypothesis testing.As it is unknown which kernel will be best for the test, we employ the score test to evaluate the null hypothesis across various kernel functions with distinct kernel parameters.Gaussian and cosine kernels are typically effective in capturing spatial patterns.Following the method outlined in the SPARK paper [6], we compute five different length scale parameter values for the Gaussian Kernel and five different periodic parameter values for the cosine kernel.We conduct the test across ten different kernels and aggregate the resulting p-values using the Cauchy combination rule [7].

SPACE algorithm
Our model is built upon the Gaussian process model.Thus, we require normalized count matrix data with m rows (genes) and N columns (spots).We also need the spatial location matrix L with N rows, 2 columns (X and Y coordinates of spots).
2).If |S j | = 0, then the jth gene has an unique spatial pattern.If |S j | <= 3, then fit all genes in S j as the covariates in model (2) in the main text.If |S j | > 3, then get the k j PCs of genes in S j and fit them as covariates in model (2) in the main text.
3).Using the model (2) in the main text, conduct a score test to compute the p-value under 10 different kernels following the SPARK idea, then integrate these 10 p-values using the Cauchy combination rule [7] to get the final p-value.The output includes: a) For each SVG j, a list of correlated genes in S j ; and b) A list of unique SVGs as defined in 2.
Step 3: 1).Based on the output in step 2, a weighted graph structure is created where each SVG is a node.Each SVG j in the output list in 3.a) has a common edge with all its dependent genes specified in list S j in the graph.
2).Clusters are determined from the weighted graph structure in 1) using the Leiden community detection algorithm [11].
3).The unique genes in output list 3.b) not connected with other nodes in the graph structure are allocated to the singleton set.

SVG clustering with community detection algorithm:
In many complex networks, nodes cluster and form relatively dense groups-often called communities, where the nodes within each community are more densely connected to each other than to the rest of the network.Such a modular structure is usually not known beforehand, making the detection of communities a challenging problem.One of the bestknown methods for community detection is called modularity [12].This method tries to maximize the difference between the actual number of edges in a community and the expected number of such edges.Community detection algorithms use various methodologies to partition networks into meaningful clusters, revealing insights into the relationships and interactions within the network.Different community detection algorithms like Louvain and Leiden are widely used in single-cell RNA sequencing (scRNA-seq) analysis for clustering single cells, aiding in the identification of distinct cell types and states.
The modularity is defined by: where m is the total number of edges in the network, e c is the actual number of edges in community c, the expected number of edges can be expressed as K 2 c 2m , where K c is the sum of the degrees of the nodes in community c. γ > 0 is a resolution parameter.Higher resolutions lead to more communities, while lower resolutions lead to fewer communities.The Louvain method [13] is one of the most popular community detection algorithms due to its simplicity and effectiveness, typically operating based on modularity optimization.However, in recent years, some drawbacks of the Louvain algorithm have been identified [11], leading to the increased popularity of the Leiden algorithm.The Leiden method is more robust and accurate, making it better suited for analyzing complex networks where high-quality community detection is crucial.
In step 3 of the SPACE algorithm, we used the Leiden algorithm [11], implemented through the "igraph" R package [14], [15] with the default resolution parameter value γ = 1.

Score statistic and its null distribution
The Gaussian process model presented in Model (1) has many unknown parameters, for example δ, the kernel parameter ρ, σ 2 s .The unknown parameters are estimated simultaneously by treating them as variance components in the linear mixed model and estimating them using the Restricted Maximum Likelihood(REML) model.The REML model was used to reduce bias in the variance components model [16][17] [18].
The score statistic can be written in the form of Q( β, δ) − tr(HK) [20], where H = I − X(X T X) −1 X T , β and δ are the MLEs under the null model and Under H 0 : σ 2 s = 0, the score statistic Q reduces to y T H T KHy and where χ 2 i,1 are independent χ 2 1 random variables and The form of the null distribution of the score statistic in (3) follows from this argument [21].Under H 0 , y ∼ N (Xβ, δI), the covariate effects can be removed by projec-tion, i.e., ỹ = Hy.Now ỹ ∼ N (0, δHH T ) and K 1/2 ỹ ∼ N (0, δK 1/2 HH T K 1/2 ).Let U be the matrix whose columns are the eigenvectors of δK 1/2 HH T K 1/2 and the corresponding eigenvalues are λ1 , • • • , λN .Therefore, U T K 1/2 Hy ∼ N (0, diag( λi )) and Using the fact that the eigenvalues of any matrix A, AA T , A T A are the same and H 2 = H, it can be shown that λi = λ i , the eigen values of δH 1/2 KH 1/2 .

Multiplicity correction
At each stage of SPACE, our model is applied to each of the m genes.To control the false discovery rate (FDR), multiplicity correction is required.We employ the Benjamini-Yekutieli (BY) procedure, known for its effectiveness under arbitrary correlation conditions, to obtain adjusted p-values to claim significant genes [22].

Analysis of human breast cancer dataset
We analyzed another spatial transcriptomics dataset of human breast cancer.We obtained the data from the SPARK [6] GitHub repository at https://github.com/xzhoulab/SPARK-Analysis (specifically the 'Layer2_BC_count_matrix-1.tsv' file).The dataset can also be downloaded from Spatial Transcriptomics Research (http://www.spatialtranscriptomicsresearch.org).This dataset contains 14,789 genes measured across 251 spots.Our preprocessing involved filtering out genes with expression levels below 10% across the array spots and retaining spots with a total read count > 10.Following these criteria, our analysis focused on a final set of 5,262 genes observed across 250 spots within the breast cancer dataset.
Using our framework SPACE, we identified 724 spatially relevant genes in the initial step.In comparison, SPARK [6] detected 290 genes, SPARK-G detected 244 genes, nnSVG [2] detected 592 genes, and SPARK-X [23] detected 901 genes.The 724 genes were subsequently classified into three main clusters(containing 369, 320, and 14 genes respectively) in the second step of SPACE framework, along with a few unique singleton genes.The representative genes from each cluster are visualized in Figure S1.Genes in cluster 1 exhibited higher expression levels in the lower tissue region, while cluster 2 genes were overexpressed in the middle tissue region, and cluster 3 genes highlighted the upper tissue region.Additionally, the unique genes showcased distinct expression patterns in the third row.This figure confirms the superior performance of our SPACE method.We begin with a filtered gene expression count matrix and a location matrix where spots are annotated.The tissue region contains N spots where gene expression measurements are measured for each of the m genes.
Step 1: Randomly select m 1 genes from the original dataset to be converted into new SVGs in the synthetic dataset.The remaining m − m 1 genes will serve as noise genes with no spatial pattern.Step 2: For each of the m 1 genes in the original dataset, sort and arrange the gene expression count values according to the reference domain structure.For the remaining genes, sort and arrange the count values randomly in the tissue region.
Figure S3: The t-SNE plots [1] illustrate the separation of domain-specific SVGs by SPACE across 10 synthetic datasets.
Figure S4: The performance comparison between A)nnSVG [2] and B)SPACE is conducted for the simulation setup outlined in section 3.1.1 of the main manuscript.The gene groups are: Independent: Uncorrelated Gene group for genes without any spatial pattern(g1-g10), Correlated: Correlated Gene group for genes without any spatial pattern(g11-g20).Pattern 1-3: Correlated Gene group for genes with spatial pattern 1-3 (g21-g30,g31-g40,g41-g50). Pattern 4-6: Single gene with spatial pattern 4-6(g51-g53) The spatial pattern strength intensifies within each spatial gene group(pattern1-pattern3) as indicated by the triangles between the plots.The empirical power of the SVG detection step is evaluated for simulated datasets with AR(1) (Left) and compound symmetry (CS) (Right) correlation structures within the gene groups.

1 : 1
Detect SVGs based on model (1) defined in the main text.Suppose there are m SVGs and denote the SVG list as S y with |S y | = m 1 where | • | denotes the cardinality of a set.

Figure S1 :
Figure S1: The representative spatial genes in the human breast cancer dataset from SVG clusters detected by SPACE.

Figure S2 :
Figure S2: Steps to create a synthetic dataset from an annotated original Spatial dataset.We begin with a filtered gene expression count matrix and a location matrix where spots are annotated.The tissue region contains N spots where gene expression measurements are measured for each of the m genes.Step 1: Randomly select m 1 genes from the original dataset to be converted into new SVGs in the synthetic dataset.The remaining m − m 1 genes will serve as noise genes with no spatial pattern.Step 2: For each of the m 1 genes in the original dataset, sort and arrange the gene expression count values according to the reference domain structure.For the remaining genes, sort and arrange the count values randomly in the tissue region.

Figure S5 :
FigureS5: The SVG-clusters detected from the DLPFC data using SPACE distinctly highlight the disparity between genes in the two main clusters.The representative genes in the second cluster demonstrate overexpression in the white matter region, while those in the first cluster exhibit overexpression in the other six cortex layers.Additionally, the three unique genes in the final row display a slightly different pattern compared to the genes in the main clusters.

Figure S6 :
Figure S6: Annotated and predicted spatial domains for all 12 DLPFC samples using SPACE, alongside ARI scores for each sample indicating prediction accuracy.

Figure
Figure S7: t-SNE plots illustrating genes from all 12 DLPFC samples, with colors representing SVG cluster labels detected by SPACE.Across the majority of samples, distinct clustering is observed, indicating accurate separation of genes within different clusters.

Figure S8 :
Figure S8: Analysis of PDAC Data using SPACE unveils three primary SVG clusters.Representative genes from each cluster (Cluster 1, Cluster 2, and Cluster 3) are showcased.The genes from these distinct clusters exhibit overexpression in three different regions.

Figure S9 :
Figure S9: Analysis of PDAC data: Pathway enrichment analysis of genes from A) list of all SVGs, B) list of top 3000 genes

5
Analysis of human breast cancer dataset List of Figures S1 The representative spatial genes in the human breast cancer dataset from SVG clusters detected by SPACE. . . . . . . . . . . . . . . . . . . . . . .S2 Steps to create a synthetic dataset from an annotated original Spatial S5 The SVG-clusters detected from the DLPFC data using SPACE distinctly highlight the disparity between genes in the two main clusters.The representative genes in the second cluster demonstrate overexpression in the white matter region, while those in the first cluster exhibit overexpression in the other six cortex layers.Additionally, the three unique genes in the Annotated and predicted spatial domains for all 12 DLPFC samples using SPACE, alongside ARI scores for each sample indicating prediction accuracy.13 S7 t-SNE plots illustrating genes from all 12 DLPFC samples, with colors representing SVG cluster labels detected by SPACE.