Points2Regions: Fast, interactive clustering of imaging-based spatial transcriptomics data

Imaging-based spatial transcriptomics techniques generate image data that, once processed, results in a set of spatial points with categorical labels for different mRNA species. A crucial part of analyzing downstream data involves the analysis of these point patterns. Here, biologically interesting patterns can be explored at different spatial scales. Molecular patterns on a cellular level would correspond to cell types, whereas patterns on a millimeter scale would correspond to tissue-level structures. Often, clustering methods are employed to identify and segment regions with distinct point-patterns. Traditional clustering techniques for such data are constrained by reliance on complementary data or extensive machine learning, limiting their applicability to tasks on a particular scale. This paper introduces ‘Points2Regions’, a practical tool for clustering spatial points with categorical labels. Its flexible and computationally efficient clustering approach enables pattern discovery across multiple scales, making it a powerful tool for exploratory analysis. Points2Regions has demonstrated efficient performance in various datasets, adeptly defining biologically relevant regions similar to those found by scale-specific methods. As a Python package integrated into TissUUmaps and a Napari plugin, it offers interactive clustering and visualization, significantly enhancing user experience in data exploration. In essence, Points2Regions presents a user-friendly and simple tool for exploratory analysis of spatial points with categorical labels.


Introduction
Spatial omics is a rapidly growing field, and several spatial omics techniques are derived from imaging methods, such as multiplexed immunohistochemical staining, cyclic immune fluorescence staining, image-mass cytometry (IMC), and imaging-based spatial transcriptomics (IST) [1].These omics techniques enable researchers to study the spatial cellular and molecular architecture in tissue, which is crucial for furthering our understanding of biological processes and diseases.
The data obtained from these imaging techniques are primarily in the form of images.However, processing these images makes it common to end up with a point-like output, where each point has a spatial position and categorical label.For example, an IMC experiment would generate high-plex images with protein expression.Analyzing these images makes it possible to identify points indicating the location of different cell types.Likewise, an IST experiment would produce images with fluorescent spots that combinatorically label mRNA molecules.Through decoding, one ends up with points indicating the location of different mRNA types.
Regardless of the type of label, biological regions of interest are usually identified not just by a single point but by a mixture of different types of points in a region.In IST, in particular, groups of mRNA molecules co-localized within a cell can reveal the cell's function.Likewise, groups mRNAs in a larger 'niche' can reveal biological functions on an organ level.A common procedure for identifying these regions is to cluster spatial neighborhoods based on their composition of labels.The size of these neighborhoods depends on the research question but typically ranges from the subcellular scale up to the scale of the tissue composition or even the organ level.This type of spatial clustering has been widely used within IST, where it is common to cluster spatial neighborhoods corresponding to segmented cells based on their composition of mRNA molecules into clusters corresponding to cell types [2][3][4][4][5][6].
Once the cell types are defined, a second set of niche identification methods comes into play.These methods focus on finding niches by clustering spatial neighborhoods based on their composition of cell types (and/or mRNA molecules) [7][8][9][10][11][12][13]. Here, the neighborhoods are large enough to cover multiple cells.This niche identification type of clustering is relevant to other omics techniques as February 12, 2024 3/25 well.For example, Schurch et al. [14] and Goltsev et al. [15] used an immunofluorescence technique to identify points corresponding to different cell types, and by computing the composition of points in neighborhoods, they identified regions corresponding to cell population niches.
While these cell typing and niche identification methods are powerful, they may face constraints when applied beyond their original context, as they make assumptions about the availability and quality of complementary data, such as cell segmentation [2][3][4][5][6][7][8][9] or single-cell RNA sequencing data [2], thereby restricting their applicability across domains.
An alternative set of methods are cell segmentation-free methods [16][17][18][19][20][21][22].Here, the spatial neighborhoods that are to be clustered are not based on cell segmentation, but instead based on the underlying distribution of points.These methods typically do not rely on complementary data and can potentially be applied in a broader context.A great example is SSAM [17,18], which was used for identifying structures corresponding to both cell types and niches.
A common trait among all these clustering methods is that they are oriented towards exploratory rather than confirmatory analysis.The aim is to uncover hidden structures and define categories for hypothesis generation rather than quantifying already known patterns.However, for practical exploratory analysis, a crucial criterion is flexibility regarding the ability to capture spatial relations at various scales.The discovery of patterns across various spatial resolutions would benefit from a method that is easy to tune and has interpretable parameters that yield biologically relevant results without lengthy optimization.Moreover, the ability to explore data within its spatial context is essential.Given that biological patterns exist at various scales, static visualization often falls short, emphasizing the necessity of interactive solutions.
In spatial omics analysis, we identified a crucial gap -the lack of a versatile, interactive, and simple tool capable of efficiently clustering points across multiple resolutions without relying on additional data or extensive machine learning.Inspired by cell segmentation-free techniques [17][18][19], we realized that clustering of label compositions within spatial bins could efficiently reveal biologically relevant regions with very low constraints on both memory and computing.
We present Points2Regions, a tool for clustering spatial categorical biomarkers February 12, 2024 (points) that work for multiple resolutions and handle varying quantities of points.
Points2Regions does not make any assumption regarding the availability of complementary data, making it accessible to a wide range of applications.
Understanding the significance of also visualizing data within its spatial context and capitalizing on the computational efficiency of Points2Regions, we extended its functionality by adding it as an interactive plugin in TissUUmaps [23], a software for visualizing large quantities of points on high-resolution images.Furthermore, we have also made Points2eEgions available as a plugin in Napari, an interactive, multi-dimensional image viewer for Python [24]. Points2Regions

Efficient feature extraction
The idea behind Points2Regions is to aggregate spatial biomarkers with categorical labels (points) in local neighborhoods.The composition of different types of points in a neighborhood, represented as a compositional vector (CV), serves as a feature used for clustering the neighborhoods into regions with similar point compositions.Depending on the application, the user should be able to choose different spatial scales.For example, identifying cell type niches requires aggregating cell type points over hundreds of micrometers, while subcellular mRNA patterns are visible by aggregating molecules on a micrometer scale.To make the extraction of CVs fast and memory efficient, we employ a hierarchical interpolation approach.
First, each point is assigned to a series of spatial bins of varying sizes (representing varying spatial resolutions).We let y (l) j ∈ R N denote the count of the N different point-types within the j-th bin of the l-th resolution level.Each bin's center position is j .Level l = 0 refers to the highest resolution (smallest bin sizes), and l = L refers to the lowest resolution (largest bin sizes).The grid size, ∆ (l) , increases linearly between the highest and lowest resolution.The CVs are computed at query locations denoted as p ′ j by interpolating the point abundances between the bins using the inverse-distance weighted interpolation: where and We evaluate Eq. 1 on a rigid grid whose nodes are separated by a distance equal to the width of the bins of the lowest resolution level.
The efficiency of Eq. 1 lies in the swift binning of points onto a grid using the hash February 12, 2024 6/25 function h(p) = ⌊ p ∆ (l) ⌋ that immediately links a position p to a bin index.Moreover, the matrices in Eq. 1 are sparse, allowing for easy storage and multiplication using sparse matrix data structures.Supplementary Fig. 1 shows a 1D example of the CV extraction.
We compute the CVs according to Eq 1 before clustering them using mini-batch k-means [25].

Exploring Synergy in Clustering Methods
While k-means clustering is memory-efficient and is generally fast, its limitation is the underlying assumption that clusters are spherical with roughly equal variance.This assumption is not necessarily true.For example, the abundance of a particular biomolecule may vary between different cell types.
Orthogonal to k-means clustering is agglomerative (hierarchical) clustering, which can be applied to various data types without assuming its distribution.Hierarchical clustering, however, tends to be slow when the number of observations is large.
Hypothetically, by over-clustering with k-means and then merging with hierarchical clustering, we can leverage both the speed of k-means and the flexibility of hierarchical clustering, thereby improving the clustering.
There are, however, conditional parameters related to hierarchical clustering, like the linkage metric for comparing clusters, the stopping criterion that determines when to stop merging the clusters, and the linkage method for comparing newly merged clusters.
We study a simulated dataset with a known ground truth to empirically determine the best parameters.We also study two additional conditional parameters related to pre-processing: The choice of normalization norm and whether to use log transformation.

Simulated data for model evaluation
To quantify the effect of different conditional hyperparameters, we simulate two datasets.
First, we consider the simulated osmFISH dataset used in the clustering benchmarking study by Cheng et al. [27].Here, ground-truth labels corresponding to layers in a mouse brain somatosensory cortex were assigned to individual cells whose location and mRNA content are based on information from the real data [28].
Since segmentation-free methods operate on points for individual mRNA species instead of count-per-cell vectors, we randomly place each mRNA point within radii from the cell center.Cell radii varied randomly between 5 [um] and 20 [um] between different cells.
Next, we use pciSeq, a probabilistic method for assessing IST data, to simulate an in situ sequencing dataset.The pciSeq method attempts to assign points for different mRNA species to detected cells and define the cell type by utilizing complementary single-cell RNA sequencing data.The assignment of molecules to cells and the matching of cells to cell types are quantified as probabilities.By setting the probability threshold for an mRNA belonging to a cell to 0.85 and the probability threshold for a cell matching a cell type to 0.85, we identify cleanly segmented and typed cells in the data.
These cells should have mRNA compositions that align well with single-cell RNA sequencing data.We use these cells to identify the expected molecular composition for each cell type.As a final step, we go through all cells segmented and typed by Qian et al [2].If a cell was typed with a probability greater than 0.85, we simulate a new distribution of mRNA molecules using the expected molecular composition of that cell.
This process enables us to create a dataset with clean cell types, free from outliers resulting from inadequate segmentation.This dataset was generated from a mouse brain coronal section (hippocampal CA1 region).There were 14 sections in total.For each section, we used both the left and right hippocampal CA1 region, resulting in 28 replicated ISS datasets.

Selecting optimal conditional parameters
The simulated datasets offer a well-defined ground truth, which allows us to explore the impact of different conditional parameters used in the method.These parameters are (i) the type of normalization norm of CV, (ii) whether to use log-transformation or not, (iii) whether to use the hierarchical merging or not, (iv) which linkage metric to use for comparing clusters, (v) the linkage method for calculating the distance between the newly merged clusters, (vi) and the stopping criterion that determines when to stop the hierarchical merging of clusters.These conditional parameters are further detailed in Supplementary Text. 2.
To assess performance, we use the adjusted rand-index (ARI) and Optuna [31], a stochastic hyper-parameter tuning package.We run 50 Optuna parameter searches (each with 100 iterations and different seeds) and search for the parameter configuration that results in the highest ARI on the simulated osmFISH and ISS dataset.For the ISS dataset, we chose the replica with the most points.The relative frequency of Optuna searches ending on a particular parameter indicates that parameter's score.We choose the best-scoring parameters in the final implementation.

Comparing against prior art
After deciding on the best parameter configuration, we compare Points2Regions against established methods.In the simulated osmFISH dataset, cells are labeled by niches, while in the ISS dataset, individual cells are labeled by type.Therefore, we compare Points2Regions with both niche identification methods [4,8] and cell typing methods [4,21].We optimize hyperparameters using Optuna [31] for all methods, except for [21], where parameters were manually tuned due to lengthy optimization.Additional details are provided in Supplementary Text 1.

Experiments on real data
We extend our analysis to four real datasets: a MERFISH dataset by [29], osmFISH dataset [28], Xenium dataset [30], and a MERFISH subcellular dataset [11], and compare Points2Regions against published work.Statistical information for these datasets is available in Table 1.

Results
Parameter optimization highlights the importance of normalization, log-transformation, and cluster merging First, we ran several parameter searches with Optuna to choose the best conditional parameters.Fig 1A shows the parameter score, that is, the relative frequencies of parameter searches selecting a particular parameter.We noted that almost all models converged to a parameter configuration using log transform, normalization, and hierarchical cluster merging.However, we did not notice a significant difference between which normalization norm to use.We also noted that most of the parameter searches landed in a parameter configuration, including the merging of clusters as post-processing.
Based on this analysis, we deterministically set Points2Regions' linkage metric to cosine distance, its linkage method to complete, and the stopping criterion to maxclust.As described in Supplementary Text 2, this stopping criteria means merging clusters until a given number of maximum clusters are obtained.The maximum number of clusters is decided by the user.The number of k-means clusters is set to 3/2 times this value.
However, a drawback of the "counting points over spatial distances" approach is its insensitivity to edges.Sharp boundaries in spatial biology often separate different regions with distinct point compositions.Counting points over too large a distance may mix points from two regions, leading to potential misinterpretation.Exploring edge-preserving aggregation functions inspired by filters in image analysis, such as bilateral filtering [38] or image restoration via graph cuts [39], could address this issue.
To conclude, Points2Regions provides an efficient interpretation of large and diverse spatial omics datasets.When combined with interactive visualization via TissUUmaps or Napari, Points2Regions can increase our understanding of biological systems at various spatial resolutions.

February
Apart from the conditional parameters mentioned above, Points2Regions comes with four tunable hyperparameters: Number of clusters.Determines the number of clusters.It is problem-specific, but several established methods can guide the user [26].Bin size.The width of the lowest resolution bins is also the stride between February 12, 2024 7/25 query locations for computing CVs.It is a trade-off between spatial resolution and computational speed.A larger bin size yields faster clustering but may result in pixelated regions.Bin smoothing.Governs the spatial distance over which points are aggregated to a query location.The width of the lowest resolution bins is set to this value times the pixel size.The value of this parameter is problem-specific.However, it needs to be large enough to ensure a sufficient number of points are included in the CVs so that a co-varying structure can be observed among them.This can be determined by (i) computing the CVs by incrementally varying the pixel smoothing sizes, (ii) conducting principal component analysis with a fixed number of components, and (iii) tracking the explained variance as a function of smoothing distance.The knee-point on this curve indicates when co-varying structures start to appear, and a sufficiently large smoothing distance is used.Minimum number of markers per bin.This is a simple density threshold parameter used for discarding query locations for CVs where the number of markers per lowest-resolution pixels is less than this value.

FebruaryFig 1 .
Fig 1. Points2Regions evaluated on simulated data.(A) The score for different conditional parameters on simulated data highlights the significance of merging clusters and log transforming CVs.(B-C) Adjusted rand-index results and corresponding wall-clock times for various methods are presented, highlighting the speed and accuracy of Points2Regions.
Fig 1B-C shows the ARI and wall-clock time for the respective methods on the simulated datasets.Points2Region performs similarly

Fig 2 .FebruaryFig 3 .Fig 4 .
Fig 2. Points2Regions breaks down two large spatial transcriptomics datasets.(A) Comparison of clusters defined by Points2Regions and SpaGCN on the MERFISH dataset.Segmented cells are grouped into niches using SpaGCN, and regions are defined by Points2Region without any cell-segmentation.The green and orange arrows show the niches identified by SpaGCN that did not correspond well with any Points2Reions niches.(B) The distribution of Points2Region clusters for each SpaGCN cluster.Despite Points2Regions operating directly on transcripts without cell segmentation, both methods reveal similar clusters.(C) Points2Regions clusters on osmFISH data are shown with original clusters by Codeluppi et al. [28] and Spage2Vec [21].Gray points represent excluded points in the clustering.(D) Distribution of Points2Regions clusters for each Codeluppi cluster.(F-G) Comparison of identified niches using Points2Regions (large smoothing distance) and SpaGCN.