Squidpy: a scalable framework for spatial single cell analysis

Spatial omics data are advancing the study of tissue organization and cellular communication at an unprecedented scale. Here, we present Squidpy, a Python framework that brings together tools from omics and image analysis to enable scalable description of spatial molecular data, such as transcriptome or multivariate protein data. Squidpy provides both infrastructure and numerous analysis methods that allow users to efficiently store, manipulate and interactively visualize spatial omics data.

Online methods

Infrastructure
Spatial graph The spatial graph is a graph of spatial neighbors with cells (or spots in the case of Visium) as nodes and neighborhood relations between spots as edges. We use the spatial coordinates of spots to identify neighbors among them. Different approaches to defining a neighborhood relation among spots are used for different types of spatial datasets.
Visium spatial datasets have a hexagonal grid of spots, i.e. each spot has up to six spots situated around it. For this type of spatial dataset the parameter n_rings should be used. It specifies, for each spot, how many hexagonal rings of spots around it will be considered neighbors.
sq.gr.spatial_neighbors(adata, coord_type="visium", n_rings=<int>)
For other types of spatial datasets, neighbors can be defined as the closest spots in terms of Euclidean distance between their coordinates. For a fixed number of closest spots per spot, the function leverages the k-nearest neighbors search from Scikit-learn 1 and the parameter n_neigh must be used.
sq.gr.spatial_neighbors(adata, coord_type="generic", n_neigh=<int>)
In order to get all spots within a specified radius (in units of the spatial coordinates) of each spot as neighbors, the parameter radius should be used.
sq.gr.spatial_neighbors(adata, coord_type="generic", radius=<float>)
The function builds a spatial graph and saves its adjacency and weighted adjacency matrices to adata.obsp['spatial_connectivities'] as either NumPy 2 or SciPy sparse arrays 3 . The weights of the weighted adjacency matrix are distances in the case of coord_type="generic" and ordinal numbers of hexagonal rings in the case of coord_type="visium". Together with the connectivities, we also provide a sparse adjacency matrix of distances, saved in adata.obsp['spatial_distances']. We also provide spectral and cosine transformations of the adjacency matrix for use in graph convolutional networks 4 .
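A minimal usage sketch, assuming an AnnData object adata with spatial coordinates in adata.obsm["spatial"]; the transform keyword used for the spectral/cosine variants is an assumption about the parameter name.
import squidpy as sq

# hexagonal Visium graph with one ring of neighbors per spot
sq.gr.spatial_neighbors(adata, coord_type="visium", n_rings=1)
print(adata.obsp["spatial_connectivities"].shape)  # sparse adjacency matrix
print(adata.obsp["spatial_distances"].nnz)         # sparse matrix of neighbor distances

# generic KNN graph with a transformed adjacency for graph convolutional networks
# (the transform keyword name is an assumption)
sq.gr.spatial_neighbors(adata, coord_type="generic", n_neigh=6, transform="spectral")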

Image Container
The Image Container is an object for microscopy tissue images associated with spatial molecular datasets. The object is a thin wrapper around an xarray.Dataset 5 and provides efficient access to in-memory and on-disk images. On-disk files are loaded lazily using dask 6 through rasterio 7 , meaning content is only read into memory when requested. The object can be saved as a zarr store 8 . This allows handling very large files that do not fit in memory.
The Image Container is initialised with an in-memory array or a path to an image file on disk. Images are stored under the key given by the layer parameter. If lazy loading is desired, the chunks parameter needs to be specified.
sq.im.ImageContainer(PATH, layer=<str>, chunks=<int>)
Additional image layers with the same spatial dimensions x and y, such as segmentation masks, can be added to an existing Image Container.
img.add_img(PATH, layer_added=<str>)
The Image Container is able to interface with AnnData objects, in order to relate any pixel-level information to the observations stored in AnnData (e.g. cells, spots etc.). For instance, it is possible to create a generator that yields image crops on-the-fly corresponding to the locations of the spots in the image (see the usage sketch below):
spot_generator = img.generate_spot_crops(adata)  # yields crops at spot locations
This works both for features computed at crop level and for features computed at segmentation-object level. For instance, it is possible to get centroid coordinates as well as several features of the segmentation objects that overlap with the spot capture area.
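A minimal sketch of how spot crops could be consumed, assuming img and adata as above and the default image layer name "image"; the mean-intensity feature is only an illustrative placeholder.
import numpy as np

# iterate over crops centered on each spot and compute an illustrative per-crop feature
mean_intensity = []
for crop in img.generate_spot_crops(adata):
    arr = np.asarray(crop["image"])   # each crop is an ImageContainer; "image" is the assumed layer name
    mean_intensity.append(arr.mean())

# store the per-spot feature alongside the other observations
adata.obs["crop_mean_intensity"] = mean_intensity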
Napari for interactive visualization Napari is a fast, interactive, multi-dimensional image viewer in Python 9 . In Squidpy, it is possible to visualize the source image together with any AnnData annotation in Napari. Such functionality is useful for fast and interactive exploration of analysis results saved in AnnData together with the high-resolution image. Furthermore, leveraging Napari functionality, it is possible to manually annotate tissue areas and assign the underlying spots to annotations saved in the AnnData object. The ability to relate manually defined tissue areas to observations in AnnData is particularly useful in settings where a pathologist annotation is available and needs to be integrated with analysis at the gene expression or image level. All the steps described here are done in Python and are therefore available in the same environment where the analysis is performed (no additional tool is required).
img = sq.im.ImageContainer(PATH, layer=<str>)
img.interactive(adata)

Graph and spatial pattern analysis
Neighborhood enrichment test The association between label pairs in the connectivity graph is estimated by counting the number of nodes that belong to classes i and j (e.g. cluster annotation) and are proximal to each other, denoted $x_{ij}$. To estimate the deviation of this number from a random configuration of cluster labels on the same connectivity graph, we scramble the cluster labels while maintaining the connectivities, and recount the number of proximal nodes in each iteration (1,000 times by default). Using these estimates, we calculate expected means ($\mu_{ij}$) and standard deviations ($\sigma_{ij}$) for each pair, and a Z-score as

$z_{ij} = \frac{x_{ij} - \mu_{ij}}{\sigma_{ij}}$

The Z-score indicates whether a cluster pair is enriched or depleted for node-node interactions in the connectivity graph. This approach was first described (to the best of our knowledge) by Schapiro et al. 10 . The analysis and visualization can be performed with the code shown below.
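A minimal sketch of such a call, assuming the sq.gr.nhood_enrichment and sq.pl.nhood_enrichment interface with a cluster_key argument:
import squidpy as sq

# permutation-based neighborhood enrichment between cluster labels
sq.gr.nhood_enrichment(adata, cluster_key="<cluster_key>", n_perms=1000, seed=0)

# heatmap of Z-scores for all cluster pairs
sq.pl.nhood_enrichment(adata, cluster_key="<cluster_key>")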

Ligand-receptor interaction analysis
We provide a re-implementation of the popular CellPhoneDB method for ligand-receptor interaction analysis 12 . In short, it is a permutation-based test of ligand-receptor expression across cell-type combinations. Given a list of annotated ligand-receptor pairs, the test computes the mean expression of the two molecules (ligand, receptor) between cell types, and builds a null distribution based on n permutations (default 1,000). A p-value is computed as the proportion of permuted means that are at least as large as the true mean. In CellPhoneDB, if a receptor or ligand is composed of several subunits, the minimum expression is considered for the test. In our implementation, we also include the option of taking the mean expression of all molecules in the complex. Our implementation also employs Omnipath 13 as ligand-receptor interaction annotation, a larger database that contains the original CellPhoneDB database together with five other resources (see Turei et al. 13 ). Finally, our implementation leverages just-in-time compilation with Numba 11 to reduce computation time (see Supplementary Figure 1).
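A minimal sketch of how the re-implementation can be invoked, assuming the sq.gr.ligrec and sq.pl.ligrec interface; the complex_policy and alpha keywords reflect the behaviour described above but their exact names are assumptions.
import squidpy as sq

# permutation test of ligand-receptor means across cell-type combinations
res = sq.gr.ligrec(
    adata,
    cluster_key="<cluster_key>",
    n_perms=1000,            # size of the null distribution
    complex_policy="min",    # CellPhoneDB behaviour: use the minimum over complex subunits
    copy=True,
)

# dot plot of means and p-values for the tested cluster pairs
sq.pl.ligrec(res, alpha=0.05)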
Ripley's K function is a spatial analysis method used to describe whether points with discrete annotations in space follow random, dispersed or clustered patterns. Ripley's K function can be used to describe the spatial patterning of cell clusters in the area of interest. Ripley's K function is defined as

$K(t) = \frac{1}{A\,n} \sum_{i \neq j} I(d_{i,j} < t)\, w_{i,j}$

where $I(d_{i,j} < t)$ is the indicator function, which is 1 or 0 depending on whether the (Euclidean) distance $d_{i,j}$ is below the search radius t, A is the average density of points in the area of interest and $w_{i,j}$ is the edge-effect correction (see the Astropy implementation for details on this term 14 ).
sq.gr.ripley_k(adata, cluster_key="<cluster_key>")
sq.pl.ripley_k(adata, cluster_key="<cluster_key>")
The cluster co-occurrence ratio provides a score on the co-occurrence of clusters of interest across spatial dimensions. It is defined as

$\frac{P(\mathrm{cluster} \mid \mathrm{cluster}_{cond})}{P(\mathrm{cluster})}$

where $\mathrm{cluster}_{cond}$ is the annotation of interest used as conditioning for the co-occurrence of all clusters. It is computed across n radii of size d across the tissue area. It was inspired by an analysis performed by Tosti et al. to investigate tissue organization in the human pancreas with spatial transcriptomics 15 .
sq.gr.co_occurrence(adata, cluster_key="<cluster_key>")
sq.pl.co_occurrence(adata, cluster_key="<cluster_key>")
Global Moran's I is a spatial autocorrelation statistic, widely used in spatial data analysis. Given a feature (gene) and the spatial location of observations, it evaluates whether the pattern expressed is clustered, dispersed, or random 16 . It is defined as

$I = \frac{n}{\sum_{i}\sum_{j} w_{i,j}} \, \frac{\sum_{i}\sum_{j} w_{i,j}\, z_i z_j}{\sum_{i} z_i^2}$

where $z_i$ is the deviation of the feature from the mean ($x_i - \bar{X}$), $w_{i,j}$ is the spatial weight between observations, and n is the number of spatial units. We provide a wrapper for the global Moran's I statistic implemented in libpysal 17 . Test statistics and p-values (computed from a permutation-based test and further FDR corrected) are stored in adata.uns["moranI"].
sq.gr.moran(adata, cluster_key="<cluster_key>")
Centrality scores provide a numerical analysis of node patterns in the graph, which helps to better understand complex dependencies in large graphs. A centrality is a function C which assigns every vertex v in the graph a numeric value $C(v) \in \mathbb{R}$. It therefore gives a ranking of the single components (i.e. cells) in the graph, which makes it easier to identify key individuals. Group centrality measures have been introduced by Everett and Borgatti 18 . They provide a framework to assess clusters of cells in the graph, i.e. whether a specific cell type is more central or more connected in the graph than others. Let G = (V, E) be a graph with nodes V and edges E. Additionally, let S be a group of nodes allocated to the same cluster $c_S$. Then N(S) defines the neighbourhood of all nodes in S. The following four (group) centrality measures are implemented (see the usage sketch below).
Group degree centrality is defined by the fraction of non-group members that are connected to group members,

$C_{deg}(S) = \frac{|N(S) \setminus S|}{|V \setminus S|}$

Larger values indicate a more central cluster. Group degree centrality can help to identify essential clusters or cell types in the graph.
Group closeness centrality measures how close the cluster is to other nodes in the graph and is calculated as the number of non-group members divided by the sum of all distances from the cluster to all vertices outside the cluster,

$C_{close}(S) = \frac{|V \setminus S|}{\sum_{v \in V \setminus S} d_{S,v}}$

where $d_{S,v} = \min_{u \in S} d_{u,v}$ is the minimal distance of the group S from v. Hence, larger values indicate a greater centrality.
Group betweenness centrality measures the proportion of shortest paths connecting pairs of non-group members that pass through the group. Let S be a subset of a graph with vertex set $V_S$. Let $g_{u,v}$ be the number of shortest paths connecting u to v and $g_{u,v}(S)$ be the number of shortest paths connecting u to v passing through S. The group betweenness centrality is then given by

$C_{btw}(S) = \sum_{u < v,\; u,v \notin S} \frac{g_{u,v}(S)}{g_{u,v}}$

The properties of this centrality score are fundamentally different from degree and closeness centrality scores, hence results often differ.
The last measure described is the average clustering coefficient. It describes how well nodes in a graph tend to cluster together. Let n be the number of nodes in S. Then the average clustering coefficient is given by

$C_{clust}(S) = \frac{1}{n} \sum_{v \in S} \frac{2\, T(v)}{\deg(v)\,(\deg(v) - 1)}$

with T(v) being the number of triangles through node v and deg(v) the degree of node v. The described centrality scores have been implemented using the NetworkX library in Python 19 .
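A minimal usage sketch, assuming the sq.gr.centrality_scores and sq.pl.centrality_scores interface with a cluster_key argument; the key under which the per-cluster table is stored is an assumption.
import squidpy as sq

# compute group centrality scores per cluster on the spatial graph
sq.gr.centrality_scores(adata, cluster_key="<cluster_key>")

# results are stored as a per-cluster table (assumed key) and can be plotted directly
print(adata.uns["<cluster_key>_centrality_scores"])
sq.pl.centrality_scores(adata, cluster_key="<cluster_key>")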

Image analysis and segmentation
Image processing Before extracting features from microscopy images, the images can be preprocessed. Squidpy implements commonly used preprocessing functions such as conversion to grayscale or smoothing with a Gaussian kernel.
sq.im.process(img, method="gray")
img.show()
Implementations are based on the Scikit-image package 20 and allow processing of very large images by tiling the image into smaller crops and processing these (see the smoothing sketch below).
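A complementary sketch for the smoothing step mentioned above, assuming method="smooth" accepts a sigma keyword passed to the underlying Scikit-image Gaussian filter and that layer_added names the resulting layer (both keyword names are assumptions):
import squidpy as sq

# smooth the tissue image with a Gaussian kernel and store it as a new layer
sq.im.process(img, method="smooth", sigma=2, layer_added="image_smooth")
img.show("image_smooth")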
Image segmentation Nuclei segmentation is an important step when analysing microscopy images. It allows the quantitative analysis of the number of nuclei, their areas, and morphological features. There is a wide range of approaches for nuclei segmentation, from established techniques like thresholding to modern deep learning-based approaches.
A difficulty in nuclei segmentation is distinguishing between partially overlapping nuclei. Watershed is a classic algorithm used to separate overlapping objects by treating pixel values as a local topography (a minimal call is sketched after this passage). Starting from the points of lowest intensity, the image is flooded until basins from different starting points meet at the watershed ridge lines.
Custom approaches with deep learning Depending on the quality of the data, simple segmentation approaches like watershed might not be appropriate. Nowadays, many complex segmentation algorithms are provided as pre-trained deep learning models, such as Stardist 21 , Splinedist 22 and Cellpose 23 . These models can be easily used within the segmentation function.
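A minimal sketch of the watershed variant described above, assuming sq.im.segment accepts method="watershed" and a layer_added keyword for the output layer name (both are assumptions); the pre-trained model variant follows in the next snippet.
import squidpy as sq

# watershed-based nuclei segmentation; the result is added as a new layer
# holding one integer label per segmented object
sq.im.segment(img, layer="image", method="watershed", layer_added="segmented_watershed")
img.show("segmented_watershed")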
sq.im.segment(img, method=<pre-trained model>)
img.show()
Image features Tissue organisation in microscopy images can be analysed with different image features. These filter relevant information from the (high-dimensional) images, allowing easy interpretation and comparison with other features obtained at the same spatial location. Image features are calculated from the tissue image at each location (x, y) where transcriptomics information is available, resulting in an obs x features matrix similar to the obs x genes matrix. This image feature matrix can then be used in any single-cell analysis workflow, just like the gene matrix.
The scale and size of the image used to calculate features can be adjusted using the scale and spot_scale parameters. Feature extraction can be parallelized by providing n_jobs (see Supplementary Figure 1). The calculated feature matrix is stored in adata.obsm[key_added].
sq.im.calculate_image_features(adata, img, features=<list>, spot_scale=<float>, scale=<float>, key_added=<str>)
Summary features calculate the mean, the standard deviation or specific quantiles of a color channel. Similarly, histogram features scan the histogram of a color channel to calculate quantiles according to a defined number of bins.
sq . im . calculate_image_features ( adata , img , features = " summary " ) sq . im . calculate_image_features ( adata , img , features = " histogram " ) Textural features calculate statistics over a histogram that describes the signatures of textures. To grasp the concept of texture intuitively the inextricable relationship between texture and tone is considered 24 : if a small-area patch of an image has little variation in it's gray tone the dominant property of that area is tone. If the patch has a wide variation of gray tone features, the dominant property of the area is texture. An image has a simple texture if it consists of recurring textural features. For a grey level image I or e.g. a fluorescence color channel, a co-occurrence matrix C is computed. C is a histogram over pairs of pixels (i, j) with specific values (p, q) 2 [0, 1, ..., 255] 2 and a fixed pixel offset: C p,q = X i I(i),p I(j),q with Kronecker-Delta . The offset is a fixed pixel distance from i to j under a fixed direction angle. Based on the co-occurence matrix different meaningful statistics (texture properties) can be calculated which summarize textural pattern characteristics of the image: X p,q C p,q (p q) 2 contrast X p,q C p,q |p q| dissimilarity X p,q C p,q 1 + (p q) 2 homogeneity sq . im . calculate_image_features ( adata , img , features = " texture " ) All the above implementations rely on the Scikit-image python package 20 .
Segmentation features Similar to image features that are extracted from raw tissue images, segmentation features can be extracted from a segmentation object (3). These features allow computing statistics over the number, area, and morphology of the nuclei in one image. To compute these features, the Image Container img needs to contain a segmented image at layer <segmented_img>.
sq.im.calculate_image_features(adata, img, features="segmentation", features_kwargs={"label_layer": <segmented_img>})
Custom features based on deep learning models Squidpy's feature calculation function can also be used with custom user-defined feature extraction functions. This enables the use of, e.g., pre-trained deep learning models as feature extractors (a sketch follows below).
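A minimal sketch of a custom feature extractor, assuming features="custom" with a "func" entry in features_kwargs is how user-defined callables are passed (the exact keyword layout is an assumption); the mean-per-channel function stands in for a pre-trained model's embedding.
import numpy as np
import squidpy as sq

def mean_per_channel(arr: np.ndarray) -> np.ndarray:
    # placeholder for a deep learning feature extractor: receives the image crop
    # at one spot location as an array (y, x, channels) and returns a feature vector
    return arr.mean(axis=(0, 1))

# assumption: user-defined callables are registered under the "custom" feature set
sq.im.calculate_image_features(
    adata,
    img,
    features="custom",
    features_kwargs={"custom": {"func": mean_per_channel}},
    key_added="custom_features",
)
print(adata.obsm["custom_features"].shape)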