A general framework for the combined morphometric, transcriptomic, and physiological analysis of cells using metric geometry

Morphology is an essential phenotype in the characterization of cells and their states, as it is closely related to cell function. Recent advances like Patch-seq enable simultaneously profiling the morphology, gene expression, and physiological properties of individual cells. However, computational methods that can summarize the great diversity of complex cell morphologies found in tissues and infer associations with other single-cell data modalities remain scarce. Here we report a computational framework, named CAJAL , for the morphometric and multi-modal analysis of single-cell data. CAJAL uses concepts from metric geometry to accurately build and visualize cell morphology summary spaces, integrate cellular morphologies across technologies, and establish associations between morphological, molecular, and physiological cellular processes. We demonstrate the utility of CAJAL by applying it to published Patch-seq, patch-clamp, serial electron, and two-photon microscopy data, and show that it represents a substantial improvement in functionality, scope, and accuracy with respect to current methods for cell morphometry.


Introduction
The idea that the morphology of a cell is related to its function has been central to major discoveries in cell biology over the past century, including the neuron doctrine 1 , the molecular basis of sickle cell disease 2 , and the pathways for cell migration and chemotropic sensing 3 . In the digital era, the study of cell morphology continues to have a prominent role in biology and medicine. Cell shape can be indicative of disease and is a key diagnostic tool for some pathologies 4 . The development of image-based drug screens on cultured cells has uncovered the mechanism of action of multiple drugs 5,6 . In the nervous system, thousands of neurons have been morphologically characterized using whole-cell patch-clamp 7 , and the recent incorporation of high-throughput single-cell RNA-seq onto this technique, known as Patch-seq [8][9][10][11][12] , has produced deeper characterizations that include concurrent morphological, transcriptomic, and electrophysiological information of individual cells 13 . Additionally, the combination of Credependent sparse labeling, high-resolution microscopy, and slice alternation methods has led to accurate morphological reconstructions of individual cells across macroscopic volumes 14,15 . The potential of this new array of experimental techniques is immense, not only for cell taxonomic purposes [16][17][18][19][20][21] , but also for uncovering the molecular and physiological determinants of cell morphology. However, to ensure continued progress in this direction, the implementation of high-resolution cell morphology profiling techniques needs to be accompanied by the development of computational methods that can take full advantage of them.
Algorithms for cell morphometry seek to determine similarities among the morphology of individual cells in digitally reconstructed microscopy images. They extract a set of shape descriptors that summarize the morphology of each cell, and then quantify differences between descriptors. Several types of descriptors are commonly used. Simple geometric descriptors consist of general morphological readouts like the area, perimeter, and lengths of major and minor axes 22,23 . They can be applied to most cell types and are invariant under rigid transformations. However, they cannot accurately discriminate complex cell morphologies like those of neurons and glia. Cell-type-specific descriptors include more complex morphological characteristics, such as neuronal branching patterns (e.g. Sholl analysis 24 , L-measure 25 , and SNT 26 ), which can be used to summarize complex 3D morphologies. However, these descriptors need to be tailored to the specific cell type of interest and cannot be broadly applied.
In addition, they are arbitrary with respect to the features that are used and the weight assigned to them. To overcome these limitations, other methods generate a similarity score based on tree alignment (e.g. NBLAST 27 and BlastNeuron 28 ) or decomposition in Fourier, Zernike, or spherical harmonic moments [29][30][31][32] . These methods require building combinations of descriptors that are invariant under rigid transformations or carefully pre-aligning the cells using Procrustes analysis, and they fail to accurately quantify morphological differences between highly dissimilar cells.
Additionally, the similarity scores produced by these and other methods, like those based on persistent homology 33,34 , do not directly reflect the biophysical processes involved in cell morphological changes and do not lead to an actual distance function. These limitations have precluded the development of algebraic and statistical approaches for integrating and batchcorrecting cell morphology spaces, constructing consensus cell morphologies, or inferring cell state trajectories associated with morphological processes.
Here, we build upon recent developments in applied geometry [35][36][37] to establish a computational framework for summarizing complex and heterogeneous 2D and 3D cell morphologies across the broad range of cells found in mammalian tissues. This framework enables the characterization of morphological cellular processes from a biophysical perspective and produces an actual mathematical distance upon which rigorous algebraic and statistical analytic approaches can be built. The proposed framework has the generality and stability of simple geometric shape descriptors, the discriminative power of cell-type specific descriptors, and the unbiasedness and hierarchical structure of moments-based descriptors. Using this approach, we address several gaps in relating cell morphology to mechanism and function, including the combined analysis of morphological, molecular, and physiological information of individual cells, and the integration of morphological information across datasets and technologies. We expect this framework, and the accompanying software CAJAL, to greatly benefit researchers working with cell morphology data by not only increasing the accuracy and versatility of morphometric analyses, but also enabling currently unavailable analyses such as the integration cell morphology data across experiments and technologies.

A general framework for the quantitative analysis of cell morphology data
In its simplest formulation, the study of cell morphology involves the quantitative comparison of cell shapes irrespectively of distance-preserving transformations (isometries), such as rotations and translations. From a mathematical standpoint, this is a problem of metric geometry. The Gromov-Hausdorf (GH) distance 38, 39 measures how far two compact metric spaces are from being isometric. In physical terms, it determines the minimum amount of deformation required to convert the shape of an object into that of another. The use of the GH distance to describe cell shapes would be therefore broadly applicable to any cell type, as it does not rely on geometric features that are particular to the cell type or require pre-aligning the cells to a reference shape. Because of these reasons, we sought to develop a general computational framework for cell morphometry by building upon these concepts in metric geometry.
Since the calculation of the GH distance is computationally intractable even for relatively small datasets, we based our approach upon a computationally efficient approximation, referred to as the Gromov-Wasserstein (GW) distance [35][36][37] . The GW distance preserves most of the mathematical properties of the GH distance 36 . Although its running time grows cubically with the number of points, its efficiency can be further improved by means of nearly linear-time approximations that build upon optimal transport regularization 40,41 and nesting strategies 42 .
The starting point to our analytic framework is the 2D or 3D segmentation masks of individual cells, which are discretized by evenly sampling landmark points from their outline (Fig.   1a). For each cell, we compute the pairwise distance matrix ( ) between its landmark points.
Then, for each pair of cells, and , we compute the GW distance between the matrices and using optimal transport theory (Fig. 1b). The result of this process is a pairwise GW distance that quantifies the morphological differences between each pair of cells.
Different metrics for measuring distances between landmark points lead to different properties of the GW distance matrix, which may be advantageous in specific applications (Fig.   1a). Euclidean distance results in a GW matrix that accounts for the positioning of cell appendages, which can be particularly useful in the study of neuronal projections. Geodesic distance results in a GW matrix that is invariant under bending deformations of the cell, and it is therefore particularly sensitive to topological features such as the branching structure of cell appendages.
In all cases, the GW distance that results from this approach can be thought of as a distance in a latent space of cell morphologies (Fig. 1c). By phrasing the problem in this way, we can use existing tools of statistics and machine learning to define cell populations based on their morphology, dimensionally reduce and visualize the space of cell morphologies, integrate morphological datasets across tissues and technologies, integrate morphological data with other single-cell data modalities (for example, single-cell RNA-seq or ATAC-seq data), or infer trajectories associated with dynamic morphological processes. We implemented this general framework for cell morphometry in an open-source Python library, called CAJAL, that enables these analyses for arbitrarily complex and heterogeneous cell populations (Fig. 1d).

GW cell morphology spaces accurately summarize complex cell shapes
To assess the ability of CAJAL to summarize complex cell morphologies, we applied it to the 3D basal and apical dendrite reconstructions of 506 neurons from the mouse visual cortex using published whole-cell patch-clamp data 20 . The space of cell morphologies produced by our approach recapitulated the morphological neuronal types of the visual cortex (Fig. 2a). Cells with a similar morphology appeared in proximity in the UMAP representation of this space.
Consistently, molecularly defined neuronal types were localized in different regions of the UMAP representation ( Fig. 2b), demonstrating the ability of CAJAL to discriminate cell subtypes based on their morphology. Excitatory and inhibitory neurons clustered separately, and individual neurons were organized in the morphology space according to their cortical layer and Cre driver line (Fig. 2b). We clustered the morphology space using Louvain community detection 43 and partitioned it into 9 morphological populations of neurons. Using the metric structure of the GW morphology space, we then computed the medoid and average morphology of each morphological cell population (Fig. 2c). The resulting summaries accurately represented the main morphological characteristics of each cell population and were consistent with the diversity of neuronal morphologies found in the mouse visual cortex 20 .
To quantitatively evaluate the ability of CAJAL to accurately summarize complex neuron morphologies in comparison to state-of-the-art methods for neuron morphometry, we considered two Patch-seq datasets of the visual 19 and motor cortex 21 in addition to the pathclamp dataset of the visual cortex. For each dataset, we assessed the ability of each morphometric method to identify morphological differences between molecularly defined neurons. In the case of the patch-clamp dataset, we considered neurons labeled with different Cre driver lines, for a total of 31 lines, with the understanding that each line preferentially labels distinct neuronal types. In the case of the two Patch-seq datasets, we considered the classification of motor and visual cortex neurons into 9 and 6 transcriptionally defined subtypes (t-types), respectively, based on their gene expression profile 19,21 . We used three different metrics of performance to evaluate the ability of CAJAL and 5 other methods (Sholl analysis 24 , L-measure 25 , SNT 26 , NBLAST 27 , and TMD 33 ) to predict the molecular type of each neuron based on its morphology: the multi-class Matthews Correlation Coefficient of a nearest neighbor classifier, the Calinski-Harabasz clustering score of neurons from the same molecular type in the cell morphology space, and the benchmarking score introduced in a previous study of cell shape analysis methods 30 . In our comparative study, we found that CAJAL outperformed existing methods for neuron morphometry (Fig. 2d). Moreover, some of the best performing methods, such as TMD, produced errors and failed to summarize the morphology of 26 neurons for which the model assumptions were violated. A similar evaluation of the ability of each algorithm to identify 47 t-types of inhibitory neurons and 26 t-types of excitatory neurons in the motor cortex 21 led to consistent results ( Supplementary Fig. 1). Expectedly, the accuracy of CAJAL increased with the number of landmark points that are sampled from the outline of the cell ( Supplementary Fig. 2). However, in these data the accuracy saturated at approximately 100 landmark points, indicating no major advantage in sampling a larger number of points.
Altogether, these results demonstrate the utility and versatility of CAJAL to perform unbiased studies of complex cell morphologies.

GW cell morphology spaces summarize cell shapes across heterogeneous cell populations
As argued above, one of the main advantages of adopting a geometric approach for cell morphometry is the generality of this approach. To demonstrate this aspect, we applied CAJAL to study the morphologies of 70,510 cells from a cubic millimeter volume of the mouse visual cortex profiled by the Machine Intelligence from Cortical Networks (MICrONS) program using two-photon microscopy, microtomography, and serial electron microscopy 15 . This dataset includes not only neurons, but also multiple types of glia and other cells. The UMAP representation of the cell morphology space produced by CAJAL recapitulated the diverse cell types present in the tissue, including multiple populations of neurons, astrocytes, microglia, and immune cells (Fig. 3a). These populations were also consistent with the manual annotations of 185 individual cells provided by the MICrONS program (Fig. 3b). As with our analyses of neuronal patch-clamp data, neurons from different cortical layers were associated with distinct regions of the cell morphology space (Fig. 3c). In addition, our analysis of the cell morphology space uncovered morphological differences between astrocytes from different cortical layers (Figs. 3d, e). Layer 1 astrocytes were substantially smaller, and layer 2/3 astrocytes elongated perpendicularly to the pial surface, compared to other astrocytes (Figs. 3d, e). These morphological differences were consistent with those found in a recent study using Glast-EMTB-GFP transgenic mice 44 .
To quantitatively evaluate the ability of CAJAL to summarize cell morphology across different cell types in comparison to existing general approaches for cell shape analysis, we considered the 3D morphological reconstructions of 512 T cells from the mouse popliteal lymph node, submandibular salivary gland, and skin, profiled with intra-vital two-photon microscopy 29 , in addition to the patch-clamp dataset of the mouse visual cortex 20 . We evaluated the ability of CAJAL and 4 other general approaches for cell shape analysis (CellProfiler 45 , SPHARM 29, 32 , Zernike moments 30,31 , and the PCA-based approach of Celltool 30 and VAMPIRE 46 ) to predict the anatomical location of T cells and the Cre driver line of neurons based on their morphology.
According to most metrics, CAJAL was the most capable algorithm at separating the morphologies of T cells from different tissues (Fig. 3f). Moreover, all methods for general cell shape analysis except for CAJAL performed poorly in the analysis of complex neuronal morphologies (Fig. 3f). These results demonstrate that the proposed approach overcomes the limitations of current methods to summarize the broad range of complex cell shapes present in mammalian tissues.

Multimodal analysis of GW cell morphology spaces uncovers genetic determinants of cell morphology
Cell morphology reflects the progression of high-level cellular processes, such as cell differentiation, adhesion, migration, and plasticity. The combined analysis of morphological and genomic data from individual cells has the potential to unravel the molecular pathways that are associated with, and may ultimately drive, these morphological processes.
Since cell morphology changes are continuous, establishing associations between cell morphology and other data, such as genetic, molecular, and clinical data, is best accomplished by methods of analysis that are purpose-built for continuous processes. We built upon our previous work on clustering-independent analyses of omics data 47 to implement a statistical approach for identifying biological features that are associated with changes in cell morphology.
In essence, we use the Laplacian score for feature selection 48 to test the association between the values of each molecular or clinical feature and the structure of the morphology space, while accounting for user-specified covariates such as the developmental stage or animal age (Fig.   4a). We used this approach to identify molecular mechanisms that contribute to morphological plasticity of neurons in C. elegans. For that purpose, we considered the 3D morphological reconstructions of the male C. elegans GABAergic DVB interneuron in 12 gene mutants, 5 double mutants, and controls across days 1 to 5 of adulthood (Supplementary Table 2), including 7 gene mutants and 1 double mutant from a previous study 49 . The DVB neuron develops post-embryonically in the dorsorectal ganglion and innervates enteric and spicule protraction muscles in the male tail. It undergoes post-developmental neurite outgrowth in males, altering its morphology and synaptic connectivity, and contributing to changes in the spicule protraction step of male mating behavior 50 . We applied our approach to identify loss of function mutations that are associated with changes in the dynamic morphology of this neuron, taking the age of the worm as a covariate to reliably compare morphological changes across timepoints in adulthood. This analysis identified mutations in unc-97 and nlg-1 as significantly affecting the morphology of the DVB neuron (Fig. 4b, c; Laplacian score permutation test, FDR < 0.05). By repeating this analysis for worms of each age separately, we identified the timepoint at which each of these mutations starts significantly affecting the morphology of the DVB neuron (Fig 4d). To interpret these morphological differences in terms of neuronal characteristics, we used the same approach to evaluate the association of 33 morphological features with the structure of the cell morphology space (Supplementary Table 1). Within these significantly associated features, we found that mutations in nlg-1 caused an increase in neurite length and number of branches compared to control worms (Wilcoxon rank-sum test p-values = 10 -10 and 10 -7 , respectively), while inactivating mutations in unc-97 stunted neurite growth (Wilcoxon ranksum test p-values = 10 -22 and 10 -14 , respectively). Altogether, these results confirm previous findings 49 and extend them by uncovering quantitative differences in the age of onset of the morphological alterations induced by these mutations.

Integrative analysis of molecular, physiological, and morphological data from single cells
Patch-seq incorporates high-throughput single-cell RNA-seq onto whole-cell patchclamp to simultaneously measure the transcriptome, electrophysiology, and morphology of individual neurons 13 . Multimodal analyses of Patch-seq data therefore have the potential to uncover transcriptomic and elecrophysiological programs associated with morphological changes. We used CAJAL to analyze the basal and apical dendrites of 370 inhibitory and 274 excitatory motor cortical neurons profiled with Patch-seq 21 . Consistent with our previous analyses, the cell morphology space captured morphological differences between the dendrites of different neuronal t-types and cortical layers (Fig. 5a, b). To characterize the gene expression and electrophysiological programs associated with these differences, we used the above approach based on the Laplacian score to identify genes and electrophysiological features that are significantly associated with the structure of the cell morphology space. We considered inhibitory and excitatory neurons separately. Our analysis identified 173 and 556 genes, and 14 and 22 electrophysiological features, respectively, that were significantly associated with the morphological diversity of dendrites ( Fig. 5c and Supplementary Tables 3 and 4; Laplacian score permutation test, FDR < 0.1). Among the 7 genes that were significant for both excitatory and inhibitory neurons, there were several genes that are known to be involved in dendrite morphogenesis, such as Dscam, which plays a central role in pathways for dendritic selfavoidance 51 , and Pcdh7, which regulates dendritic spine morphology and synaptic function 52 .
More broadly, gene ontology enrichment analysis for biological processes revealed an enrichment for neuron projection morphogenesis among the genes that were significantly associated with the morphological diversity of excitatory neurons (GO enrichment adjusted pvalue = 0.009). Similarly, it identified an enrichment for cellular response to chemical stimulus, neuron differentiation, and taxis among the genes significantly associated with the morphological diversity of inhibitory neurons (GO enrichment adjusted p-values = 0.006, 0.008, and 0.01, respectively).
To investigate if these processes are part of continuous morpho-transcriptomic programs, we computed the RNA velocity field 53,54 in the gene expression space. The RNA velocity field predicts the future transcriptional state of each cell based on the observed ratio between unspliced and spliced transcripts 53,54 . The time scale of these predictions is determined by the mRNA degradation rate and is of the order of hours. We reasoned that projecting the RNA velocity field onto the GW cell morphology space would allow us to identify cellular processes that involve continuous and consistent changes in gene expression and cell morphology. This approach revealed several morpho-transcriptomic trajectories involving chandelier, basket, and Lamp5 + neurons (Fig. 5d). Cells along these trajectories showed increased complexity in their apical and basal dendrites in parallel to changes in their gene expression profile, in agreement with the presence of molecular programs associated with the plasticity of these neuronal types.
More generally, we evaluated the degree of consistency between the electrophysiological, transcriptomic, and morphological characteristics of neurons of the same ttype by representing the pairwise distance between each pair of cells in the three latent spaces (transcriptomic, electrophysiological, and morphological) as a point in a 2D simplex (Fig. 5e).
This analysis revealed a large degree of variability in the morphology of the dendrites of extratelecephalic-projecting (ET) neurons that was not paralleled by their gene expression profile. In contrast, the dendrites of Lamp5 + and bipolar (Vip + ) GABAergic neurons presented limited morphological variability in comparison to their gene expression profile (Fig. 5e).
Altogether, these results demonstrate the utility of CAJAL to identify molecular and electrophysiological programs associated with cell morphology changes based on single-cell Patch-seq data.

Integration of cell morphology spaces across technologies
Advances in cell morphology profiling techniques have led to an explosion of highresolution cell morphology data over the past decade 7 . The ability to perform integrated analyses of such data regardless of the experimental approach and technology that was used to generate them would be a powerful tool for refining taxonomic classifications of cells and imputing missing data. For example, it would enable inferring the transcriptome of cells profiled with patch-clamp based on their morphology by using reference Patch-seq datasets.
We used CAJAL to build a combined morphology space of the basal and apical dendrites of visual cortex neurons profiled with patch-clamp 20 and visual and motor cortex Martinotti cells annotated by MICrONS, one cell presenting a distinct morphology with a densely arborized axon was closer in the morphology space to Patch-seq cells of the Sst Chrna2 t-type ( Fig. 6d; Wilcoxon rank-sum test p-value = 0.045), while the other two Martinotti cells were closer to Sst Calb2 t-type cells ( Fig. 6d; Wilcoxon rank-sum p-value = 10 -3 ). Expression of Chrna2 has been shown to mark layer 5 Martinotti cells that project into layer 1 55 , and we confirmed that the soma of the predicted Chrna2 Martinotti cell was located in layer 5 while its longest reaching axon ended in layer 1 (Fig. 6d). Similarly, among the manually annotated basket cells in the MICrONS dataset, one had a more condensed morphology than the others Taken together, these results demonstrate the utility of GW cell morphology spaces to perform integrative analyses of cell morphology across datasets and technologies, and they constitute a conceptual basis for the development of algorithms for cell morphology data integration and batch-correction.

Discussion
Shape registration has experienced several breakthroughs over the past 15 years with the formalization of new paradigms that allow for more flexibility in the quantification of morphology 57 . Here, we built upon one of these constructions, the GW distance, to develop a general computational framework and software for the combined morphometric, transcriptomic, and physiological analysis of individual cells. The proposed framework does not rely on predefined morphological features, is insensitive to rigid transformations, and can be efficiently used with arbitrarily complex and heterogeneous cell morphologies. Using this approach, we have been able to accurately build, analyze, and visualize cell morphology summary spaces, where each cell is represented by a point and distances between cells indicate the amount of physical deformation needed to change the morphology of one cell into that of another. We have integrated morphological data across experiments and technologies; identified morphological, molecular, and physiological features that define cell populations; and established associations between morphological, molecular, and electrophysiological cellular processes. Our quantitative comparative studies using published Patch-seq, patch-clamp, and two-photon microscopy datasets demonstrate that the proposed approach represents a boost in accuracy, functionality, and scope with respect to current methods for the analysis of cell morphology data.
There are several limitations of the proposed approach, the most important one perhaps being its running time. In our studies, the application of CAJAL to 644 digital neuron reconstructions on an 8-core desktop computer took 70 minutes. We expect that the implementation of recent strategies for reducing the computing time of the GW distance 40,42 might improve the scalability of CAJAL to larger datasets without having to reduce the number of sampled landmarks per cell. This will become particularly important as the throughput of technologies for high-resolution cell morphology profiling continues to grow. In addition, the interpretability of cell morphology spaces in terms of specific morphological features is limited.
To address this aspect, in our studies we evaluated interpretable morphological descriptors produced by methods like SNT 26 , L-measure 25 , or CellProfiler 45 on the cell morphology space produced by CAJAL to combine the interpretability of these descriptors with the unbiasedness and accuracy of cell morphology spaces.
Overall, the application of metric geometry to cell morphometry serves as a pillar for the development of currently missing computational methods for integrating and batch-correcting cell morphology data, as well as for modelling continuous morphological cellular processes. We expect the development of these methods during the coming years will significantly impact our understanding of the relation between the morphological, molecular, and physiological diversification of cells.

Computation of GW cell morphology spaces
We build upon the application of metric geometry to the problem of finding a correspondence between two point-clouds such that the size of non-isometric local transformations is minimized [35][36][37] . CAJAL takes as input the digitally reconstructed cells. For each cell , it samples points regularly from the outline and computes their pairwise distance matrix, . It then computes the GW distance between every pair of distance matrices, where the matrix specifies a weighted pointwise matching between the points of cells and , and represents the space of all possible weighted assignments 36 . By construction, CAJAL does not require pre-aligning cell outlines, since the input to is the pairwise distance matrix within each cell, , which is invariant under rigid transformations. Depending on the application, we consider two choices for the distances : Euclidean and geodesic distance.
The output is a metric space for cell morphologies which can then be clustered and visualized using standard procedures, such as Louvain community detection 43 and UMAP 58 . For each population of cells, , we compute its average morphology as the distance matrix where med( ) denotes the medoid of with respect to the distance matrix. The morphology can then be visualized by computing the shortest-path tree or multidimensional scaling (MDS) of ̂. In addition, to facilitate the interpretation of morphology spaces, we find it is useful to plot the values of standard morphological descriptors like cell height, width, diameter, neuronal depth, or fractal dimension 25,26,45 in the UMAP representation of the cell morphology space.

Evaluation of features on the cell morphology space
To evaluate features, such as gene expression or electrophysiological properties, on GW cell morphology spaces, we build upon a spectral approach for clustering-independent analyses of multimodal data 47,48 . We first construct a radius neighbor graph of the distance with radius . Each feature is represented by a vector of length the number of cells.
The Laplacian score of on the cell morphology space is then given by 48

Processing of Patch-seq and patch-clamp morphological reconstructions
We downloaded the morphological reconstructions of neurons from several repositories.
For the Gouwens et al. 20 Patch-clamp dataset, we downloaded 509 reconstructions in SWC format from the Allen Cell Types database, using Cell Feature Search and selecting for "Full" or "Dendrite Only" reconstruction types. Three of the SWC files were unsorted and were left out of further processing, for a total of 506 neurons. For the Gouwens et al. 19 Patch-seq dataset, we downloaded 574 reconstructions from the Brain Image Library (BIL) repository. We removed 62 neurons that did not have assigned transcriptomic types, for a total of 512 neurons. Lastly, for the Scala et al. 21 Patch-seq dataset, we downloaded 645 reconstructions from the inhibitory and excitatory sets from the BIL repository, skipping the one inhibitory neuron that had no dendrites, for a total of 644 neurons.
The SWC format represents each neuron as a tree of vertices, such that an edge can be drawn between a vertex and its parent, forming the skeleton of the neuron. From this format, we sampled 100 points radially around the soma at a given step size. We used a binary search to identify the step size which returns the required amount of evenly spaced points. To calculate the pairwise geodesic distance between these points, we constructed a weighted graph with weights given by the distance to the latest sampled point. We then used the Floyd-Warshall algorithm implemented in Networkx 59 to compute all pairwise shortest path distances in this graph. Alternatively, we computed the pairwise Euclidean distance between the 3D coordinates of these points.
We the computed the GW distance between each pair of cells as described above (subheading "Computation of GW cell morphology spaces") using the ot.gromov.gromov_wasserstein function of the "POT: Python Optimal Transport" Python library 60 . We then used this precomputed distance to build a 2D visualization of the morphology space using the https://github.com/tkonopka/umap package in R. We computed the Louvain clusters of a KNN graph of the GW distance using the multilevel.community function of the igraph R package 61 .

Average shape of neurons
To compute the average shape of a cluster of cells in the GW cell morphology space, we first found the medoid cell as the cell with the minimum sum of distances to all other cells in the cluster. To compute a morphological distance between cells, the GW algorithm identifies an optimal matching between the landmark points we have sampled (subheading "Computation of GW cell morphology spaces"). We used this matching to align the other cells in the cluster to the medoid, by reordering the pairwise geodesic distance matrix of their sampled points to match the distance matrix of the medoid cell. We rescaled the geodesic distance matrix of each cell into an unweighted graph distance by dividing out the minimum distance between any two points, so that the rescaled distances were integers. We set a threshold on these distances at 2, such that the distance was 0 from the point to itself, 1 to an adjacent point in the tree of the neuron trace, and 2 to any farther point. We averaged all of these distance matrices together over the cells in the cluster and built a = 3 nearest neighbor graph, essentially connecting each landmark point to the three other points it was most often adjacent in the neurons of that cluster. We took the shortest path tree in this graph as the average shape for that cluster using Dijkstra's algorithm. We color each point in this average shape by a confidence value based on its minimum original unweighted graph distance, summed over the cluster, to any other point.

Comparison of CAJAL with current methods for neuronal morphometry
We compared our approach to five other morphological methods for neuron analysis by applying them to the dendrite reconstructions of the neuronal datasets listed above (subheading "Processing of Patch-seq and patch-clamp morphological reconstructions"). These methods have stricter assumptions on the input, forcing us to remove disconnected dendrites from the reconstructions. We applied NBLAST 27 , as implemented in the nat.nblast R package (https://github.com/natverse/nat.nblast). We calculated a pairwise distance between all neurons using the nblast_allbyall function with the mean normalization method. We ran the Topological Morphology Descriptor (TMD) method of Kanari et al. 33 24 . We also computed morphological features using L-Measure 25 , selecting all of their provided functions.
We used three different metrics to assess the ability of these algorithms to identify morphological differences between Cre lines or transcriptomic types. We computed the Calinski-Harabasz clustering score using cluster.stats from the fpc R package (https://cran.rproject.org/package=fpc). We also implemented the median-based group discrimination statistic used by Pincus et al. 30 to compare methods for cell-shape analysis. Lastly, we used a 7-fold = 10 nearest neighbor classifier from the scikit-learn Python library to predict the Cre driver line or t-type of each cell based on morphological distance and used the Matthew's correlation coefficient to evaluate the accuracy of the predictions.

Morphological analysis of the MICrONS dataset
We downloaded the 113,182 static cell segmentation meshes from MICrONS using the trimesh_io module from the package MeshParty (https://meshparty.readthedocs.io/) at the lowest resolution (resolution 3). We then downloaded higher resolution meshes for cells that had less than 10,000 vertices at this lowest resolution. Cells with less than 1,001 vertices at the lowest resolution were re-downloaded at the highest resolution (resolution 0). Cells with 1,001 to 3,000 vertices at the lowest resolution were re-downloaded at resolution 1, and cells with 3,001 to 10,000 vertices were re-downloaded at resolution 2.
Along with other metadata available through the CAVEclient (https://github.com/seunglab/CAVEclient), such as the 3D coordinates of neuron soma, we collected the cell IDs for each manually annotated cell type provided by the MICrONS program in their website. We used the layer 2/3, layer 4, and layer 5 manually annotated cells to estimate cortical layer boundaries in the y values of the 3D soma coordinates. We placed these cutoffs at layer 1 < 104,191 < layer 2/3 < 133,616 < layer 4 < 179,168 < layer 5 < 213,824 < layer 6.
We sampled 50 vertices from the triangular mesh of each cell, using the linspace function of the NumPy package 62 to evenly select vertices, since vertices were roughly ordered by proximity, and this gives an approximation of even sampling over the 3D space. We skipped the very large blood vessel mesh and 240 meshes with less than 50 vertices, for a total of 112,941 meshes. We used the heat method 63  We found that some morphological clusters mostly consisted of artifacts or neuron-glia doublets and removed those. In addition, another morphological cluster contained both neuronneuron doublets and individual neurons with complex morphologies, so we devised an approach to remove meshes containing multiple somas from that cluster. We determined the number of somas in each mesh from that cluster by using MeshParty to skeletonize the meshes and convert them into graph representations where nodes have a radius value, and nodes within soma regions fall in a specific range of radii. For us, this range was 4,000 to 30,000. We used HDBSCAN 66 to cluster these nodes in the 3D space and counted each cluster with at least three nodes as a soma. Meshes with more than one soma were removed from the cluster. Lastly, we noticed that many meshes with very high y coordinates appeared stretched, so we removed meshes with a y soma coordinate greater than 240,000. After removing all these artifacts, we recomputed a UMAP visualization of the remaining 70,510 cells in the cell morphology space using umap-learn.
For each astrocyte, we measured the bounding box by placing lower and upper bounds on the 1% and 99% quantiles of the mesh vertices along each of the first three principal components. We took the arccosine of the first principal component along the y axis to be the orientation angle of the astrocyte and measured its deviation from perpendicular.

Morphological analysis of T cells
We retrieved 512 3D meshes of T cells from Medyukhina et al. 29 . We evenly sampled 200 points from the list of vertices in each mesh, which approximates an even sampling in 3D space since the vertices are roughly ordered in a spiral down the cell. We computed the GW distance of the pairwise Euclidean distances between these points as described above (subheadings "Computation of GW cell morphology spaces" and "Processing of Patch-seq and patch-clamp morphological reconstructions").

Comparison of CAJAL with general methods for cell morphometry
We applied the Celltool method of Pincus et al. 30  components from the spectra produced by compute_spharm. We compared these methods to CAJAL using the same metrics described above (subheading "Comparison of CAJAL with current methods for neuronal morphometry").

Evaluation of the accuracy and runtime of CAJAL as a function of the number of sampled points
We sampled 25, 50, 75, 100, and 200 points from each cell from the patch-clamp dataset of Gouwens et al. 20 and applied CAJAL as described above to compute the GW distance between cells. We used the Calinski-Harabasz score, the median-based statistic of Pincus et al. 30 , and the Matthews coefficient of a = 10 nearest neighbor classifier (subheading "Comparison of CAJAL with current methods for neuronal morphometry") to assess how the number of sampled points affects the ability of CAJAL to capture morphological differences between cells from different Cre driver lines. Runtimes were determined based on 12 threads of a desktop computer with an 8-core Intel Xeon E5-1660 3.20 GHz CPU.

Morphological analysis of the DVB neuron
We considered the neurite reconstructions of the DVB neuron from 873 adult male C.
elegans aged 1 to 5 days from control strains or strains containing mutations in the genes nrx-1,  Table   2). Reconstructions were created from confocal images of the DVB neuron using SNT 26 in Fiji 67 as previously described 49 . We computed the GW distance between these morphological reconstructions as described above (subheadings "Computation of GW cell morphology spaces" and "Processing of Patch-seq and patch-clamp morphological reconstructions"), based on the geodesic pairwise distance of 100 points sampled from each neuron. We then introduced an indicator function for each mutated gene, which took values 0 or 1 on each cell depending on whether the worm had a wild-type or a mutated version of the gene, respectively. To determine which of the 11 mutated genes were associated with changes in morphology, we computed the Laplacian score of each indicator function on the GW cell morphology space as described above (subheading "Evaluation of features on the cell morphology space"). To compute the score we used the R package RayleighSelection 47 with 1,000 permutations, equal to the median GW distance and the age of the worm in days as a covariate. In the same way, we used RayleighSelection to determine which of 33 morphological features computed with SNT were significantly localized in the cell morphology space. In addition, we performed the same analysis using only neurons from a single day, for each day, to determine the age at which the effect of significant mutations on the morphology of the DVB neuron starts to emerge.

Identification of genes and electrophysiological features associated with the morphology of neuronal dendrites
We used the same process described above (subheading "Processing of Patch-seq and patch-clamp morphological reconstructions") to compute the GW distance between the morphological reconstructions of the dendrites of 644 neurons profiled by Scala et al. 21 . We sampled 100 points from each dendrite and used geodesic distance to measure the distance between points. To determine which genes are associated with morphological variability we computed the Laplacian score of each gene on the GW morphology space using RayleighSelection, as described above (subheading "Evaluation of features on the cell morphology space"). Gene expression values were normalized as log(1 + 5000 •size-normalized expression), we used 1,000 permutations, and was given by the median GW distance. We only tested genes expressed in at least 5% and less than 90% of cells. We identified gene ontology enrichments using the R package gProfileR 68 , where we performed an ordered query of the significant genes based on their Laplacian score and restricted the search to biological process (BP) gene ontologies. We used the same procedure based on the Laplacian score to determine which electrophysiological features were associated with changes in the morphology of the dendrites.

Computation of RNA velocity trajectories
We clipped 3' Illumina adapters and aligned FASTQ files to the GRCm38 mouse reference genome using the STAR aligner 69 . We used the command "run_smartseq" from the velocyto command line tool 53 to create a Loom file of spliced and unspliced reads. We then used the scvelo Python package 54 in dynamical mode to compute the RNA velocity trajectories and project them onto the UMAP representation of the cell morphology space, using 20 minimum counts, 500 top variable genes, 10 principal components, and 30 neighbors. We computed the pseudotime using the velocity graph.

Consistency between transcriptomic, electrophysiological, and morphological spaces
We defined the transcriptomic distance ( ) between two cells as the Spearman correlation distance between their log-normalized gene expression profile, and their electrophysiological distance ( ) as the Euclidean distance between their electrophysiological feature vectors. We compared these distances and the GW morphological distance ( ) between all pairs of cells in the Scala et al. 21 dataset by representing them on a 2-simplex. For that purpose, we standardized the logarithm of pairwise distances independently for each data modality. We then took the axes of the 2-simplex to be the given by the difference between each pair of distances ( − , − , − ), so that the sum of the coordinates equals 0 for each pair of cells. We plotted cell pairs in the middle 98% of each axis.

Integrative analysis of Patch-seq and patch-clamp data
We combined the patch-clamp and two Patch-seq datasets into one cell morphology space by computing the GW distance between the morphological reconstructions of the dendrites of all 1,662 neurons from the 3 datasets. We sampled 100 points from each dendrite and used geodesic distance to measure distances between points.
To evaluate the integration of the Patch-seq datasets, we utilized the classification of neurons into the t-types of Tasic et al. 18 . This classification is provided by Gouwens et al. 19 as their transcriptomic alias, and we computed the classification for the dataset of Scala et al. 21 using their ttype-assignment Jupyter notebook. We tested the overlap between neurons of the same t-type but from different datasets in the cell morphology space by performing a Wilcoxon rank-sum test, comparing the distribution of GW distances within the same t-type with the distribution of GW distances between t-types.
To evaluate the integration between the two Patch-seq datasets and the patch-clamp dataset, we matched the neuronal t-types in the Patch-seq datasets with the Cre driver lines in the patch-clamp dataset. We used only the first marker in the t-types and considered markers that existed in at least five cells of two of the three datasets. This left Sst, Pvalb, and Vip as major markers between the t-types and Cre lines, and Lamp5 and Sncg as markers between ttypes only. We again used the Wilcoxon rank-sum test to compare the distributions of GW distances within and between these five major transcriptomic types.

Integrative analysis of Patch-seq and MICrONS neuronal data
We calculated a combined GW morphological space for the two Patch-seq datasets and 1,000 neurons evenly sampled from the MICrONS dataset, in addition to 140 manually annotated neurons by the MICrONS program. We sampled 50 points from the full neuronal reconstructions from the Patch-seq datasets. In the case of the Scala et al. 21 18 , we assigned all other Vip subtypes to bipolar cells. We then evaluated the consistency of the cell morphology space with these assignments by using a Wilcoxon rank-sum test to compare the distribution of GW distances between matching types across datasets with the distribution GW distances between non-matching types across datasets.

Code availability
The source code of CAJAL is available at https://github.com/CamaraLab/CAJAL.

Data availability
All the datasets used in this study are publicly available. The morphological reconstructions of the DVB neuron have been deposited in the neuromorpho.org database (Hart archive). The patch-clamp data of Gouwens et al. 20 can be accessed at the Allen Brain Atlas data portal (http://celltypes.brain-map.org/data). The Patch-seq datasets of Gouwens et al. 19   An optimal matching between the discretized morphologies of each pair of cells is established by computing the GW distance between their corresponding matrices. Computationally efficient approximations to the GW distance use optimal transport theory to establish a map between the distributions of landmark point pairwise distances in each cell. The value of the cost function at the minimum measures the amount of deformation that is needed to convert the shape of one cell in that of another. c) The GW distance matrix can be thought of as a distance ability of CAJAL to identify morphological differences between molecularly defined neurons is assessed in 3 datasets in comparison to 5 currently available methods. In this study, CAJAL offered substantially better or similar results compared to the best performing method in each dataset according to the three metrics that were used for the evaluation.