LOMAR: LOcalization Microscopy Analysis in R

In single molecule localization microscopy data (SMLM), individual instances of a macromolecular complex come in the form of point sets. Particle averaging, which combines localization data from a large number of instances, is often used to overcome experimental noise and obtain a refined view of the underlying structure. However, SMLM point sets are often heterogeneous due to biological variations in the structure they represent and must be partitioned into groups with similar structure before averaging which calls for being able to compute structurally-relevant similarity measures between sets of points. Here we introduce LOMAR (LOcalization Microscopy Analysis in R) a software package for the R programming language that enables comparison of point sets through implementation of several point sets registration methods and similarity measures derived from topological data analysis. We demonstrate use of the package on real and simulated SMLM data of nuclear pore complexes.


Introduction
SMLM techniques enable the determination of a fluorophore's position with a resolution of a few nanometers and is increasingly applied to the characterization of macromolecular complexes inside cells such as the nuclear pore complex (Sabinina et al., 2021), DNA damage repair foci (Sisario et al, 2018) or endocytic structures (Mund et al., 2018). While more typical light microscopy techniques produce data in the form of arrays of intensity values (i.e. images), SMLM produces sets of coordinates (i.e. point sets) with some additional information such as channel, precision and number of detected photons. While such data can be converted to images and processed using standard image analysis methods, new software and methods are have been developed to deal directly with the point cloud nature of SMLM data. Most of these focus on the segmentation of individual instances of the macromolecular structure of interest by applying clustering techniques to the coordinates data (Wu et al., 2020).
Incomplete labelling of structures and the stochastic nature of SMLM data acquisition result in noisy point sets with missing data. To overcome this issue, combining large numbers of instances with particle averaging (also sometimes called particle fusion) methods is increasingly used to refine the characterization of underlying structures. Particle averaging assumes that combined instances represent the same compositional and conformational structure but this assumption is not satisfied when biological variations contribute to the heterogeneity of the point sets obtained by SMLM. Partitioning point sets into structurally homogeneous groups is therefore necessary before applying particle averaging. This however requires being able to evaluate structural similarity between point sets in a manner that is robust to noise and missing data.
To help with this challenge, we developed the LOMAR software package which focuses on 3D point sets comparison by implementing point set registration methods and similarity measures based on topological data analysis. LOMAR is written in the R programming language. We demonstrate use of LOMAR on the registration of real 3D SMLM nuclear pore data and on the task of clustering simulated 3D instances of nuclear pores.

Results
Data input LOMAR can ingest SMLM 2D and 3D point set data from character-delimited text files.
Various filters can be applied to select a subset of the points and selected points can be further grouped using either cluster membership information included in the input file or by applying the DBSCAN clustering algorithm. Point sets can also be read from a series of images in TIFF format for use for example when corresponding objects have been obtained by image segmentation methods. For visualisation, and sometimes processing, it can be advantageous to convert point setss to images. For this, LOMAR implements a histogram binning method and an estimated photon count method (Huang et al., 2011).

Point set registration methods
The objective of point set registration methods is to find correspondence between multiple point sets and identify the spatial transformation that optimally aligns them. These methods have a long history in fields where sensors are in use such as robotics (Pomerleau et al., 2015). However, many such classical methods are implemented in different and often noninteroperable specialised software which may limit their adoption for use with SMLM data.
LOMAR brings together a few of these methods into one package (table 1): iterative closest point (ICP, Besl and McKay, 1992), coherent point drift (CPD, Myronenko and Song, 2010), and joint registration of multiple point clouds (JRMPC, Evangelidis et al., 2014)  ICP is one of the most used registration methods due to its simplicity. The algorithm is a pairwise method that iterates between two steps: 1-given a rigid transformation (translation and rotation), assign correspondence between two point sets A and B by finding the closest points in A for all points in B, 2-given the points correspondence, find the best rigid transformation that aligns B with A. Although ICP is sensitive to starting conditions and noise, it is fast and often provides good enough alignments.
However, better results are often obtained with methods involving Gaussian mixture models (GMMs) which allow incorporation of additional information such as level of noise or point localisation uncertainty.
Among these, CPD is a widely used pairwise registration method. In CPD, a Gaussian mixture model (GMM) is constructed from point set A and points in set B are considered observations from this GMM. The GMM centroids are then moved to maximise the likelihood of point set B. A coherence constraint is imposed on the movement to preserve the structure of the point sets. Fuzzy point set correspondences and the inclusion of a noise factor contribute to CPD's relative robustness to noise and outliers. Additional information can be included in the form of a points correspondence weight matrix. However, to ensure convergence, the algorithm uses equal isotropic covariances which precludes incorporating localization precision in the covariances if it is anisotropic.
To remedy this situation, we developed WGMMreg, an algorithm to register two GMMs by minimising the Wasserstein distance between them. An algorithm called GMMreg has already been described (Jian and Vemuri, 2011) and is based on the minimization of the L2 distance between GMMs. However, use of the L2 distance discards information about the relative positions of the points. A natural alternative to compare probability distributions is the Wasserstein distance. It also has the advantage of including information about the relative positions of the points due to the computation of an optimal transport map. However, while the Wasserstein distance can be computed in closed form between two Gaussian distributions, the Wasserstein distance between GMMs (linear combinations of Gaussians) is intractable. In our WGMMreg implementation we leverage recent results on the computation of a discrete form of a Wasserstein-type distance between GMMs (Delon and Desolneux, 2020).
The computational cost of the registration methods described above increases with the amount of information included and in particular, our current implementation of WGMMreg becomes prohibitive for large point sets although this can be mitigated with downsampling.
In addition, when registering a large number of point sets, the above methods require computing registrations between all pairs of points and subsequently using iterative methods (for example similar to the Barton-Sternberg algorithm for multiple sequence alignment (Barton and Sternberg, 1987)) to obtain a global registration map (see e.g. Heydarian et al., 2021). This also clearly becomes computationally expensive as the number of point sets increases.
For increased performance when many point sets need to be registered, we also implemented a method for the joint registration of multiple point clouds (JRMPC, Evangelidis et al., 2014).
In this approach, the point sets are considered generated by a common GMM and both the mixture and the registration parameters are estimated via an expectation maximisation (EM) algorithm. In addition to the registered point sets, the algorithm returns the transformation associated with each point set and the final GMM. A form of model selection can also be applied by removing unsupported components of the GMM using a minimum message length approach (Figueiredo and Jain, 2002). To illustrate use of this method, we applied it to the registration of 356 NUP107 nuclear pores obtained by STORM (Heydarian et al, 2021). The JRMPC method produces a global registration revealing the 8 subunits structure (figure 1a,b) in less than 20 min using only one thread on a 3.6 GHz CPU. In contrast, Heydarian et al. report that the all-vs-all registration of the same data takes 2 h using parallelization on a GPU. With JRMPC, the resulting GMM can also provide information about the underlying structure and potential outliers can be detected from the high variance of the components they are assigned to. In the case of NUP107 whose primary structure consists of two rings, this identifies the between-rings components as outliers with the other components forming a double ring structure (figure 1b). Eliminating points associated with the high variance components improves the overall registration outcome (figure 1a, compare left and right panels). The joint registration of multiple point clouds can also be used to quickly produce a good initial state for iterative methods leading to potential gains in computation time.

Topological data analysis
In topological data analysis (TDA), a set of data points with a measure of distance between them is seen as representing an underlying topological space. Relevant topological characteristics can be extracted from such topological spaces using the tools of persistent homology (Edelsbrunner and Harer, 2010;Otter et al, 2017, Amézquita et al, 2020. Informally, homology counts the number of "holes'' of different dimensions in a topological space, i.e. the 0th homology group counts the number of connected components, the first homology group counts the number of (2d) holes (i.e. holes inside loops), the 2nd homology group counts the number of (3d) voids (i.e. bubbles). Persistent homology efficiently computes homology groups at multiple scales. For this, each point set is converted into a simplicial complex based on a proximity parameter t. A simplex is an n-dimensional triangle and a simplicial complex is a collection of simplices connected by following specific rules to ensure that the homology of the underlying topological space is preserved (figure 2a).
Recording when holes form and get filled over a range of values of t produces a persistence diagram (figure 2b). To compute simplicial complexes and persistence diagrams, LOMAR relies on the Dionysus library with code borrowed from the TDA package (Fasy et al., 2014).
With this approach, each point set of a SMLM data set can be represented by a persistence diagram that characterises the shape of its underlying structure. Because persistent homology relies only on the pairwise distance between points, the shapes captured by persistence diagrams are independent of position and coordinate system. Persistent homology is robust to noise in the sense that small perturbations to the points positions will minimally affect the persistence diagram. Outliers are of more concern since they can have a large effect on the persistence diagram. However, as they tend to be further away from other data points they can either be detected and removed beforehand or various techniques can be used to deal with this issue such as de-emphasizing low density regions, for example by the use of a distanceto-measure function (Chazal et al., 2018).

Similarity measures for point sets
There are several ways of deriving structurally-meaningful similarity between two sets of points with LOMAR. One is to make use of registration errors obtained from the application of registration algorithms such as those described above. Another is to view the point sets as discrete samples from probability distributions and use a suitable measure of similarity between probability distributions such as the Wasserstein distance. Point sets shape similarity can also be evaluated by comparing their persistence diagrams. Conceptually, the first two approaches capture shape similarity by evaluating how much displacement is required to align the point sets while the TDA approach could be seen as comparing the multi-scale "porosity" of the underlying structures. These approaches are therefore complementary.
For compatibility with many machine learning methods which require positive definite similarity matrices, LOMAR implements the sliced Wasserstein distance between persistence diagrams (Carrière et al., 2017) and the persistence scale-space kernel (Reininghaus et al, 2015). The sliced Wasserstein distance is an approximation of the Wasserstein distance that is efficiently computed as a sum of Wasserstein distances between 1D projections of the points and produces a negative definite distance matrix. The persistence scale-space kernel results from mapping the persistence diagrams to a suitable vector space and computing a heat diffusion kernel.

Clustering point sets
To illustrate the use of similarity measures for clustering SMLM point sets, we use simulated data inspired by the nuclear pore complex. Ground truth data is produced using the method from (Theiss et al. 2022) and its acquisition in a typical SMLM experiment is simulated using the SMAP software (Ries, 2020).
For TDA-based clustering, a persistence diagram of sublevel sets of the distance to measure function is computed for each point set then the matrix of (sliced) Wasserstein distances between persistence diagrams is computed. For registration-based clustering, the (symmetrized) matrix of registration errors obtained from the pairwise application of the cpd algorithm is used as a distance matrix.
Following (Huijben et al. 2021), the distance matrix is used to embed the point sets into a low dimension space using multidimensional scaling. Clusters are then recovered by Gaussian mixture modelling of the data points in this space. To evaluate clustering quality, we assign to each cluster the label of the class most represented in that cluster and compare assigned labels to the ground truth class of origin.
NPCs tagged on different subunits produce structures with different ring diameters or different distances between the two rings (Sabinina et al., 2021). To illustrate how such structures can be separated, we produced 3 classes of simulated NPC structures that differ by their average ring diameter or by the average distance between the two rings (figure 3a).
Using TDA, the three classes can be separated with an average accuracy of 91% (figure 3b) while registration-based methods fail to separate the two classes that differ by their ring diameter (figure 3c).
NPCs typically have rings with an 8-fold radial symmetry but a small proportion may have a