A framework for evaluating the performance of SMLM cluster analysis algorithms

Single molecule localisation microscopy (SMLM) generates data in the form of Cartesian coordinates of localised fluorophores. Cluster analysis is an attractive route for extracting biologically meaningful information from such data and has been widely applied. Despite the range of developed cluster analysis algorithms, there exists no consensus framework for the evaluation of their performance. Here, we use a systematic approach based on two metrics, the Adjusted Rand Index (ARI) and Intersection over Union (IoU), to score the success of clustering algorithms in diverse simulated clustering scenarios mimicking experimental data. We demonstrate the framework using three analysis algorithms: DBSCAN, ToMATo and KDE, show how to deduce optimal analysis parameters and how they are affected by fluorophore multiple blinking. We propose that these standard conditions and metrics become the basis for future analysis algorithm development and evaluation.


Results
Data simulation and pre-evaluation
Data generated from SMLM experiments can exhibit a wide range of spatial organisation, owing to the underlying biology and sample labelling. This may include varied levels of non-clustered points, sparse data, large numbers of clusters and irregular cluster shapes.
These properties impact both the optimal algorithm parameters and their performance. We simulated 9 different ground truth molecule arrangements in 2000 x 2000 nm regions (Table 1, Figure 2a-i, Online Methods). Scenario 1: non-clustered single molecules seeded at completely spatially random (CSR) positions (Figure 2a). Scenario 2: 20 clusters of 15 molecules per cluster, with 50% of the total molecules clustered (Figure 2b). Scenario 3: 20 clusters of 15 molecules per cluster, with 20% of the total molecules clustered (Figure 2c). Scenario 4: 20 clusters of 5 molecules per cluster, with 50% of the total molecules clustered (Figure 2d). Scenario 5: 100 clusters of 15 molecules per cluster, with 50% of the molecules clustered (Figure 2e). Scenario 6: 20 elliptically shaped clusters (aspect ratio 3:1) of 50 molecules each, with 50% of the total molecules clustered (Figure 2f). Scenario 7: 10 clusters with a distribution width of 25 nm and 10 clusters with a width of 75 nm, with 50% of the total molecules clustered (Figure 2g). Scenario 8: 10 clusters with 5 molecules per cluster and 10 clusters with 15 molecules per cluster, with 50% of the total molecules clustered (Figure 2h). Scenario 9: 10 clusters of 15 molecules with a distribution width of 25 nm, and a further 10 clusters of 135 molecules with a distribution width of 75 nm, thus maintaining molecule density with increased size, with 50% of the total molecules clustered (Figure 2i). Clustering algorithms will produce erroneous results when applied to data with little inherent clustering. Ripley's K-function was therefore computed for all of the scenarios above (Figure 2a-i, bottom panels). All but Scenario 1 show significant deviation from a simulated curve for a completely random distribution (Figure 2b-i, bottom panels, red dotted line).
Therefore, Scenario 1 is not classified as significantly clustered using this approach, and hence not appropriate for further cluster identification.
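To illustrate this pre-evaluation step, a naive Ripley's K estimator can be sketched in Python (without the edge correction a full implementation would apply); the point counts and region size below are arbitrary choices for the example, not the paper's simulation parameters:

```python
import math
import random

def ripley_k(points, r, area):
    """Naive Ripley's K estimate (no edge correction):
    K(r) = area / n^2 * number of ordered pairs within distance r."""
    n = len(points)
    count = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and (xi - xj) ** 2 + (yi - yj) ** 2 <= r * r:
                count += 1
    return area * count / (n * n)

random.seed(1)
side = 2000.0  # nm, matching the simulated 2000 x 2000 nm regions
csr = [(random.uniform(0, side), random.uniform(0, side)) for _ in range(200)]

# one tight cluster plus uniform background, to contrast with CSR
clustered = [(random.gauss(1000, 25), random.gauss(1000, 25)) for _ in range(100)] \
          + [(random.uniform(0, side), random.uniform(0, side)) for _ in range(100)]

r = 100.0
print(ripley_k(csr, r, side * side))        # close to the CSR expectation pi * r^2
print(ripley_k(clustered, r, side * side))  # far above the CSR expectation
```

Under CSR, K(r) ≈ πr²; values well above this indicate clustering, which is the basis for excluding Scenario 1 from further cluster identification.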

Performance metrics
We next analysed the simulations with clustering algorithms and scored the results. Given the breadth of clustering algorithms and analyses available, it was important to select algorithms that differ in their approach but require similar user guidance. To this end, we chose DBSCAN 28 (Supplementary Figure 2a), ToMATo 25 (Supplementary Figure 2b), and kernel density estimation (KDE, Supplementary Figure 2c). These algorithms differ in the method by which they identify cluster assignments but have comparable parameters to be set by the user (Supplementary Figure 2). For example, DBSCAN determines clusters and cluster edges from the number of neighbours of a given point relative to a threshold, whereas KDE uses an image-based approach in which pixel values represent the kernel density and points are clustered using a threshold. All three algorithms require two user-defined inputs: minPts and ε for DBSCAN, a threshold on density mode persistence and a search radius for ToMATo, and a kernel size (s) and density threshold for KDE. Clustering was performed for each algorithm by scanning combinations of these parameters. This was repeated for 50 simulations within each scenario, and the mean score and variance for each parameter combination calculated. The optimal clustering parameters for each algorithm, scenario and cluster metric are summarised in Supplementary Table 1. For an example scenario (Fig. 3a), the performance of the three algorithms is shown for every combination of user analysis settings (Fig. 3b-d). The data show a broad maximum of scores across the parameter space; however, a rapid drop-off and a high-variance region exist at small-scale parameter settings, meaning that if the optimal settings are not known, it is safer to err on the large-scale side. Fig. 3e shows the maximum scores (i.e., for the best user settings) for each of the algorithms.
The full performance analysis for all algorithms against all scenarios, assessed by both metrics, is shown in Supplementary Figures 3-9. The best score obtained from the optimal combination of input parameters in each case is shown in Figure 4 and Supplementary Figure 10. The best performing algorithm can depend on the scenario. With high levels of non-clustered points (Scenario 3, Figure 4a) or many clusters (Scenario 5, Figure 4c and Supplementary Figure 5c), for example, KDE maintains higher scores than the other algorithms. Furthermore, KDE deals well with scenarios containing variable cluster size and density, i.e., Scenarios 7 and 9 (Supplementary Figures 7 and 9).

Effect of multiple blinking on optimal clustering parameters
An inherent property of SMLM data is the presence of multiple points arising from a single molecule. In PALM data, approximate correction is possible by grouping localisations that appear close in space and time; such strategies are less effective for dSTORM data. Each ground truth fluorophore may therefore appear as a small cluster, with membership related to the number of blinks and size related to the localisation precision. We tested the three cluster analysis algorithms' performance against simulated data in the presence of multiple emission events. Each molecule in the ground truth conditions was assumed to be stoichiometrically labelled with a single fluorophore, and the probability of that fluorophore blinking followed a geometric distribution 20 set such that a single fluorophore would give 4-5 detections on average. The optimal clustering parameters for each algorithm are summarised in Supplementary Table 1. The effect of this added blinking step on Scenario 2 is shown in Figure 5, together with the complete performance evaluation for all three algorithms. There is an obvious drop-off in performance compared to the case with no blinking, as the stochastic nature of the blinking introduces greater heterogeneity into the data. However, DBSCAN is more robust to this effect than the other two algorithms. KDE and ToMATo have similar performance by the ARI measure, but ToMATo is superior if the area of detected clusters is the required output (Figure 5c, right). This illustrates that the choice of algorithm depends not only on the nature of the underlying distribution, but also on the biological question being asked, i.e., which cluster descriptors are required to be accurate. The full performance analysis is shown in Supplementary Figures 11-17.

Discussion
SMLM produces data in the form of a pointillist set of localisation coordinates. A frequent goal is the cluster analysis of such data to extract quantitative information on molecular aggregation, and to this end a wide variety of algorithms have been deployed 11 . As development of new algorithms is ongoing, it is advantageous to have a framework in which their performance can be systematically evaluated. Here, we provide such a framework. Firstly, we propose a set of simulated point distributions with known ground truth clustering. We generated the same conditions with multiple blinking present to provide more challenging conditions. To assess performance, we deployed two metrics designed for this purpose, ARI and IoU, which test different facets of the analysis performance. The ARI scores the analysis based on the similarity of cluster membership between the output and the ground truth, computed from whether every pair of points shares the same cluster label in each case; the input is therefore only the ordered set of cluster labels. IoU uses a measure of spatial overlap between the ground truth and detected cluster areas, and takes as input the x,y coordinates as well as the labels. To illustrate the applicability of both metrics, we demonstrated them with the results from three cluster analysis algorithms: DBSCAN, ToMATo and KDE. These algorithms are diverse, but all require a choice of parameters by the user. We tested the performance of the algorithms over a range of parameter combinations and scored the results using ARI and IoU. ARI typically returned higher scores than IoU for the majority of ground truth scenarios without multiple blinking. This is because the convex hull used for IoU relies on a small number of molecules at the periphery of the clusters, which are the points most likely to be misassigned by the algorithms. Thus, ARI is the more appropriate metric when clusters have a long tail to their molecule distribution.
However, the ARI metric operates best when the level of cluster overlap is low and/or the size of the clusters is consistent, as both these effects can lead to overlap of ground truth clusters. IoU is therefore a better measure of performance if heavy cluster overlap is expected. IoU is also more robust against multiple blinking than ARI, owing to the increased number of localisations along the edges of the clusters. For ARI, in contrast, the emergence of new clusters emanating from blinking monomers adds new cluster indexes to the results, decreasing the overall score even though the true clusters are classified correctly. Finally, since IoU takes as input the actual x,y coordinates as well as the labels, the IoU metric itself is dependent on the cluster analysis and the underlying distribution. It may therefore be less useful to compare IoU scores between ROIs and scenarios than between algorithms. We anticipate that the framework presented here can be used to evaluate the performance of present cluster analysis algorithms designed for SMLM, and to inform the development of future methodologies. Further, the bases of both metrics, i.e., point classification and geometric overlap, are easily extendable to 3D, thus widening the applicability of this framework.

Methods
Ground Truth Cluster Simulations. Simulations of ground truth molecule point patterns were generated using scripts written in R. For simulations based on multivariate normal distributions, the cluster centres are randomly generated within a square field of interest (here, equivalent to 2 x 2 µm). At each centre, molecules are then placed according to the random multivariate distribution. For symmetrical clusters, a single value of 25 nm is used for the standard deviation of the multivariate distribution (unless stated otherwise, Table 1), whereas for the elliptically shaped clusters two different standard deviation values are used for the minor and major axes, 25 and 75 nm, respectively. Elliptical clusters are additionally rotated by a random angle about the centre of mass of the generated cluster. Each cluster generated within the simulations possesses a unique "cluster index" value, i.e., molecules from the same cluster have the same index value. Background molecules are given the index value "0". For each cluster scenario, 50 simulations were generated. The different parameter values used to generate the simulations in this work are summarised in the table below.

Fluorescence Blinking Cluster Simulations. The positions of molecules/fluorophores in the ground truth cluster scenarios were used as the basis for simulating data with the multiple blinking and detection precision inherent in SMLM. The simulateSTORM.r script from the RSMLM package (available at: https://github.com/JeremyPike/RSMLM) was used to generate the blinking simulations 25 . Briefly, transitions between the fluorescent on- and off-states were modelled using a geometric distribution 20,25 with the probability of transition to the dark state set to 0.2, generating on average 4-5 fluorescent on-states, and thus detections, per molecule. Blinking was applied to all molecules in the simulations; thus, single background molecules were also prone to blinking.
Detections owing to a single molecule are all ascribed the index value of that molecule from the ground truth, e.g., if detections are generated from a ground truth molecule with a cluster index of "5", then the detections too retain the cluster index of "5". The localisation uncertainty for each blinking event was determined using a normal distribution centred on the molecule position. Standard deviations for the localisation uncertainty were drawn from a log-normal distribution with mean 2.8 and standard deviation 0.28 20 .
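The simulation pipeline above was implemented in R; purely as an illustrative sketch, the same steps (normally distributed clusters, geometric blinking, log-normally distributed localisation uncertainty) might look like the following in Python. The function names, counts, and the interpretation of the log-normal parameters as log-scale values are our assumptions, not the paper's code:

```python
import random

random.seed(0)
SIDE = 2000.0  # nm, the simulated field of interest

def simulate_ground_truth(n_clusters=20, mols_per_cluster=15, sd=25.0, n_background=300):
    """Place cluster centres uniformly, then draw molecules from an isotropic
    bivariate normal around each centre. Returns (x, y, index) tuples;
    background molecules get index 0."""
    mols = []
    for idx in range(1, n_clusters + 1):
        cx, cy = random.uniform(0, SIDE), random.uniform(0, SIDE)
        for _ in range(mols_per_cluster):
            mols.append((random.gauss(cx, sd), random.gauss(cy, sd), idx))
    for _ in range(n_background):
        mols.append((random.uniform(0, SIDE), random.uniform(0, SIDE), 0))
    return mols

def simulate_blinking(mols, p_dark=0.2):
    """Each molecule emits a geometrically distributed number of blinks
    (mean 1/p_dark = 5); each blink is jittered by a localisation
    uncertainty whose standard deviation is drawn from a log-normal
    distribution (parameters assumed to be on the log scale)."""
    detections = []
    for x, y, idx in mols:
        n_blinks = 1
        while random.random() > p_dark:  # stay "on" with probability 1 - p_dark
            n_blinks += 1
        loc_sd = random.lognormvariate(2.8, 0.28)  # per-molecule precision, nm
        for _ in range(n_blinks):
            # detections inherit the ground-truth cluster index of the molecule
            detections.append((random.gauss(x, loc_sd), random.gauss(y, loc_sd), idx))
    return detections

gt = simulate_ground_truth()
det = simulate_blinking(gt)
print(len(gt), len(det), len(det) / len(gt))  # roughly 5 detections per molecule
```

The geometric-blinking loop gives a mean of 1/0.2 = 5 on-states per fluorophore, matching the 4-5 detections per molecule quoted above.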
DBSCAN clustering and parameter scanning. The density-based spatial clustering of applications with noise (DBSCAN) algorithm was implemented in R using the dbscan R package. DBSCAN has two parameters: ε, the radius of search around each point, and minPts, the minimum number of neighbouring points within that radius required for a point to be assigned to a cluster. Points within ε of clustered points but failing to fulfil minPts are designated the edge of the cluster. For DBSCAN parameter scanning, ε (nm) and the minPts threshold were varied. For ε, a minimum value of 5 nm was used and stepped by 5 nm up to a maximum of 100 nm (20 steps). For the minPts threshold, a minimum value of 2 was used and stepped by 1 up to a maximum value of 50 (49 steps). Therefore, for each simulation, 980 combinations of ε vs minPts were evaluated, and the resulting indexing from each combination retained for further analysis.
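As a sketch of this clustering step (the paper uses the dbscan R package; this minimal Python version is for illustration only and ignores the efficiency and exact border-point semantics of the real implementation):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: a point is a core point if it has at least
    min_pts neighbours within eps (counting itself); clusters grow by
    expanding neighbourhoods from core points. Returns one label per point;
    0 marks noise/background, matching the simulations' background index."""
    eps2 = eps * eps
    neighbours = [
        [j for j, (xj, yj) in enumerate(points)
         if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps2]
        for xi, yi in points
    ]
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None or len(neighbours[i]) < min_pts:
            continue  # already assigned, or not a core point
        cluster += 1
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # only core points expand the cluster
                stack.extend(neighbours[j])
    return [0 if lab is None else lab for lab in labels]

# two tight clusters and one isolated point
pts = [(0, 0), (1, 0), (0, 1), (1, 1),
       (100, 100), (101, 100), (100, 101), (101, 101),
       (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))  # → [1, 1, 1, 1, 2, 2, 2, 2, 0]
```

A parameter scan then simply calls this function for every (ε, minPts) pair on the grid described above and retains the label vectors for scoring.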
ToMATo clustering and parameter scanning. The Topological Mode Analysis Tool (ToMATo) algorithm was implemented in R using the clusterTomato function from the RSMLM library (available at: https://github.com/JeremyPike/RSMLM) 25 . For ToMATo parameter scanning, the search radius (nm) and the birth density threshold were varied. For the search radius, a minimum value of 5 nm was used and stepped by 5 nm up to a maximum of 100 nm (20 steps). For the birth density threshold, a minimum value of 2 was used and stepped by 1 up to a maximum value of 50 (49 steps). Therefore, for each simulation, 980 combinations of search radius vs birth density threshold were evaluated, and the resulting indexing from each combination retained for further analysis.
Kernel Density Estimation clustering and parameter scanning. Kernel density estimation (KDE) was performed using the kde2d function from the MASS R package. A 2D matrix was generated using the minimum and maximum dimensions of the simulation data, with each element in the matrix corresponding to a 1 nm² region. The simulation coordinates are then convolved with a 2D Gaussian kernel within this 2D matrix, and the density within each 1 nm² region after convolution calculated. This 2D density matrix can then be thresholded at a specific density value, and regions above the cut-off are considered "clustered". These regions are then used to assign points from the data into clusters. For KDE parameter scanning, the 2D Gaussian kernel width and the density threshold were varied. For the kernel width, a minimum value of 50 nm was used and stepped by 50 nm up to a maximum of 500 nm (10 steps). For the density threshold, a minimum value of 0.1 × 10⁻⁷ was used and stepped by 0.5 × 10⁻⁷ up to a maximum value of 0.1 × 10⁻⁶ (10 steps). Therefore, for each simulation, 100 combinations of kernel size vs density threshold were evaluated, and the resulting indexing from each combination retained for further analysis.
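A minimal Python sketch of this image-based KDE procedure follows; coarser 5 nm pixels and a truncated kernel keep the example small, and all parameter values are illustrative rather than the paper's:

```python
import math

def kde_cluster(points, side, pixel=5.0, kernel_sd=20.0, threshold=1e-3):
    """Image-based KDE clustering sketch: accumulate a Gaussian kernel for
    each point on a grid, threshold the density image, label connected
    (4-connected) above-threshold regions, then assign each point the
    label of its pixel (0 = unclustered)."""
    w = int(side / pixel)
    dens = [[0.0] * w for _ in range(w)]
    reach = int(3 * kernel_sd / pixel)  # truncate the kernel at 3 sd
    norm = 1.0 / (2 * math.pi * kernel_sd ** 2)
    for x, y in points:
        px, py = int(x / pixel), int(y / pixel)
        for gy in range(max(0, py - reach), min(w, py + reach + 1)):
            for gx in range(max(0, px - reach), min(w, px + reach + 1)):
                dx, dy = (gx + 0.5) * pixel - x, (gy + 0.5) * pixel - y
                dens[gy][gx] += norm * math.exp(-(dx * dx + dy * dy) / (2 * kernel_sd ** 2))
    # connected-component labelling of the thresholded density image
    region = [[0] * w for _ in range(w)]
    n_regions = 0
    for sy in range(w):
        for sx in range(w):
            if dens[sy][sx] > threshold and region[sy][sx] == 0:
                n_regions += 1
                region[sy][sx] = n_regions
                stack = [(sx, sy)]
                while stack:
                    cx, cy = stack.pop()
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < w and 0 <= ny < w and \
                           dens[ny][nx] > threshold and region[ny][nx] == 0:
                            region[ny][nx] = n_regions
                            stack.append((nx, ny))
    return [region[int(y / pixel)][int(x / pixel)] for x, y in points]

pts_a = [(100 + 3 * i, 100 + 3 * j) for i in range(3) for j in range(3)]
pts_b = [(400 + 3 * i, 400 + 3 * j) for i in range(3) for j in range(3)]
background = [(400.0, 100.0)]
labels = kde_cluster(pts_a + pts_b + background, side=500)
print(labels)  # the two dense blocks form regions; the lone point stays 0
```

The scan then repeats this for every kernel width and threshold on the grid described above.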
Adjusted Rand Index Scoring. Cluster index results from the clustering and parameter scanning are used to compute the adjusted Rand index (ARI). The ARI calculation was implemented in R using the function mclustcomp (the full ARI R script can be found here: https://github.com/DJ-Nieves/ARI-and-IoU-cluster-analysis-evaluation). Briefly, the Rand index (RI) is a measure of the similarity between two sets of cluster indexes, and is calculated as

$$RI = \frac{a + b}{a + b + c + d}$$

where a + b is the number of agreements between the ground truth and the cluster results, and c + d is the number of disagreements between the ground truth and the cluster results.
The ARI corrects for the random chance of points being assigned to the correct clusters, and is calculated as follows:

$$\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_i}{2}+\sum_{j}\binom{b_j}{2}\right] - \left[\sum_{i}\binom{a_i}{2}\sum_{j}\binom{b_j}{2}\right]\Big/\binom{n}{2}}$$

where n_ij is the number of elements in common between cluster i of one clustering and cluster j of the other, a_i is the sum of row i of the contingency table, b_j is the sum of column j, and n is the total number of points.
For each parameter combination, the clustering result (i.e., the cluster indexes) is compared to that of the ground truth simulation, and the ARI is calculated for that parameter combination. For the fluorescent blinking simulation data, the clustering results are compared to the cluster indexes generated by the simulateSTORM algorithm as the "ground truth" condition. This is repeated for all parameter combinations to generate an ARI matrix.
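For illustration (the paper's implementation is in R), the ARI can be computed directly from the contingency table of the two label sequences; note that this sketch treats the background label "0" as a cluster of its own:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI from the contingency table of two label sequences of equal length."""
    assert len(truth) == len(pred)
    n = len(truth)
    contingency = Counter(zip(truth, pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())       # sum over n_ij
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())     # row sums a_i
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())      # column sums b_j
    expected = sum_a * sum_b / comb(n, 2)   # chance-expected index
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate clusterings
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

truth = [1, 1, 1, 2, 2, 2, 0, 0]
print(adjusted_rand_index(truth, truth))                         # 1.0
print(adjusted_rand_index(truth, [2, 2, 2, 1, 1, 1, 3, 3]))      # 1.0, labels permuted
```

Because the ARI depends only on co-membership of point pairs, any relabelling of the clusters leaves the score unchanged, as the second example shows.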
Intersection over Union Scoring. Intersection over union (IoU) scoring was implemented in a custom-written script in R (https://github.com/DJ-Nieves/ARI-and-IoU-cluster-analysis-evaluation). A convex hull was used to identify the molecule coordinates at the edge of each ground truth cluster. A filled polygon for each cluster was then generated within a binary image matching the limits of the data (pixel area = 1 nm²). All cluster images were then added together to generate a single image, which was then flattened into a combined binary image. This process was performed for the ground truth clustering as well as for each of the clustering results from the cluster parameter scanning. IoU is calculated as follows;

$$\mathrm{IoU} = \frac{\text{area of overlap}}{\text{combined area}}$$
For the calculation from our binary images, the ground truth image was added to the cluster result image, giving a single image in which overlapping pixels have a value of 2. The number of pixels with value > 0 equals the combined area (nm²), whereas the number of pixels with value > 1 equals the area of overlap (nm²). For the fluorescent blinking simulation data, the clustering results are compared to an image generated from the convex hulls of the cluster indexes generated by the simulateSTORM algorithm as the "ground truth" condition. This is repeated for all parameter combinations to generate an IoU matrix.
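The pixel-counting step described above can be sketched as follows (Python for illustration; the convex-hull rasterisation that produces the binary images is omitted):

```python
def binary_iou(img_a, img_b):
    """IoU of two binary images, computed exactly as described in the text:
    sum the images pixel-wise, count pixels > 0 (combined area) and
    pixels > 1 (area of overlap), and take their ratio."""
    union = overlap = 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            s = a + b
            if s > 0:
                union += 1
            if s > 1:
                overlap += 1
    return overlap / union if union else 0.0

# two 4x4 masks overlapping in a 2x2 block: overlap 4 pixels, union 14
gt = [[1 if x < 3 and y < 3 else 0 for x in range(4)] for y in range(4)]
res = [[1 if x >= 1 and y >= 1 else 0 for x in range(4)] for y in range(4)]
print(binary_iou(gt, res))  # → 4/14 ≈ 0.2857
```

At 1 nm² per pixel, as in the Methods, the union and overlap counts correspond directly to areas in nm².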

Data Availability
The simulation data used as the basis for this work is available for download at https://github.com/DJ-Nieves/ARI-and-IoU-cluster-analysis-evaluation without restriction. All other data are available upon request. R Code for calculating ARI and IoU for clustering results against a ground truth scenario is also available for download at https://github.com/DJ-Nieves/ARI-and-IoU-cluster-analysis-evaluation without restriction.