Super-resolution fight club: A broad assessment of 2D & 3D single-molecule localization microscopy software

With the widespread uptake of 2D and 3D single molecule localization microscopy, a large set of different data analysis packages have been developed to generate super-resolution images. To guide researchers on the optimal analytical software for their experiments, we have designed, in a large community effort, a competition to extensively characterise and rank these options. We generated realistic simulated datasets for popular imaging modalities – 2D, astigmatic 3D, biplane 3D, and double helix 3D – and evaluated 36 participant packages against these data. This provides the first broad assessment of 3D single molecule localization microscopy software, provides a holistic view of how the latest 2D and 3D single molecule localization software perform in realistic conditions, and ultimately provides insight into the current limits of the field.

Image processing software is central to single molecule localization microscopy (SMLM), which delivers an order of magnitude resolution improvement on diffraction limited conventional fluorescence microscopy, from 250 nm to approximately 20 nm resolution, by temporal separation of fluorophores within a sample [1-3]. Efficient and automated image processing is essential to extract the super-resolved positions of individual molecules from thousands of raw microscope images, containing millions of blinking fluorescent spots.

Improvements in SMLM image processing algorithms have been crucial in maximizing spatial resolution and in reducing the imaging time of SMLM for compatibility with live cell imaging [4-6]. If SMLM is to achieve a resolving power approaching that of electron microscopy, the analysis software employed needs to be robust and accurate, and to perform at current algorithmic limits. This can only be achieved through rigorous quantification of SMLM software performance.

The first localization microscopy software challenge was carried out in 2013, to enable robust benchmarking of 2D localization microscopy software packages [7]. But biology is not just a 2D problem, and a key focus of localization microscopy is the 3D imaging of nanoscale cellular processes [8,9]. 3D localization microscopy is a more difficult image processing problem than 2D SMLM. In addition to finding the center of diffraction limited spots to super-resolve lateral position, 3D SMLM algorithms must also extract axial information from the image, usually by measuring small changes in the shape of a point-spread function [10] (PSF).

There are roughly three common approaches for 3D SMLM.
First, point spread function engineering, where the axial asymmetry of the microscope point spread function (PSF) is increased by introducing intentional aberrations in the system, ranging from simple astigmatism [10] to more complex PSF manipulation such as the double helix PSF method [11]. Second, biplane or multiplane imaging, where axial position is measured based on simultaneous measurement of PSF shape at two or more focal planes [12]. Third, dual objective based interferometry, where Z-position is calculated from single photon interference between opposing objectives [13]. Multiplane and PSF engineering methods typically obtain axial resolutions on the order of 50 nm [10,11]. Interferometry achieves the best axial resolution, 10-20 nm [13], but is not yet widely adopted.

Despite the widespread use of 3D localization microscopy, and the challenging nature of 3D SMLM image processing, the performance of software for 3D single molecule localization microscopy has previously only been assessed for 2 or 3 software packages at a time, and without standard test data or metrics [14-17]. In the absence of common reference datasets and a reliable assessment procedure for 3D software performance, it is not possible to objectively assess how different software affects final image quality, or which algorithmic approaches are most successful. Crucially, end-users cannot determine which 3D SMLM software package and imaging modality is optimal for their application.

We therefore ran the first 3D localization microscopy software challenge, to assess the performance of 3D SMLM software. We assessed software performance on synthetic datasets for three popular 3D SMLM modalities: astigmatic imaging, biplane imaging and double helix point spread function microscopy. We also assessed astigmatism software performance on two real STORM datasets.
We ran a second 2D localization microscopy software challenge, to reassess the 2D SMLM software state of the art on new, tougher, more realistic datasets.

Our simulations incorporated experimentally acquired point spread functions for maximal authenticity, used signal and noise levels based closely on common experimental conditions, and incorporated a realistic 4-state model of fluorophore photophysics [18]. Our synthetic data was designed to mimic two common classes of cellular structure: narrow line-like microtubules (MT) and larger tubes similar to the endoplasmic reticulum (ER) or mitochondria. Our simulations also included conditions with low density (LD) of active fluorophores, used experimentally to obtain maximal resolution, and with high density (HD) of active fluorophores, used experimentally for fast or live cell imaging.

Competition design
We established a large committee from within the SMLM research community, including experimentalists and software developers, to define the scope of the challenge, ensure realism of the datasets and define analysis metrics. We further opened this discussion to the whole community, through an open forum, discussing best practices for the implementation of this contest [19].

Thirty-six software packages have been entered in the competition thus far, including four packages used in commercial software (Table S1, Supplementary Note 5). Excitingly, participation in the competition led at least 8 teams to extend their software to support additional 3D SMLM modalities, showing how competition fosters microscopy software development.

In 2016, we ran a first round of the 3D SMLM competition with explicit submission deadlines, with 30 competitor teams, culminating in a special session at the 6th annual Single Molecule Localization Microscopy Symposium (SMLMS 2016). Since then, the challenge has been opened to continuously accept new entries. We have had 12 new registrations, of which 5 have submitted localizations, including a multiple best-in-class performer (SMAP-2018 [20], an updated version of previously entered software), demonstrating the utility of the competition as an evolving measure of the state of the field.

Testing super-resolution software on experimental data lacks the ground truth information required for rigorous quantification of software performance. Therefore, realistic simulated 3D SMLM datasets are required. After comparison of simulated microscope PSFs with multiple experimental PSFs from SMLM microscopes around the world, we observed that a critical challenge to realistic 3D SMLM simulations was to accurately model the experimental microscope PSF for each 3D modality. Even experimental 2D PSFs showed significant aberrations away from the focal plane (Fig S10).

3D SMLM inherently involves addition of aberrations to the microscope PSF to encode the Z-position of the molecule. For the PSF models included in the competition: 2D, astigmatic (AS), double helix (DH), and biplane (BP), we observed that the PSFs showed complex aberrations not well described by simple analytical models (Fig S10). We thus combined experimental 3D PSFs with simulated ground truth by performing simulations using PSFs directly derived from experimental calibration data (Fig 1, Methods). The experimental PSFs used to generate the simulated data are available online (Methods).

As the goal of this study was to compare software performance obtained on typical SMLM microscopes, we deliberately chose PSFs representative of common implementations of each 3D modality. However, additional PSF engineering should improve results of any specific modality, for example adaptive-optics corrected astigmatism [21], or reduced Z-range, higher SNR DH-PSF designs [22].

For the 3D competition, we simulated synthetic 25 nm diameter microtubules (Fig 1). For the 2D competition, in addition to synthetic microtubules (MT), we simulated larger diameter 150 nm cylinders, designed to approximate larger cellular structures such as mitochondria and the endoplasmic reticulum (ER) (Fig 1).
We incorporated a 4-state model of fluorophore photophysics, including a transient dark state (dye "blinking") and a bleaching pathway (Fig S1C).

As performance at different densities of active emitters is a key challenge for SMLM software, we generated 3D competition datasets at both sparse emitter density (0.2 mol. [molecule] μm⁻²) and high emitter density (2 mol. μm⁻²). We additionally generated a very high density dataset (5 mol. μm⁻²) for the 2D competition.

We generated data at three different signal-to-noise ratio (SNR) levels, based on real signal to noise levels encountered under common SMLM experimental scenarios: fixed cells antibody labelled with organic dye [10], fluorescent protein labelling [1], and live cell affinity dye labelling [23,24].

Together, these simulations closely resemble experimental 3D and 2D data under a range of challenging conditions of SNR, spot density, axial thickness and structure, summarized in Table S2. In addition, we provide simulated z-stacks of bright beads for software calibration. The competition datasets are available online (Methods).

Quantitative performance assessment of 3D software
We assessed software performance by 26 quality metrics (Supplementary Note 1). The complete set of summary statistics, axially resolved performance and super-resolved images is available for each competition software on the competition website. We built an interactive ranking and graphing interface that allows easy ranking and graphing of software performance by any metric, including new user defined metrics (Fig S11). Detailed individual software reports can also be accessed, along with a tool for side-by-side comparison of software (Fig S11, S22).

We focused our analysis primarily on metrics directly derived from single molecule localizations.
Choice of ranking metric is discussed in detail in Supplementary Note 1.6, where several alternative ranking metrics are also presented.

For ranking purposes, we developed a single summary statistic for overall evaluation of software performance, which we term the efficiency (E), encapsulating both the software's ability to find molecules, measured by the Jaccard index, and the software's ability to precisely localize molecules.
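As a rough illustration of how such a combined statistic can behave, the sketch below merges a Jaccard index (in %) and an RMSE (in nm) into one score. The functional form, E = 100 - sqrt((100 - JAC)^2 + (α·RMSE)^2), is an assumption chosen to match the description (a score of 100 for perfect detection and zero error, with a weight α trading off the two terms); the defining equation is not reproduced in this text, and the function names are hypothetical.

```python
import math


def efficiency(jaccard_percent, rmse_nm, alpha_per_nm):
    """Combine detection (Jaccard index, %) and localization error (RMSE, nm).

    Assumed form: E = 100 - sqrt((100 - JAC)^2 + (alpha * RMSE)^2).
    """
    return 100.0 - math.hypot(100.0 - jaccard_percent, alpha_per_nm * rmse_nm)


def efficiency_3d(jaccard_percent, rmse_lat_nm, rmse_ax_nm,
                  alpha_lat=1.0, alpha_ax=0.5):
    """Average of lateral and axial efficiencies (alpha values in nm^-1)."""
    e_lat = efficiency(jaccard_percent, rmse_lat_nm, alpha_lat)
    e_ax = efficiency(jaccard_percent, rmse_ax_nm, alpha_ax)
    return 0.5 * (e_lat + e_ax)
```

Under this assumed form, a Jaccard index of 85 % with a lateral RMSE of 26 nm and α = 1 nm⁻¹ gives a lateral efficiency of about 70.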

The trade-off between these two metrics is controlled by a parameter α. In a retrospective analysis, we chose α = 1 nm⁻¹ for the lateral efficiency E_lat and α = 0.5 nm⁻¹ for the axial efficiency E_ax, based on the linear regression slope between the localization errors and Jaccard index (Fig 17J-K). Using this definition, average software performance has an efficiency in the range 25-75, and perfect software would have the maximum efficiency of 100. Overall 3D efficiency was calculated as the average of lateral and axial efficiencies. Overall software rankings (Fig 2) were calculated as the sum of rankings for high and low SNR datasets.

Performance of 3D software
Complete rankings for each imaging modality and spot density are presented (Fig 2), together with summary information on all competition software (Table S1, Supplementary Note 1). As these data are continuously updated on the competition website, this resource provides microscopists with a quick reference for the current state of the art, including current best-in-class performers for each category.

After assembling an overall summary of best performers for each competition category, we investigated the performance of software within each imaging modality.
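The rank aggregation used for the overall rankings (the sum of per-dataset ranks across the high and low SNR datasets) can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the handling of ties and of software missing from one dataset is an assumption, not something specified in the text.

```python
def rank(scores):
    """Return 1-based ranks, highest score first (ties broken by name)."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: i + 1 for i, name in enumerate(ordered)}


def overall_ranking(eff_high_snr, eff_low_snr):
    """Order software by the sum of its ranks on the two datasets.

    Both arguments map a software name to its efficiency on that dataset.
    Only software present in both datasets is ranked (an assumption).
    """
    r_hi, r_lo = rank(eff_high_snr), rank(eff_low_snr)
    common = set(r_hi) & set(r_lo)
    rank_sum = {name: r_hi[name] + r_lo[name] for name in common}
    return sorted(common, key=lambda name: (rank_sum[name], name))
```

Summing ranks rather than raw efficiencies keeps one dataset with a wide score spread from dominating the overall order.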

Astigmatic localization microscopy
Astigmatic localization microscopy is probably the most popular 3D SMLM imaging modality, reflected by the highest number of software submissions in the 3D competition (Fig 2). For astigmatism, we observed a large spread of software performance, even for the most straightforward high SNR, low spot density (LD) conditions (Fig 3, Table S5). The best-in-class software (SMAP-2018) has significantly better localization error and Jaccard index performance than average (lateral RMSE 26 nm best vs 38 nm average, axial RMSE 29 nm best vs 66 nm average, Jaccard index 85 % best vs 74 % average). Clearly, the quality of the image reconstruction depends strongly on choice of 3D software.

To investigate the reasons for software variation, we inspected plots of software performance as a function of axial position in the low density, high SNR dataset for best-in-class and representative middle-range software (Fig S7A). We observed that the key cause of the spread in software performance is variation in software performance away from the focal plane. Near the focal plane, most software packages perform well. However, the axial and lateral RMSE away from the plane of focus is significantly lower for the best-in-class software, and the Jaccard index is also slightly improved (Fig S7A). This is also visibly apparent in the super-resolved images (Fig 4A). We observed that best-in-class software had a Z-range (the FWHM range of axially resolved software recall, Methods) of 1170 nm, greater than two-thirds of the simulated range. Outside this range, the recall and Jaccard index dropped sharply, probably due to the large increase in PSF size and decrease in effective SNR at significant defocus (Fig S10).
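The Z-range statistic defined above (the FWHM of recall as a function of axial position) can be illustrated with a minimal sketch. It assumes that each ground-truth molecule has already been flagged as detected or missed; the bin width, the absence of interpolation at the half-maximum crossings, and the function name are illustrative assumptions rather than the procedure given in the Methods.

```python
def z_range_fwhm(z_nm, detected, bin_nm=50.0):
    """FWHM (nm) of recall versus axial position.

    z_nm: ground-truth axial positions; detected: matching booleans.
    Assumes a single-peaked recall curve; resolution is one bin width.
    """
    hits, totals = {}, {}
    for z, d in zip(z_nm, detected):
        b = int(z // bin_nm)  # floor division also bins negative z correctly
        totals[b] = totals.get(b, 0) + 1
        hits[b] = hits.get(b, 0) + (1 if d else 0)
    recall = {b: hits[b] / totals[b] for b in totals}
    half = 0.5 * max(recall.values())
    above = [b for b, r in recall.items() if r >= half]
    # Width from the outer edges of the bins at or above half maximum.
    return (max(above) - min(above) + 1) * bin_nm
```

For example, if every molecule within 500 nm of focus is found and none beyond, the function returns a Z-range of about 1000 nm.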
When we examined results for the low SNR, low density dataset (Fig 2B, 3B), we found an expected two-fold degradation in best-in-class RMSE (lateral RMSE 39 nm, axial RMSE 60 nm), due to the decrease in image SNR. However, the best-in-class software (SMolPhot [26]) Jaccard index was effectively constant between the low and high SNR datasets (86 % vs 85 %), although the Z-range did drop at lower SNR (930 nm vs 1120 nm). The best astigmatism software packages were thus remarkably good at finding spots at low SNR, even away from the plane of focus.

We analyzed how close software performance was to theoretical limits by calculating the Cramér-Rao lower bound (CRLB) as a function of axial position for each dataset and comparing it to the best-in-class software results (Fig S8, S9, Supplementary Note 4). Close to the focus, best-in-class software was close to CRLB performance (within 25 %), but significant deviations from the CRLB occurred more than 200 nm from focus. This could be due to the difficulty in actually detecting the spots away from focus.

When we examined astigmatic software performance for the challenging high spot density datasets (Fig 2B, 3), performance was reduced. For the high SNR, high spot density dataset (best software, SMolPhot), localization error increased and Jaccard index decreased significantly compared to the low density condition (lateral RMSE best HD 51 nm vs best LD 27 nm, axial RMSE best HD 66 nm vs best LD 29 nm, Jaccard index best HD 66 % vs best LD 85 %). Inspection of the super-resolved images (Fig S3) nevertheless shows acceptable results for the HD dataset, particularly in the lateral dimension. In many circumstances, the performance reduction at 10x higher spot density should be acceptable for 10x faster, potentially live-cell-compatible, imaging speed.
We also observed a large spread of software performance for the high density datasets, probably because a significant fraction of the software packages were primarily designed for low density conditions.

We observed poor performance for the most challenging low SNR, high spot density astigmatism dataset (Fig 2, 3, S4, best software SMolPhot). Best-in-class localization precision and Jaccard index decreased significantly (lateral RMSE 76 nm, axial RMSE 101 nm, Jaccard index 58 %). These data suggest that low SNR, high density 3D astigmatic localization microscopy entails a significant reduction in image resolution.

We next analyzed the performance of the double helix software (Fig 3B, S14A). In the high SNR, low spot density condition, double helix software showed more uniform performance than astigmatism software. Best-in-class software (SMAP-2018) showed only a limited improvement compared with average software (Fig 3B, lateral RMSE 27 nm best vs 37 nm average; axial RMSE 21 nm best vs 34 nm average; Jaccard index 77 % best vs 73 % average). In general, software localization performance was close to the CRLB (Fig S8, S9). We observed that performance of the software away from the focal plane is relatively uniform (Fig 4A, S7A), and best-in-class Z-range at high SNR was large at 1180 nm (Fig S7, Table S5). Double helix imaging may show less software-to-software variation and a larger Z-range at low spot density than astigmatic imaging because the PSF shape and intensity are fairly constant as a function of Z, in contrast to astigmatic imaging, where spot size, shape and intensity vary greatly as a function of Z (Fig S10).

Double helix software performance decreased significantly for the low spot density, low SNR condition (best software SMAP-2018), particularly in terms of best-in-class Jaccard index (66 % low SNR vs 77 % high SNR, Fig 3B, S4, S14A).
DH Jaccard index was also significantly worse than astigmatism results at either high or low SNR (85 % high SNR, 86 % low SNR). This indicates that it was quite hard to successfully find localizations in the low SNR DH dataset, likely because the large size of the DH PSF spreads emitted photons over a large area, lowering effective image SNR. DH PSF designs with reduced Z-range but more compact PSF would likely be less sensitive to this issue [22].

Double helix software performed poorly on the high spot density datasets at high SNR (best software CSpline [27]), especially in terms of the Jaccard index (Fig 3B, S14A, best lateral RMSE 67 nm, best axial RMSE 69 nm, best Jaccard index 46 %). The poor performance at high spot density is again probably because the large DH PSF size increases spot density and decreases SNR (Fig S10). DHPSF performance at high spot density and low SNR was also not reliable (Fig 3B, S14A, best software SMAP-2018).
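To give a feel for the CRLB comparisons used in this section, the sketch below computes the bound on lateral position for a deliberately simplified case: an in-focus 2D Gaussian spot sampled on a pixel grid, with Poisson noise and a uniform background. This is not the calculation used in the study, which is described in Supplementary Note 4; the Gaussian model, the default width and pixel size, and the assumption that photon count and background are known are all illustrative.

```python
import math


def crlb_lateral_nm(n_photons, bg_per_pixel, sigma_nm=130.0,
                    pixel_nm=100.0, half_width=7):
    """CRLB (nm) on x for a pixel-sampled 2D Gaussian spot with Poisson noise.

    The emitter sits at the centre of the central pixel; photon count,
    background and sigma are treated as known (only x and y are estimated).
    """
    ixx = ixy = iyy = 0.0
    for i in range(-half_width, half_width + 1):
        for j in range(-half_width, half_width + 1):
            x, y = i * pixel_nm, j * pixel_nm
            signal = (n_photons * pixel_nm ** 2 / (2 * math.pi * sigma_nm ** 2)
                      * math.exp(-(x * x + y * y) / (2 * sigma_nm ** 2)))
            mu = signal + bg_per_pixel            # expected counts in this pixel
            dmu_dx = signal * x / sigma_nm ** 2   # derivative w.r.t. emitter x
            dmu_dy = signal * y / sigma_nm ** 2   # derivative w.r.t. emitter y
            # Poisson Fisher information: sum over pixels of dmu_a * dmu_b / mu
            ixx += dmu_dx * dmu_dx / mu
            ixy += dmu_dx * dmu_dy / mu
            iyy += dmu_dy * dmu_dy / mu
    det = ixx * iyy - ixy * ixy
    return math.sqrt(iyy / det)  # square root of the xx element of I^-1
```

With zero background this reduces to the familiar sigma / sqrt(N) limit, and adding background raises the bound, which mirrors the lower effective SNR discussed for large PSFs.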

Biplane localization microscopy
Best-in-class biplane software (SMAP-2018), at low spot density and for both high and low SNR, delivered the best performance in any modality (high SNR: lateral RMSE 12.3 nm, axial RMSE 21.7 nm, Jaccard 87 %), despite a slightly decreased image SNR for the biplane simulations (Methods). We observed a significant spread in software performance in terms of lateral RMSE and Jaccard index, with the best-in-class software significantly outperforming the other competitors (Fig S14B, 2D). At low spot density, best-in-class biplane software (SMAP-2018) showed good performance as a function of Z, with high Jaccard index over almost the entire Z-range of the simulations, and with a Z-range of 1200 nm at high SNR (Fig S7, Table S5). The axial RMSE was relatively uniform as a function of Z and close to the CRLB limit (Fig S7). As axial and lateral RMSE are both averaged over the entire Z-range, the strong biplane results arise from good performance across a large Z-range (Fig S7).

At high spot density and high SNR, best-in-class biplane software (SMAP-2018) showed acceptable super-resolved performance (Fig 3B, S3, S14B, best lateral RMSE 43 nm, best axial RMSE 49 nm, best Jaccard index 61 %). Uniquely among the 3D modalities, best-in-class biplane software also gave acceptable performance at high spot density and low SNR (Fig 3B, S3, S14B, best lateral RMSE 55 nm, best axial RMSE 72 nm, best Jaccard index 61 %, best software SMAP-2018).

Performance of 2D software
Alongside the 3D challenge, we ran a second edition of the 2D localization microscopy software challenge [7] to assess how the latest 2D software performed on more challenging, more realistic datasets, and to provide an assessment of how the field had progressed since the last challenge.
We used the new simulation software, including an experimentally derived PSF and a realistic blinking model, and also simulated a very high spot density condition (5 molecules/μm²). We created a more spatially extended test structure, "pseudo-endoplasmic reticulum" (pseudo-ER), composed of 150 nm diameter hollow tubes, to avoid artefacts due to 1D simulated structures [28]. We generated two different imaging conditions with overall similar SNR but different brightness properties: one with low fluorophore brightness and low autofluorescence (the low SNR condition for the 3D challenge, designed to simulate fluorescent protein based SMLM, Fig S5) and one with high fluorophore brightness and high autofluorescence (to simulate affinity-dye-based live cell SMLM, Fig S6). We used lateral RMSE, Jaccard index and overall lateral efficiency to rank the 2D software (Fig 2, S2, Table S1).

For the pseudo-ER dataset, at low density, best-in-class software (ADCG) performed well (Fig S2, S5), with a Jaccard index of 90 % and lateral RMSE of 31 nm, substantially better than the class average (Jaccard index 72 %, lateral RMSE 36 nm). Low density results for the dimmer fluorophore microtubules dataset were similar to the brighter pseudo-ER dataset (Fig S2, best software SMolPhot).

For the very high density 2D dataset, which had 25x higher spot density than the LD dataset, best-in-class software (ADCG) showed excellent performance, with Jaccard index of 75 % and lateral RMSE of 45.5 nm (Fig S2). Best-in-class performance (ADCG) on the dimmer fluorophore data at high spot density was also strong (Fig S2, best Jaccard index 70 %, best lateral RMSE 51 nm).

Algorithms
We identified several classes of algorithm among the participant software (Table S1):

1) Non-iterative software tends to regroup the pixels in the local neighborhood of the candidates, using methods such as interpolation, center of mass (QuickPALM [29]) or template matching (WTM [30]). These (often older) algorithms are fast but tend to achieve poor performance (Table S1).

2) Single emitter fitting software is usually built on a multi-step strategy of detection, spot localization, and optional spot rejection. The detection step finds bright spots in noisy images on the pixel grid. The selection of candidates is usually performed by local maximum search after a denoising filter. Others rely on more complex algorithms like the wavelet transform (e.g., WaveTracer [31]). We did not observe software ranking to depend significantly on the choice of optimization scheme: least-square, weighted least-square or maximum-likelihood estimator (Table S1).

3) Multi-emitter fitting software groups clusters of overlapping spots, and simultaneously fits multiple model PSFs to the data. Typically, fitted spots are added to the cluster until a stopping condition is met [4,5]. This leads to improved localization performance at high spot density, at the cost of reduced speed. This class of software (e.g., 3D-DAOSTORM [14], CSpline, PeakFit [32], ThunderSTORM [33]) was amongst the top performers in each 2D and 3D competition category (Table S1).

As expected, single- and multiple-emitter fitting methods both performed well on low density data (Table S1); apparently, at the densities studied, exclusion of occasionally overlapping spots by single-emitter software is sufficient for strong performance, and explicit multi-emitter fitting is not required. For the 2D challenge, multi-emitter fitting showed a clear advantage over single emitter fitting at high density (Table S1). Surprisingly, however, well-tuned single-emitter fitting algorithms (SMolPhot, SMAP-2018) outperformed multi-emitter algorithms for the 3D high density conditions.

4) Compressed sensing algorithms. One subset of these algorithms utilizes deconvolution with sparsity constraints to reconstruct super-resolved images [34-36]. Although deconvolution approaches can give good results, they are limited by the necessary use of a sub-pixel grid; increased localization precision requires smaller grid resolution, which must be balanced against increased computational time. Recent approaches address this issue by localizing the point sources in a grid-less manner using an alternating descent conditional gradient scheme under some sparsity constraint (ADCG [37], SMfit, SOLAR_STORM, TVSTORM [38]). This software class consistently gave the overall best performance for 2D high-density (ADCG [37] 1st, FALCON [36] 2nd, SMfit 3rd).

5) Other approaches. Of the alternative algorithmic approaches used (Table S1), the annihilating filter-based method LEAP [39] gave good performance for biplane imaging (the only modality for which it was entered). Recently, we received the first challenge submission from a deep learning SMLM software (DECODE); these promising preliminary results are available on the competition website.
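The simplest steps named in classes 1 and 2 above, candidate detection by local maximum search after a denoising filter followed by a non-iterative center-of-mass estimate, can be sketched as below. This is a toy illustration in plain Python and is not the implementation of any participant software; the 3x3 box filter, the threshold, the window size and all function names are assumptions.

```python
def smooth3x3(img):
    """3x3 box filter (edge pixels average over the available neighbours)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [img[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = sum(vals) / len(vals)
    return out


def detect_candidates(img, threshold):
    """Interior pixels above threshold that are local maxima after smoothing."""
    sm = smooth3x3(img)
    h, w = len(sm), len(sm[0])
    found = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            v = sm[r][c]
            if v > threshold and all(v >= sm[rr][cc]
                                     for rr in (r - 1, r, r + 1)
                                     for cc in (c - 1, c, c + 1)):
                found.append((r, c))
    return found


def center_of_mass(img, r, c, background=0.0, half=2):
    """Intensity-weighted centroid (row, col) in a (2*half+1)^2 window."""
    total = sr = sc = 0.0
    for rr in range(max(0, r - half), min(len(img), r + half + 1)):
        for cc in range(max(0, c - half), min(len(img[0]), c + half + 1)):
            wgt = max(img[rr][cc] - background, 0.0)
            total += wgt
            sr += wgt * rr
            sc += wgt * cc
    return sr / total, sc / total
```

Note that the centroid is biased towards the window centre when the spot is truncated by the window, one reason why such non-iterative estimators tend to localize less precisely than model fitting.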

Post-hoc temporal grouping
Because molecule on-time is stochastically distributed across multiple frames, a common post-processing approach to improve localization precision is to group molecules detected multiple times in adjacent frames, and average their position [40] (Supplementary Note 3). Temporal grouping was used by the top performers (including SMolPhot, MIATool [41] and SMAP-2018), and is visibly apparent as a more punctate super-resolved image (Fig 4A).
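A minimal sketch of temporal grouping is given below: detections in consecutive frames that lie within a distance threshold of a running group mean are merged and their positions averaged. The greedy nearest-group assignment, the 50 nm threshold, the one-frame gap limit and the function name are illustrative assumptions, not the procedure of Supplementary Note 3 or of any participant software.

```python
def group_localizations(locs, max_dist_nm=50.0, max_gap_frames=1):
    """Merge detections of one emitter seen in adjacent frames.

    locs: iterable of (frame, x_nm, y_nm, z_nm) tuples.
    Returns one (first_frame, x, y, z, n_merged) tuple per group, with the
    position averaged over the merged detections.
    """
    tracks = []  # each track: first/last frame, coordinate sums, count
    for frame, x, y, z in sorted(locs):
        best = None
        for t in tracks:
            gap = frame - t["last"]
            if not 1 <= gap <= max_gap_frames:
                continue  # same frame or too far apart in time
            mx, my, mz = (s / t["n"] for s in t["sum"])
            d = ((x - mx) ** 2 + (y - my) ** 2 + (z - mz) ** 2) ** 0.5
            if d <= max_dist_nm and (best is None or d < best[0]):
                best = (d, t)
        if best is None:
            tracks.append({"first": frame, "last": frame,
                           "sum": [x, y, z], "n": 1})
        else:
            t = best[1]
            t["last"] = frame
            t["n"] += 1
            t["sum"] = [t["sum"][0] + x, t["sum"][1] + y, t["sum"][2] + z]
    return [(t["first"], *(s / t["n"] for s in t["sum"]), t["n"])
            for t in tracks]
```

Averaging n detections of the same molecule reduces the random localization error by roughly a factor of sqrt(n), which is why grouped reconstructions appear more punctate.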

Choice of PSF model
Most software used a variant of the Gaussian PSF model. A few participants designed more accurate PSF models (Table S1). Either diffraction theory was used (MIATool [41], LEAP [39]) or spline fitting of an analytical function to the experimental PSF was adopted (CSpline, SMAP-2018). Although simple Gaussian model PSFs were sufficient to obtain best-in-class performance for the 2D and astigmatic modalities (ADCG [37], PeakFit, SMolPhot), top results for the more optically complex biplane and double helix modalities were exclusively PSF-modelling algorithms (SMAP-2018, CSpline, MIATool, LEAP).
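As an illustration of what a simple Gaussian model for the astigmatic modality can look like, the sketch below uses a commonly seen simplified parameterization in which the x and y widths follow defocus curves with axially displaced foci. It is a generic example only: the parameter values and function names are hypothetical, and it is not claimed to be the model used by any particular participant.

```python
import math


def astigmatic_widths(z_nm, w0_nm=150.0, focal_offset_nm=300.0, depth_nm=400.0):
    """Gaussian widths (wx, wy) of an astigmatic spot at axial position z.

    Each axis follows w(z) = w0 * sqrt(1 + ((z - c) / d)^2), with the
    x focus at c = +offset and the y focus at c = -offset.
    """
    wx = w0_nm * math.sqrt(1.0 + ((z_nm - focal_offset_nm) / depth_nm) ** 2)
    wy = w0_nm * math.sqrt(1.0 + ((z_nm + focal_offset_nm) / depth_nm) ** 2)
    return wx, wy


def gaussian_psf(x_nm, y_nm, z_nm, photons=1.0):
    """Photon density (per nm^2) of the elliptical Gaussian model at (x, y)."""
    wx, wy = astigmatic_widths(z_nm)
    norm = photons / (2.0 * math.pi * wx * wy)
    return norm * math.exp(-0.5 * ((x_nm / wx) ** 2 + (y_nm / wy) ** 2))
```

In such a model the axial position is encoded in the ellipticity of the spot: the widths are equal midway between the two foci and diverge in opposite directions on either side, while the peak intensity falls as the spot grows.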

Multi-algorithm packages
Several software packages take a Swiss army knife approach of integrating multiple optional localization algorithms into one program, to be flexible enough to suit various experimental conditions [20,33]. SMAP-2018 and ThunderSTORM achieved strong across-the-board performance, supporting this rationale.

Software run time
Software run time is an important parameter for ease of use, and to facilitate real time analysis. We did not see any correlation between software localization performance (Efficiency) and software run time (Fig S24A). We thus created an alternative ranking metric, Efficiency-Runtime, which gave 25 % weighting to run time (Supplementary Note 1.7, Fig S24B). Many good performers in the efficiency-only ranking were relatively fast and thus retained good ranking (SMAP-2018, SMolPhot, 3D-DAOSTORM). Interestingly, two software packages highly optimized for speed gained top ranking in this analysis: pSMLM-3D [42] and QC-STORM.

Diagnostic tools for software and algorithm performance
During our analysis, we frequently noticed common types of deviation between software results and ground truth which were easily diagnosed by visual inspection of the super-resolved comparison overlay of ground truth and observed localizations (Fig S19-20). This included not only obvious issues of poor localization precision or spot averaging at high density, but also other problems such as a common error of structural warping which significantly reduced software performance. On the competition website, we provide detailed diagnostic software reports including multiple examples of software performance on individual frames, which should help developers to identify algorithm and software limitations and maximize software performance (Fig S21-22).

Assessment of software performance on real 3D STORM data
We investigated the performance of a representative subset of astigmatism software on real STORM datasets of well-characterized test structures, microtubules and the nuclear pore complex (NPC) (Fig 4B, S15). This qualitative assessment was consistent with findings for simulated data. No performance difference between single and multi-emitter fitters was observed, which is not surprising since spot density in these datasets was low. Relatively poor software performance was immediately obvious as the top and bottom of the NPC (Fig 4B) or the hollow core of antibody-labelled microtubules (Fig S15).

DISCUSSION
We performed the first broad evaluation of software for 3D single molecule localization microscopy, to assess the state of the field and to allow non-specialists to determine the optimal software for their experiments.

In order to provide a realistic assessment of 3D software performance we tested software on simulations incorporating experimentally acquired microscope point spread functions. Our experimental-PSF-derived simulation approach is readily adaptable to novel engineered 3D SMLM PSFs [43] or to the PSF of individual microscopes. For instance, it would be possible to combine our derived-PSF approach with the SMLM sample simulation tool SuReSim [44] in order to generate ultra-realistic synthetic data, which could then be personalized to each experimentalist's sample and microscope, to easily determine the factors limiting maximal resolution for a given experiment.

The strongest conclusion we draw from the 3D localization microscopy challenge is that choice of localization software greatly affects the quality of final super-resolution data, even at "easy" high SNR, low spot density conditions. Biplane performance was particularly dependent on software choice, with only one software package (SMAP-2018) achieving near-Cramér-Rao lower bound performance. Double helix SMLM showed less sensitivity to choice of software than biplane, with astigmatic SMLM intermediate between the two. The best software in each modality performed close to the Cramér-Rao lower bounds over a wide focal range and successfully detected most molecules, even at low signal to noise. Average software in all three modalities was significantly worse, with the obtained axial resolution being particularly sensitive to software choice.

The second major conclusion of the 3D challenge is that localization software that explicitly includes the experimental PSF in the fitting model gives a significant performance increase for 3D SMLM.
For the more optically complex biplane and double helix modalities in particular, the best results were exclusively from software using PSF modelling approaches (SMAP-2018, CSpline, MIATool). This result not only highlights the need for experimental PSF modelling in SMLM software, but also emphasizes the high degree of experimental realism required of SMLM simulations. The clear performance advantage of experimental PSF modelling software in the 3D software challenge would have been unobservable had it been run with a simple Gaussian PSF.

Of the different algorithm classes, well-tuned single-emitter and multi-emitter fitting algorithms (each capable of dealing well with occasional molecule overlap) gave good results for low density 3D SMLM. We also found that several software packages for astigmatic or biplane imaging gave adequate performance for the challenging case of high molecule densities, as long as the image SNR was high. Current software packages gave poor performance when molecule density was high and image SNR was low. These results suggest that, at least with current algorithms, high density 3D SMLM performance is mediocre at high SNR, and poor at low SNR. Surprisingly, multi-emitter fitting did not show significant improvement over well-tuned single emitter fitting for the 3D high-density datasets; this may indicate that significant potential for improvement remains in this category.

Many software packages did not apply temporal grouping [40], resulting in reduced software performance. Since temporal grouping is a simple step for maximum precision, we urge all software developers to integrate this approach into their software as an optional final step in the localization process.

The second 2D localization microscopy challenge provided the opportunity to reassess the state of the field.
The performance of best-in-class 2D software over a range of conditions, at both high and low spot density, is excellent. The performance of the best-in-class software at high spot density (ADCG [37]) was only moderately decreased compared with the low spot density results, with nearly identical molecule detection performance and a 30% increase in localization error. Interestingly, the top three performers in the 2D high density condition were all compressed sensing algorithms (ADCG [37], FALCON [36], SMfit). In low density 2D conditions, the best single-emitter, multi-emitter and compressed sensing algorithms all gave comparable, excellent performance. We speculate that performance in this category may now be near optimal levels.

In future we plan to extend the SMLM challenge website and software into an open platform where the assessment process is fully automated, and where new competition simulations and assessment metrics can easily be created and contributed by the community. Scientific CMOS cameras are rapidly becoming a major platform for single molecule localization microscopy [6], and it will be important to include sCMOS simulations in future SMLM software assessments. Furthermore, there remain two important classes of super-resolution microscopy for which software performance is crucial, but for which no broad software assessment has yet been performed: fluorescence-fluctuation-based super-resolution microscopies (e.g., 3B [45], SOFI [46], SRRF [47]) and structured illumination microscopy [48].

The results of this competition clearly demonstrate the formidable algorithmic performance of the best 2D and 3D localization microscopy software.
However, a key outstanding challenge that often hinders adoption of new algorithms is that only a small subset of algorithms are packaged in, or compatible with, fast, well-maintained, user-friendly software packages that include all stages of the SMLM data analysis pipeline: analysis, visualization and quantification. One solution would be for the SMLM software community to collectively adopt both a standard data format and a single software platform for future software development, such as FIJI/ImageJ [49]. Any new algorithm released in this environment could be immediately and widely adopted by users, and easily integrated into existing packages for SMLM analysis, visualization and quantification.

Both the 3D and 2D localization challenges remain open and continuously updated on the competition website. This continuously evolving analysis of state-of-the-art super-resolution software performance provides a valuable resource to super-resolution microscopists, helping to ensure they use software that gets the best out of hard-won data. It also provides SMLM software developers with a robust means of benchmarking new algorithms against the current state of the art.
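To make the notion of benchmarking against ground truth concrete, the sketch below scores one simulated frame by pairing estimated and true molecule positions and reporting a detection score and a localization error. It is a minimal sketch under stated assumptions: the Jaccard index and root-mean-square error are used here as representative metrics, and the function name `assess_frame`, the 250 nm matching tolerance and the greedy closest-pair matching are illustrative choices, not a reproduction of the competition's assessment code.

```python
import numpy as np


def assess_frame(truth_xy, est_xy, tol_nm=250.0):
    """Score estimated 2D positions (nm) against ground truth for one frame.

    Pairs are matched greedily from closest to farthest, within tol_nm.
    Greedy matching is a simplification; an optimal assignment could be
    substituted. All names and defaults are illustrative assumptions.
    """
    truth_xy = np.asarray(truth_xy, dtype=float).reshape(-1, 2)
    est_xy = np.asarray(est_xy, dtype=float).reshape(-1, 2)

    # Pairwise lateral distances, shape (n_truth, n_est).
    dist = np.linalg.norm(truth_xy[:, None, :] - est_xy[None, :, :], axis=2)

    truth_used = np.zeros(len(truth_xy), dtype=bool)
    est_used = np.zeros(len(est_xy), dtype=bool)
    matched = []
    for flat in np.argsort(dist, axis=None):
        t, e = np.unravel_index(flat, dist.shape)
        if dist[t, e] > tol_nm:
            break  # all remaining candidate pairs are farther apart
        if not truth_used[t] and not est_used[e]:
            truth_used[t] = est_used[e] = True
            matched.append(dist[t, e])

    tp = len(matched)             # true positives: matched pairs
    fp = len(est_xy) - tp         # false positives: unmatched estimates
    fn = len(truth_xy) - tp       # false negatives: missed molecules
    total = tp + fp + fn
    jaccard = tp / total if total else float("nan")
    rmse = float(np.sqrt(np.mean(np.square(matched)))) if tp else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "jaccard": jaccard, "rmse_nm": rmse}
```

In this sketch the Jaccard index rewards detecting molecules without introducing false positives, while the error term measures precision only over matched pairs, so the two numbers should be read together. A full assessment would repeat the matching frame by frame and, for 3D data, add a separate axial error term.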