Universal inverse modelling of point spread functions for SMLM localization and microscope characterization

The point spread function (PSF) of a microscope describes the image of a point emitter. An accurate PSF model is essential for various imaging tasks, including single molecule localization, aberration correction and deconvolution. Here we present uiPSF (universal inverse modelling of Point Spread Functions), a toolbox to infer accurate PSF models from microscopy data, using either image stacks of fluorescent beads or directly the images of blinking fluorophores that constitute the raw data in single molecule localization microscopy (SMLM). The resulting PSF model enables accurate 3D super-resolution imaging using SMLM. Additionally, uiPSF can be used to characterize and optimize a microscope system by quantifying its aberrations, including field-dependent aberrations, and its resolution. Our modular framework is applicable to a variety of microscope modalities, and the PSF model can incorporate system- or sample-specific characteristics, e.g., the bead size, depth-dependent aberrations and transformations among channels. We demonstrate its application to single-channel, multi-channel and large field-of-view SMLM systems, 4Pi-SMLM, and lattice light-sheet microscopes, using either bead data or single molecule blinking data.


SI Fig 1. Comparison of different PSF modelling methods for estimating the PSF model of a single-channel system from experimental bead data. Bead data were collected by imaging 40 nm red beads at z positions from -1000 nm to 1000 nm with a step size of 50 nm. Localization was performed on the same bead data used for inverse modelling. In the last row, the pupil magnitude was modelled only by Zernike polynomials that are circularly symmetric.

SI Fig 2. Comparison of different PSF modelling methods for estimating the PSF model of a ratiometric dual-color system from experimental bead data. Bead data were collected by imaging 40 nm red beads at z positions from -1000 nm to 1000 nm with a step size of 50 nm. Localization was performed on the same bead data used for inverse modelling. In the last row, the pupil magnitude was modelled only by Zernike polynomials that are circularly symmetric. (a) Test on simulated data. The data were simulated at z positions from -500 nm to 500 nm with a step size of 50 nm. An astigmatism aberration of 0.5 and a bead size of 50 nm were used in the simulation. (b) Test on experimental data. 40 nm bead data were collected at z positions from -500 nm to 500 nm with a step size of 50 nm. The cross-correlation based method was described previously in Li et al.1 and its implementation in SMAP was used.

SI Fig 12. Comparison of Zernike-based PSF modelling using uiPSF and Zola 3D. 40 nm bead data were collected at z positions from -500 nm to 500 nm with a step size of 50 nm. Here we show the localization performance on the bead data used for PSF modelling (56 bead stacks). For uiPSF, the vectorial PSF modelling method was used and the PSF model was estimated from 82 bead stacks. For Zola 3D, the ImageJ plugin was used and the PSF model was estimated from a single bead stack (which performs better than using multiple beads). Here we tested three PSF models generated from three different beads. The results show that the localization bias of Zola 3D is bead dependent.

Supplementary Notes
In the supplementary notes we use 'learning' to mean inverse modelling, not machine learning or deep learning, as in our method the PSF models are based on physics models. The following conventions for symbols are used: bold for a vector, regular for a scalar or the length of a vector; the symbol i in a subscript denotes an index and in an exponent denotes the imaginary unit.

Data preprocessing
Data preprocessing includes conversion of the raw camera pixel values to photon counts, segmentation and negative value correction. The pixel value from the camera output has a unit of analog-to-digital units (ADU); conversion from ADU to photon counts is obtained by μ = (ADU_raw − o)/g, where o is a constant offset value that can be estimated from the mean of the pixel values when no photons are collected, and g is a gain factor quantified through gain calibration2. Here we ignore the pixel-dependent offset and gain that might exist for sCMOS cameras3 and use a single-valued offset and gain for data collected from both sCMOS and EMCCD cameras. We also found that when gain calibration is not accessible, setting g = 1 does not affect the inverse modelling results.
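The ADU-to-photon conversion above can be sketched as follows (function and variable names are our own illustration, not from the uiPSF codebase):

```python
import numpy as np

def adu_to_photons(raw, offset, gain):
    """Convert camera pixel values from ADU to photon counts: mu = (ADU_raw - o) / g."""
    return (np.asarray(raw, dtype=float) - offset) / gain

# Toy frame with offset o = 100 ADU and gain g = 2 ADU/photon (illustrative values)
frame = np.array([[110.0, 210.0], [160.0, 100.0]])
photons = adu_to_photons(frame, offset=100.0, gain=2.0)
```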

Segmentation for single-channel bead data
Segmentation selects and crops the candidate bead images from the raw data. For each data stack, a maximum projection is applied along the z dimension to generate a high-SNR image. The resulting image is then smoothed by a 2D Gaussian filter with a standard deviation of e.g. 2 (user defined) pixels. The local maxima within a region of e.g. 3 × 3 (user defined) pixels are found by applying a maximum filter with a kernel size of 3 × 3 pixels to the smoothed image. The local maxima with a peak value higher than a set threshold are selected as the coordinates of the candidate bead images. We used a peak threshold of e.g. 20% (user defined) of the maximum value of the smoothed image. A bead image stack of a given subregion size was cropped around each selected coordinate. We used a subregion size of 20-30 pixels along each dimension.
To prevent estimation of negative background values during PSF learning, we subtracted from the cropped bead stacks an offset estimated as the 0.001 quantile of all the pixel values of all bead stacks.
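The segmentation steps above (maximum projection, Gaussian smoothing, maximum filter, relative threshold) can be sketched with scipy.ndimage; the function name and the synthetic stack are our own illustration:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def find_bead_candidates(stack, sigma=2.0, kernel=3, rel_threshold=0.2):
    """Select candidate bead coordinates from a z-stack (z, y, x):
    maximum projection along z -> 2D Gaussian smoothing -> local maxima
    via a maximum filter -> threshold relative to the image maximum."""
    mip = stack.max(axis=0)                      # maximum projection along z
    smoothed = gaussian_filter(mip, sigma)       # suppress pixel noise
    local_max = maximum_filter(smoothed, size=kernel) == smoothed
    above = smoothed > rel_threshold * smoothed.max()
    ys, xs = np.nonzero(local_max & above)
    return list(zip(ys, xs))

# Synthetic stack with one bright bead at (y, x) = (10, 10)
stack = np.zeros((5, 21, 21))
stack[2, 10, 10] = 1000.0
peaks = find_bead_candidates(stack)
```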

Segmentation for multi-channel bead data
For multi-channel data, the candidate bead coordinates for each channel are first selected independently as described above. Then a coordinate pairing procedure is applied to ensure that the coordinates from all channels associated with the same bead are paired together. The pairing procedure consists of two steps. First, find the lateral shift between channels: 1) set channel 1 as the reference channel; 2) calculate the channel shift s_c from the average of the coordinate positions P_c (each a vector of two integers) of the current channel c and the reference channel c = 1, s_c = avg(P_c) − avg(P_1); 3) calculate the pairwise distances between P_c and P_1 + s_c; 4) remove coordinates with a pairwise distance larger than a set threshold; 5) repeat steps 2-4 five times and save the final channel shift; 6) repeat steps 2-5 for each channel. Second, pair the coordinates: 1) for each coordinate in the current channel, find its closest coordinate in the reference channel corrected by the channel shift, and pair the two coordinates if their distance is less than 5 pixels; 2) remove unpaired coordinates from all paired channels; 3) repeat steps 1-2 for all channels.

Segmentation for SMLM data
Segmentation for in situ PSF learning selects and crops the candidate emitters from the raw SMLM data. For each frame, we first subtract a background image, which is estimated as the average of the 100 frames after the current frame. This step removes non-blinking emitters or large aggregates that can last for a long time, so that the remaining emitters are mostly single fluorophores. Then a difference-of-Gaussians filter (i.e. a bandpass filter) is applied to the background-subtracted image I_raw, I_filter = I_raw(x, y) ⊗ G(x, y, 0.75σ_x, 0.75σ_y) − I_raw(x, y) ⊗ G(x, y, σ_x, σ_y), where G represents a 2D Gaussian function with standard deviations defined by σ_x, σ_y, which are user-defined parameters. The resulting smoothed image is then filtered by a maximum filter of a set kernel size and the candidate emitters are selected following the same method as for bead data. Usually, 10,000 to 50,000 emitters are selected for one dataset. In contrast to bead data, here each selected emitter is a 2D image instead of a 3D stack. The axial position of each single-molecule emitter is unknown, while for bead data the axial step size within the 3D stack is known. This additional unknown parameter for each emitter complicates the in situ PSF learning.
For SMLM data from multi-channel systems, a similar coordinate pairing process as described above is applied. However, the method for finding the channel shift is image-based instead of coordinate-based: 1) set channel 1 as the reference channel; 2) calculate the maximum intensity projection (MIP) image along the frame dimension for each channel; 3) calculate the channel shift from the cross-correlation (CC) between the MIP images of the current channel and the reference channel, where the CC image is set to have the same dimension as the MIP image; 4) the channel shift is the center of the MIP image minus the coordinate of the maximum pixel value of the CC image; 5) repeat steps 3-4 for all channels.
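The image-based channel-shift estimate can be sketched as an FFT cross-correlation of the two MIP images; the sign convention follows the description above (image centre minus CC peak):

```python
import numpy as np

def channel_shift(mip_ref, mip_target):
    """Estimate the shift of the target channel relative to the reference
    from the FFT cross-correlation of their maximum intensity projection
    (MIP) images; the CC image has the same size as the MIP, and the shift
    is the image centre minus the coordinate of the CC peak."""
    cc = np.fft.fftshift(np.real(np.fft.ifft2(
        np.fft.fft2(mip_ref) * np.conj(np.fft.fft2(mip_target)))))
    peak = np.array(np.unravel_index(np.argmax(cc), cc.shape))
    center = np.array(cc.shape) // 2
    return center - peak

# A target channel that is the reference circularly shifted by (3, -2) pixels
rng = np.random.default_rng(0)
ref = rng.random((32, 32))
target = np.roll(ref, (3, -2), axis=(0, 1))
shift = channel_shift(ref, target)
```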

Calculation of initial values
1) The initial position for each bead is estimated from the bead data. 2) The background value for each bead is estimated as b_i = min[(I_i ⊗ G)(x, y, z)], where I_i(x, y, z) is the cropped 3D image stack of bead i, G is a 3D Gaussian kernel with standard deviations in x, y and z equal to σ_x, σ_y and σ_z pixels respectively, ⊗ denotes convolution, and the minimum is taken over all voxels of the cropped and filtered region. The initial background for each bead is set to the median value of b_i from all bead data. 3) The initial photon value for each bead is estimated from the background-subtracted bead data. 4) The initial PSF model is a 3D array with each element set to 1/R², with R the ROI size. For example, if the ROI size is R × R = 21 × 21 pixels, the initial value is 0.0023. This assumes that the sum of each axial slice of the PSF model is one. In practice, for most data R is around 15 to 31, and we set the initial value to 0.002 for simplicity.
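A minimal sketch of the background and photon initial values; the assumption that the photon estimate is the summed background-subtracted signal is ours (the original equation is not reproduced here):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def initial_values(bead_stack, sigmas=(2.0, 2.0, 2.0)):
    """Initial background = minimum of the Gaussian-filtered stack; the
    initial photon count is assumed here to be the summed
    background-subtracted signal (our assumption)."""
    filtered = gaussian_filter(np.asarray(bead_stack, dtype=float), sigmas)
    bg = float(filtered.min())
    photons = float(np.clip(bead_stack - bg, 0.0, None).sum())
    return bg, photons

# A flat stack has background 7 and (essentially) zero photons
bg, photons = initial_values(np.full((5, 9, 9), 7.0))
```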

Calculation of the forward model
The PSF model at a given bead location (x_i, y_i, z_i) can be calculated as μ_shift = ℱ3⁻¹[Î · e^(−i2π(k_x·x_i + k_y·y_i + k_z·z_i))], where Î is the OTF, equal to the 3D Fourier transform of the PSF model I(x, y, z), Î = ℱ3[I(x, y, z)]. (2.6) The shifting phase is the exponential term e^(−i2π(k_x·x_i + k_y·y_i + k_z·z_i)), where k_x, k_y, k_z are the Cartesian coordinates of the frequency space, sampled at intervals 1/N_x, 1/N_y, 1/N_z, where N_x, N_y and N_z are the lengths of the PSF model in the x, y and z directions. Considering the bead size, we modified the above equation to μ_blur = ℱ3⁻¹[Î · e^(−i2π(k_x·x_i + k_y·y_i + k_z·z_i)) · S(k)]. (2.9) Here we model the bead as a sphere and S(k) is the analytical Fourier transform of a solid sphere of radius r_0, which can be expressed in terms of J, the Bessel function of the first kind, and k, the radial coordinate in the frequency space, k = √(k_x² + k_y² + k_z²). To ensure that the summation of μ_blur is close to that of μ_shift, S(k) is normalized by its maximum value.
Considering the photon count I_i and background b_i, the final forward model for each bead stack is μ_i = I_i · μ_blur + b_i. (2.12)
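The Fourier-shift part of the forward model (without the bead-size blur) can be sketched as follows; the function name is our own:

```python
import numpy as np

def fourier_shift_3d(psf, dx, dy, dz):
    """Shift a 3D PSF model (z, y, x) by sub-pixel amounts using the
    Fourier shift theorem: multiply the OTF by a linear phase ramp."""
    nz, ny, nx = psf.shape
    kz = np.fft.fftfreq(nz)[:, None, None]
    ky = np.fft.fftfreq(ny)[None, :, None]
    kx = np.fft.fftfreq(nx)[None, None, :]
    otf = np.fft.fftn(psf)                 # OTF = 3D Fourier transform of the PSF
    phase = np.exp(-2j * np.pi * (kx * dx + ky * dy + kz * dz))
    return np.real(np.fft.ifftn(otf * phase))

# An impulse at (z, y, x) = (4, 5, 6) shifted by (dz, dy, dx) = (1, 2, 3)
psf = np.zeros((16, 16, 16))
psf[4, 5, 6] = 1.0
shifted = fourier_shift_3d(psf, dx=3, dy=2, dz=1)
```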

Variable scaling
We used the L-BFGS-B method provided by the optimize package in SciPy for optimization. During optimization, the gradients of the loss function with respect to the variables are calculated at each iteration. However, the gradient values are at different scales for different types of variables. To ensure all variables can be updated at comparable rates during each iteration, we scaled each type of variable by a given factor and replaced the original variables in the forward model through variable substitution, for example I = s_I·I′ and b = s_b·b′, where I′ and b′ are scaled variables, with respect to which the gradients are calculated, and s_PSF, s_I and s_b are the scaling factors for the PSF model, intensity and background (see SI Tables for settings of the scaling factors for different PSF models). Variable scaling (or parameter scaling) is critical in PSF learning; it helps the optimization reach the global minimum and ensures that all variables can be optimized.
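A toy illustration of variable substitution with SciPy's L-BFGS-B; the scaling factors below are made up for the example, not the ones from the SI Tables:

```python
import numpy as np
from scipy.optimize import minimize

# Toy forward model mu = I * psf + b with a bright intensity and a small
# background: their gradients differ by orders of magnitude, so we optimize
# scaled variables I' = I / s_I and b' = b / s_b.
s_I, s_b = 1e4, 1e1                # illustrative scaling factors
psf = np.array([0.1, 0.8, 0.1])
data = 5e4 * psf + 20.0            # ground truth: I = 5e4 photons, b = 20

def loss(theta):
    I, b = theta[0] * s_I, theta[1] * s_b   # variable substitution
    return np.sum((I * psf + b - data) ** 2)

res = minimize(loss, x0=[1.0, 1.0], method="L-BFGS-B")
I_hat, b_hat = res.x[0] * s_I, res.x[1] * s_b
```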

Calculation of the Loss function
Our loss function for voxel-based learning is constructed based on the mean square error (MSE) between the forward model and the bead data. To let the variables satisfy additional constraints, several terms were added to the loss function. The final loss function is the weighted sum L = Σ w_x·L_x, where the w_x are the weighting factors for each loss term. To gradually increase the importance of some constraints during the iterations, an additional factor β is multiplied, which increases every iteration as β_{t+1} = 1.1·β_t, β_0 = 1.
Here we explain every term explicitly. We used two MSE terms, L_1 and L_2, where N_z is the number of slices in z for each bead stack and the z index runs from 2 to N_z − 1. From simulated data, we found that L_1 induces a more accurate estimation of the emitter's photon count, while L_2 puts less emphasis on bright bead data (which are often caused by bead aggregates) and results in better position estimation. Here we set w_1 = 1 and w_2 = 200, so that the two MSE terms have similar values.
To reduce the effect of data noise and to obtain a smooth PSF model, we calculated the first difference of the PSF model along z and defined the smoothness term L_sm, where Δz = 1 pixel (equal to 1 frame of the bead stack) and w_sm = 1.
As the PSF model is a finite 3D array, shifting the PSF model in x, y and z creates abrupt value changes at the edges, especially along the axial dimension. Therefore, in the calculation of L_1 and L_2, we ignored the first and the last axial slices. We then added a loss term L_edge that constrains those edge values of the PSF model, i.e. the border values along the axial direction, to their neighboring values along z. We note that although we only ignore the edge slices at z = 1 and z = N_z in the MSE calculations, our method can estimate axial shifts >1 pixel. Here we set w_edge = 0.01.
When sample drift is considered (see section 2.1.5), L_drift is equal to the L1 norm of all drift rates over the bead data; this term constrains the estimated drift rates to be close to zero, to avoid adding an arbitrary constant drift rate.
The next three terms serve to constrain the values of the PSF model, photon count and background to be positive (2.18), where min denotes the element-wise minimum comparing each value in an array with zero. As default we set w_Umin = 1, w_smin = 1 and w_bmin = 1.
The last term L_norm in the loss function constrains the sum of each axial slice of the PSF model to a constant value, because the total photon count of a single emitter should remain constant at different axial positions due to energy conservation. This term is often used together with the estimation of z-dependent photon fluctuations (see section 2.1.5) to account for photo-bleaching and fluctuations of the illumination intensity during data collection. Normally, we set w_norm = 0.

Optional learning variables
Instabilities of the system, such as drift, illumination fluctuations and photo-bleaching, increase the variations in the PSF model. To address this issue, our learning method provides optional variables that partially incorporate these system instabilities.
One option is to estimate the lateral drift during the collection of one bead stack. We define the bead's lateral position at each axial slice k as x_{i,k} = x_i + d_{x,i}·k and y_{i,k} = y_i + d_{y,i}·k, where k = 1, 2, …, N_z, and d_{x,i} and d_{y,i} are additional learning variables that define the drift rates along x and y for each bead stack. To apply the z-dependent lateral drift, the obtained forward model is shifted slice-wise through a 2D Fourier transform. Owing to the additional Fourier transforms, learning with drift estimation slows down by ~10%.
Another option is to incorporate z-dependent photon fluctuations. With this option, the intensity variable for each bead stack becomes a vector, including one variable I_{i,k} for each axial slice. The initial values of I_{i,k} are all the same and equal to the initial value of I_i. The calculation of the forward model stays the same, except that I_i is replaced with I_{i,k}.
With the z-dependent photon variable, it is possible to apply L_norm in the loss function; under this option one can set w_norm = 1. However, we found that adding L_norm sometimes leads to early termination of the optimization. Therefore, we advise setting w_norm = 0 when the optimization terminates at an iteration number less than 60.

PSF learning for a multi-channel system
For multi-channel learning, a single-channel learning is first performed for each channel. We selected the first channel as the reference channel; then a transformation matrix T_t (affine transform) from the reference channel to each target channel t was calculated from the estimated positions. The obtained matrix was set as the initial value of T_t. For multi-channel learning, the variables include the positions, detected photons and the optional drift rates from the reference channel, the backgrounds and PSF models from all channels, and the transformation matrices for all target channels (excluding the reference channel). The forward model of a multi-channel system calculates the forward model of each channel as in single-channel learning, and the x and y positions for the target channel are calculated from the reference-channel positions transformed by T_t, where i is the index of the bead and t is the index of the target channel. As all channels share the same detected photon estimates, the relative intensity ratio between the target channel and the reference channel is automatically incorporated in the learned PSF models.
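Mapping reference-channel positions into a target channel with a 3 × 3 affine matrix can be sketched as follows (the example matrix values are illustrative):

```python
import numpy as np

def to_target_channel(T, xy_ref):
    """Map reference-channel bead positions into a target channel with a
    3x3 affine transformation matrix T, using homogeneous coordinates."""
    xy = np.atleast_2d(xy_ref)
    homo = np.hstack([xy, np.ones((xy.shape[0], 1))])  # (x, y, 1)
    mapped = homo @ T.T
    return mapped[:, :2]

# Example: 2 px shift in x plus a 1% magnification difference (made-up values)
T = np.array([[1.01, 0.0, 2.0],
              [0.0, 1.01, 0.0],
              [0.0, 0.0, 1.0]])
mapped = to_target_channel(T, [[100.0, 50.0]])
```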

PSF learning for a 4Pi-SMLM system
4Pi-SMLM (abbreviated as 4Pi in the following text) is a multi-channel system. To use the same framework as the multi-channel learning described in section 2.2, we developed a single-channel learning procedure for the 4Pi system.

Single-channel learning of the 4Pi-PSF
Calculation of initial values. Each 4Pi bead dataset contains three z-stacks, collected at three different piston phases φ_d = [2π/3, 0, −2π/3] applied by a deformable mirror (DM). We formulated our 4Pi-PSF model based on the IAB model4, where the 4Pi model is defined by the components A and B. Components A and B of each 4Pi bead dataset are obtained from the Fourier components F_0 and F_1 of the data along the piston-phase dimension, where the d-th z-stack of the i-th 4Pi bead dataset enters the transform. Note that the symbol i in a subscript denotes an index and in an exponent denotes the imaginary unit. Similar to single-channel learning of incoherent PSFs, we used the A component of the data instead of the raw data to calculate the initial values of photon and background. The initial position (x_i, y_i, z_i) and phase φ_i of each 4Pi bead dataset are set to zero.
We also set the piston phases φ_d as variables to compensate residual errors in the DM calibration. The initial value of φ_d is [2π/3, 0, −2π/3] and all bead data share the same φ_d.
The initial 4Pi-PSF model contains three 3D matrices, A, Re(B) and Im(B), where we separate the real and imaginary parts of the component B to avoid complex variables; each element of those matrices is set to a constant value of 0.002. Therefore, for single-channel learning of the 4Pi-PSF, the variables are x_i, y_i, z_i, φ_i, I_i, b_i, A, Re(B), Im(B) and φ_d.
Calculation of the forward model. First, we obtain the 4Pi-PSF model from A and B, where k_m = 2π/T describes the modulation of the 4Pi-PSF along the axial dimension; A and B themselves have only a slow modulation in z. The modulation period T is a user-defined constant with a unit of pixels. T can be estimated from the modulation period of the 4Pi bead data, and we slightly adjust the value to minimize the loss. Here we used T equal to 260 nm/dz and 290 nm/dz for bead data measured in the 600 nm and 676 nm channels respectively, where dz is the step size in z.
To shift the PSF model in x, y and z, we apply the same shift to both A and B. (2.32) The remaining terms are defined in the same way as for learning incoherent PSFs.

Multi-channel learning of 4Pi-PSF
Multi-channel learning of the 4Pi-PSF follows the same framework as multi-channel learning of the incoherent PSF. It includes single-channel learning of the 4Pi-PSF from each of the four channels and the initial calculation of the transformation matrices between the target channels and the reference channel. The variables include the positions, phases, photons, piston phases and the optional drift rates from the reference channel, the backgrounds, A, and the real and imaginary parts of B from all channels, and the transformation matrices for all target channels (excluding the reference channel).

Learning of lattice light-sheet PSF
Learning of lattice light-sheet (LLS) PSFs is based on single-channel PSF learning. Due to the geometry of the LLS microscope, moving the sample stage translates the bead in both z and one of the lateral dimensions relative to the detection objective or the camera. Therefore, the forward model for the LLS PSF needs to incorporate this lateral translation. The method is similar to adding drift to the forward model (see section 2.1.5); however, instead of estimating a drift rate, a constant drift is applied to all bead data. These constants are termed skew constants, s_x and s_y, and are equal to the translation (in pixels) of a bead along the x and y axes per axial slice. The skew constants are given by the user and are related to the angle between the detection axis and the sample plane. As the lateral translation can be tens of pixels over the whole z-stack, the raw bead stack can occupy ~60 pixels in one of the lateral dimensions. To reduce the PSF size, we deskewed the raw bead stack, shifting each axial slice by an integer number of pixels. The deskewed bead data are used for the learning, and a 2D shift is applied to the forward model to account for the skew effect. The learned PSF model is the deskewed PSF as seen from a conventional single-objective system.
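The integer-pixel deskew can be sketched as follows; note that np.roll wraps around at the borders, which a real implementation would instead pad or crop:

```python
import numpy as np

def deskew(stack, sx, sy):
    """Deskew an LLS bead stack (z, y, x) by shifting each axial slice k
    back by round(k * sx), round(k * sy) pixels; sx, sy are the user-given
    skew constants in pixels per axial slice."""
    out = np.zeros_like(stack)
    for k in range(stack.shape[0]):
        out[k] = np.roll(stack[k], (-round(k * sy), -round(k * sx)), axis=(0, 1))
    return out

# A bead drifting 2 px in x per slice ends up at a fixed position after deskew
stack = np.zeros((3, 11, 11))
for k in range(3):
    stack[k, 5, 3 + 2 * k] = 1.0
fixed = deskew(stack, sx=2.0, sy=0.0)
```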

Localization test
To evaluate the accuracy of the learned PSF model, we performed a localization test on the same bead data used for learning. Details of the localization algorithms are described in previous studies1,5. In brief, we first calculated the cubic spline coefficients of the learned PSF model, where each voxel contains 64 × n_channel coefficients (for the 4Pi-PSF, 64 × 3 × n_channel coefficients), with n_channel the number of channels; then a maximum likelihood estimation (MLE) is used to estimate the x, y, z positions (and the phase for the 4Pi-PSF) for each 2D bead image using the spline PSF model (see section 7). We evaluated the accuracy of the learned PSF model using the MSE of the axial localization against the ground truth position z_gt, which equals the stage position; the subscripts i and k denote the indices of the beads and of the axial slices in each bead stack respectively. For 4Pi bead data, z_{i,k} = avg_d(z_{i,k,d}), where d is the index of the piston phases. The resulting metric is a vector of N_bead × 1 elements, where N_bead is the number of bead stacks.

Outlier removal
Due to the presence of field-dependent aberrations of the imaging system and of aggregates within the bead sample, it is necessary to remove outlier beads. We used three criteria to identify outliers. The first is based on the MSE between the forward models and the bead data, where the MSEs for incoherent PSFs (e.g. the astigmatism PSF) and for 4Pi PSFs are calculated separately. The third criterion is based on the estimated photons from the PSF learning; it removes bead measurements that are too bright or too dim and is calculated from the vector I, which contains the estimated photon count I_i of each bead stack. Both metrics are vectors of N_bead × 1 elements.
Bead data are considered outliers if their metric values for the above criteria are greater than the user-defined thresholds, for example 3, 3 and 1.5 for the three criteria respectively. (2.42) After outlier removal, the remaining bead data were used to re-learn the PSF model. The resulting PSF model and its corresponding spline coefficients (and the transformation matrices for multi-channel learning) from the second learning step are then used for localization of SMLM data.
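A sketch of a photon-based outlier criterion; the exact normalization used in the text is not recoverable here, so we substitute the deviation of log-photons from the median in units of the median absolute deviation, with an example threshold of 3:

```python
import numpy as np

def photon_outliers(photons, threshold=3.0):
    """Flag beads that are too bright or too dim. The metric (our own
    substitute for the original, unrecoverable normalization) is the
    absolute deviation of log-photons from the median, divided by the
    median absolute deviation."""
    logp = np.log(np.asarray(photons, dtype=float))
    dev = np.abs(logp - np.median(logp))
    mad = np.median(dev) + 1e-12          # avoid division by zero
    return dev / mad > threshold

# A 100x brighter bead (e.g. an aggregate) is flagged
photons = [1000.0, 1010.0, 990.0, 1005.0, 995.0, 100000.0]
mask = photon_outliers(photons)
```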

Pupil-image based PSF learning
In this representation, the pupil function is represented by two 2D matrices, its magnitude and its phase.

Calculation of the forward model
The PSF model is calculated from a Fourier transform of the pupil function h(k_x, k_y), which can be written as h(k_x, k_y) = A·e^(iΦ)·T_apo, where A and Φ are the magnitude and phase components of the pupil function, each a 2D array, and T_apo is the apodization factor of the objective lens, which depends on θ_med and θ_imm, the angles of the optical rays in the sample medium and the immersion medium respectively, which are limited by the numerical aperture (NA) of the objective lens. Here k_x, k_y, k_z are the Cartesian components of the wave vector k, with magnitude k = n_imm/λ, where n_imm is the refractive index of the immersion medium and λ is the central wavelength of the emission filter. The term e^(−i2πk_z(z − z_i)) is the propagation term and accounts for the defocus phase. With pupil-based PSF learning, we can directly add the bead position (x_i, y_i, z_i) to the pupil function and thereby eliminate the edge effect of voxel-based PSF learning. Here z denotes the stage positions and is a vector of N_z × 1 elements. We used a chirp z-transform based on Bluestein's algorithm to calculate the 2D Fourier transform of the pupil function, denoted as ℱ_czt. The benefit of using Bluestein's algorithm is that the size of the pupil function does not depend on the pixel size of the camera, and a pupil of 64 × 64 pixels provides reasonable sampling of the pupil function.
To account for the effects of fluorescence dipole emission, pixelation, bead size, dispersion and chromatic aberration, the PSF model is blurred, where S(k) is the bead kernel in Fourier space as defined in Eq. 2.10 and σ_x and σ_y are the standard deviations (in pixel units) of a 2D Gaussian kernel in real space. By including the photon count and background of each bead stack, the final forward model for pupil-image based learning is obtained analogously to Eq. (2.12).
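A scalar pupil-to-PSF computation can be sketched with a plain FFT in place of the chirp z-transform; the NA, refractive index and wavelength below are illustrative values, not system parameters from the text:

```python
import numpy as np

def psf_from_pupil(A, Phi, z, na=1.35, n_imm=1.406, lam=0.68):
    """Scalar PSF slice at defocus z (same length unit as lam, e.g. microns)
    from pupil magnitude A and phase Phi (N x N arrays), using a plain FFT."""
    n = A.shape[0]
    k0 = n_imm / lam                      # wave-vector magnitude k = n_imm / lambda
    kr_max = na / lam                     # pupil radius in frequency space
    kx = np.linspace(-kr_max, kr_max, n)
    KX, KY = np.meshgrid(kx, kx)
    KR2 = KX ** 2 + KY ** 2
    inside = KR2 <= kr_max ** 2           # NA-limited pupil support
    kz = np.sqrt(np.maximum(k0 ** 2 - KR2, 0.0))
    pupil = A * np.exp(1j * Phi) * np.exp(-2j * np.pi * kz * z) * inside
    field = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(pupil)))
    return np.abs(field) ** 2

# Unaberrated pupil: the in-focus PSF is sharper than a defocused one,
# while the total energy is conserved (the defocus phase has unit magnitude).
A = np.ones((64, 64)); Phi = np.zeros((64, 64))
psf_focus = psf_from_pupil(A, Phi, 0.0)
psf_defocus = psf_from_pupil(A, Phi, 0.5)
```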

Calculation of the Loss function
The loss function for pupil-image based learning uses a data term L_LL based on the log-likelihood function, assuming the converted pixel values follow a Poisson distribution (see section 1). Note that pure data terms such as d_i and d_i·log(d_i) are included for improved numerical stability; since they are constant, they do not influence the position of the optimum. The term L_sm smooths the pupil magnitude and is defined from its first differences, where Δk_x, Δk_y and Δk_z are equal to one pixel in Fourier space. The smoothness of the pupil phase is controlled by L_norm, which constrains the sum of the in-focus PSF to be close to one, where I_ideal is the unaberrated PSF with pupil phase equal to zero and I is the PSF calculated from the Fourier transform of the current pupil phase Φ. We found that without L_norm, the learned pupil phase tends to contain random noise pixels. Those random noise pixels do not affect the shape of the PSF model but lower its total intensity6. The rest of the terms are the same as the ones for voxel-based learning.
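The Poisson log-likelihood data term, with the constant d and d·log(d) terms included so that the loss vanishes when the model matches the data, can be sketched as:

```python
import numpy as np

def poisson_nll(mu, data):
    """Poisson negative log-likelihood including the pure data terms
    d - d*log(d); the loss is ~0 when the model equals the data
    (0*log(0) is taken as 0)."""
    mu = np.asarray(mu, dtype=float)
    d = np.asarray(data, dtype=float)
    safe_d = np.where(d > 0, d, 1.0)      # avoid log(0) for empty pixels
    return float(np.sum(mu - d * np.log(mu) + d * np.log(safe_d) - d))

nll_match = poisson_nll([5.0, 2.0], [5.0, 2.0])  # model equals data
nll_off = poisson_nll([4.0], [5.0])              # model deviates from data
```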
The variables for pupil-image based learning are x_i, y_i, z_i, I_i, b_i, A, Φ, σ_x and σ_y. The initial values of the positions, photons and backgrounds were determined as described for voxel-based learning (see section 2.1.1). The initial values of the elements of A and Φ are 1 and 0, respectively. The initial values of σ_x and σ_y are user defined and set to 0.5 pixel by default.

Zernike-based PSF learning
The calculation of the forward model for Zernike-based PSF learning is like the one for pupil-image based PSF learning, except that we decompose the magnitude and phase components of the pupil function into Zernike polynomials Z_n in Noll order. Here we estimate the coefficients c_{A,n} and c_{Φ,n} for each Zernike polynomial.
As the expansion of the pupil function into Zernike polynomials already has a smoothing effect, we omit the smoothing and normalization terms (Eqs. 3.8 and 3.9) in the loss function, L = w_1·L_LL + w_drift·L_drift + (w_bmin·L_bmin + w_smin·L_smin). (3.11) The variables for Zernike-based PSF learning are x_i, y_i, z_i, I_i, b_i, c_{A,n}, c_{Φ,n}, σ_x and σ_y. The initial values of c_{A,n} and c_{Φ,n} are equal to zero, except the first coefficient of the pupil amplitude, c_{A,0} = 1. The initial values of the other variables were determined as described for pupil-image based learning.
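A minimal construction of a Zernike pupil phase, implementing only Noll terms 4-6 (defocus and the two astigmatisms) with Noll normalization; the function name and grid convention are our own:

```python
import numpy as np

def zernike_phase(coeffs, n=64):
    """Pupil phase as a weighted sum of Zernike polynomials on the unit
    disk; only Noll indices 4-6 are implemented in this sketch."""
    x = np.linspace(-1.0, 1.0, n)
    X, Y = np.meshgrid(x, x)
    r2 = X ** 2 + Y ** 2
    theta = np.arctan2(Y, X)
    disk = r2 <= 1.0
    Z = {
        4: np.sqrt(3) * (2 * r2 - 1),             # defocus
        5: np.sqrt(6) * r2 * np.sin(2 * theta),   # oblique astigmatism
        6: np.sqrt(6) * r2 * np.cos(2 * theta),   # vertical astigmatism
    }
    phase = sum(c * Z[j] for j, c in coeffs.items())
    return phase * disk

defocus = zernike_phase({4: 1.0})   # -sqrt(3) at the pupil centre
astig = zernike_phase({6: 1.0})     # zero mean over the symmetric grid
```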

Vectorial PSF model
In the vectorial PSF model, the pupil field is composed of the components of the electric field at the pupil plane generated by each component of the dipole moment of the fluorophore, where φ is the angular coordinate of the polar coordinate system of the frequency space, and E_p and E_s are the electric field components in the p and s polarizations relative to the plane of incidence at the sample space. p_x, p_y, p_z are the Cartesian components of the dipole moments, and T_p and T_s are the total transmission coefficients of p- and s-polarized light. The fluorescence light emitted by the dipole propagates through the sample medium, the coverslip and the immersion medium. In this three-layer system, we ignore the multiple reflections at the two interfaces; T_p and T_s can then be calculated from the Fresnel transmission coefficients of s- and p-polarized light travelling from medium j to medium l, where n_j and θ_j are the refractive index and the light propagation angle in medium j, and the subscripts 1, 2, 3 denote the sample medium, the coverslip and the immersion medium respectively.
The remaining calculations of the forward model and the definition of the loss functions are the same as in the pupil-image and Zernike-based methods (sections 3.1 and 3.2).

Fourier domain PSF learning of multi-channel systems
The PSF learning methods based on the scalar or vectorial PSF models can also be extended to multi-channel systems. As in voxel-based PSF learning, the x and y positions of the beads in the target channel and the reference channel are related by a 3 × 3 transformation matrix. The learning variables are the positions, photons and optional drift rates from the reference channel, the backgrounds, blurring factors, and pupil functions or Zernike coefficients from all channels, and the transformation matrices for all target channels (excluding the reference channel).

Fourier domain PSF learning for the 4Pi-SMLM system
The Fourier domain PSF learning methods can also be extended to the 4Pi-SMLM system.The learning framework is the same as in voxel-based 4Pi-PSF learning.

Single-channel learning of pupil based 4Pi-PSF
Pupil-based models include both the pupil-image and the Zernike-based models.
Calculation of the forward model. The 4Pi system collects fluorescence emission through both the lower and upper objectives; the coherent superposition of the electric fields (pupil functions) from the two emission paths forms a coherent PSF, while their incoherent superposition forms an incoherent PSF. The 4Pi-PSF in each channel is a sum of the coherent and incoherent PSFs. In the coherent PSF, h_1 and h_2 are the pupil functions from the upper and lower emission paths, the term 2πk_z(z − z_i) is the defocus phase with k_z defined in Eq. (3.1), φ_i is the modulation phase of each PSF dataset, and the piston phase is the phase applied by the deformable mirror plus a phase delay (φ_c, see section 3.5.2) relative to the reference channel. The factor α describes the modulation contrast of the 4Pi-PSF: when α = 1, the 4Pi-PSF is completely coherent; when α < 1, it is partially coherent. Due to the broad fluorescence emission spectrum and the finite bandwidth of the emission filter, the measured 4Pi-PSF is always partially coherent.
Learning the 4Pi-PSF model involves learning both the upper and lower pupil functions, h_1 and h_2. The apodization term is the same as previously defined. For Zernike-based learning, A_1, A_2, Φ_1 and Φ_2 are expressed in terms of Zernike polynomials, where the subscript 1 or 2 denotes the upper or lower path. Here we set the piston phase in Φ_1 to zero, i.e. c_{Φ1,0} = 0, and the relative phase delay between h_1 and h_2 is estimated from the piston phase in Φ_2, which is c_{Φ2,0}.
The obtained PSF model will also be blurred by a bead kernel and a Gaussian kernel as for the case of voxel-based 4Pi-PSF learning.

Calculation of the Loss function.
The loss function of pupil-image based 4Pi-PSF learning includes L_sm = L_sm1 + L_sm2, which smooths both the lower and upper pupil functions, where L_sm1 and L_sm2 are calculated from Eq. (3.8). The term L_αmin constrains the modulation contrast α to be positive. The remaining terms are the same as in pupil-image based learning for incoherent PSFs (section 3.1).
The loss function for Zernike-based 4Pi-PSF learning is defined analogously. (3.24)

Multi-channel learning of the 4Pi-PSF in the Fourier domain
The multi-channel learning of a pupil-based 4Pi-PSF model is similar to that for voxel-based learning. To account for the constant phase delay between channels, we add θ_q to Φ_2 in each channel, where q denotes the channel index; the residual phase delay between channels is incorporated into the piston component of Φ_2. The above definition of θ_q assumes that the phase delay increments by −π/2 over the four channels. Users can define the increment to be −π/2 or π/2, depending on their system. The learning variables include: positions, phases, photons, piston phases and the optional drift rates of the reference channel; backgrounds, pupil functions or Zernike polynomials, the blurring factor and the modulation contrast from all channels; and the transformation matrices for all target channels (excluding the reference channel).
For Zernike-based learning, the user can choose to link the Zernike coefficients between the four channels; note that we do not link the Zernike coefficients between the lower and upper paths. Here j denotes the Zernike index, q the channel index, and p the lower or upper path. For the magnitude part, only the piston term is unlinked, to account for intensity differences between the four channels. For the phase part, the piston phase, the x and y tilts, and defocus are unlinked, to account for the relative phase delay and the x, y and z shifts between the four channels. Although the x and y shifts between channels can be accounted for by the transformation matrices, we noticed that the relative x and y shifts between the upper and lower objectives are also channel dependent. Therefore, we unlink the x and y tilt Zernike coefficients between the four channels.

Learning of field dependent PSFs
For high-NA objectives, field-dependent (FD) aberrations become prominent in large fields of view (FOVs). The PSF model within a FOV of 10 × 10 µm can often be treated as spatially invariant. However, in a FOV of 100 × 100 µm, a single PSF model sometimes can no longer represent all PSF patterns. Here we extend our method to learn a spatially variant PSF model across a large FOV. The forward model for learning the FD aberrations is based on Zernike-based PSF models (either the scalar or the vectorial model). For each Zernike coefficient, a 2D map is learned instead of a scalar value, where each pixel value of the map represents the aberration amplitude at the field location defined by the pixel coordinate. Therefore, in FD PSF learning, a set of 2D aberration maps is learned. The forward model is calculated as in Zernike-based PSF learning (section 3.2), except that the Zernike coefficients are bead-dependent and are calculated from a linear interpolation of the aberration maps. Here x_i and y_i are the pixel coordinates of the i-th bead in the raw camera frame, and the two sets of maps are the aberration maps for the j-th Zernike polynomial of the amplitude part and the phase part of the pupil function, respectively. Each aberration map is a 2D matrix of K × K pixels, where K is user defined; e.g., K = 20 means the FOV is divided into 20 × 20 subregions. The larger K, the finer the aberration map; however, as the aberration maps vary smoothly across the FOV, K can be chosen so that each subregion covers ~10 × 10 µm.
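The linear interpolation of an aberration map at a bead's field position can be sketched as follows. This is an illustrative numpy implementation; the function name, the square K × K map and the single fov_size parameter are our simplifications, not the uiPSF code:

```python
import numpy as np

def interp_zernike_coeff(aberr_map, x, y, fov_size):
    """Bilinearly interpolate a K x K aberration map at camera
    coordinates (x, y), given the FOV size in camera pixels."""
    K = aberr_map.shape[0]
    # map camera coordinates to fractional map-pixel coordinates
    u = x / fov_size * (K - 1)
    v = y / fov_size * (K - 1)
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, K - 1), min(v0 + 1, K - 1)
    du, dv = u - u0, v - v0
    # weighted sum of the four surrounding map pixels
    return ((1 - du) * (1 - dv) * aberr_map[v0, u0]
            + du * (1 - dv) * aberr_map[v0, u1]
            + (1 - du) * dv * aberr_map[v1, u0]
            + du * dv * aberr_map[v1, u1])
```

In the actual learning, one such interpolation is evaluated per Zernike mode and per bead, and gradients flow through the interpolation weights back to the map pixels.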
The loss function for learning the FD PSF model includes a term L_sm that smooths the aberration maps; the remaining terms are the same as the ones in Zernike-based PSF learning (section 3.2).

Learning of refractive index mismatch aberrations
Here we apply our learning algorithm to data from beads embedded in an agarose gel to extract each bead's axial position relative to the coverslip. Fluorescent beads were immobilized throughout the agarose gel and imaged through an oil immersion objective. The refractive indices of the agarose gel and the immersion oil are 1.334 and 1.516, respectively. Therefore, the deeper a bead sits inside the gel, the larger the refractive index mismatch aberration. In the forward model (e.g., the vectorial model), the term e^{i2π(k_z^med z_e − k_z^imm z_s)} accounts for the index mismatch aberration, where n_med and n_imm denote the refractive indices of the sample medium and the immersion medium, respectively, and z_s defines the stage translation relative to the nominal focal position, where the objective is focused at the coverslip.
Here we define z_s = 0 at the nominal focal position, and z_s > 0 when the objective is moved closer to the coverslip, equivalent to the objective being focused at beads away from the coverslip.
z_e is the bead's axial position relative to the coverslip and is a vector of N_bead × 1 elements. During data segmentation, a 3D bead stack is cropped around the 3D pixel coordinate of each bead. The pupil function h(k_x, k_y) is fixed and learned from bead data collected with beads at the coverslip. Both z_e and z_s are estimation parameters during the learning.
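The index-mismatch phase factor can be evaluated on a pupil grid as in the following sketch. Units (µm, cycles/µm) and the placement of the refractive indices inside k_z follow one common convention and are assumptions, not a copy of the toolbox code:

```python
import numpy as np

def mismatch_phase(kx, ky, z_e, z_s, n_med=1.334, n_imm=1.516, wl=0.68):
    """Index-mismatch phase factor exp(i*2*pi*(kz_med*z_e - kz_imm*z_s)),
    with kz = sqrt((n/wl)^2 - kx^2 - ky^2) clipped to the propagating
    region; lengths in um, kx and ky in cycles per um."""
    kr2 = kx**2 + ky**2
    kz_med = np.sqrt(np.maximum((n_med / wl) ** 2 - kr2, 0.0))
    kz_imm = np.sqrt(np.maximum((n_imm / wl) ** 2 - kr2, 0.0))
    return np.exp(2j * np.pi * (kz_med * z_e - kz_imm * z_s))
```

When n_med = n_imm and z_e = z_s, the factor reduces to unity, i.e. no mismatch aberration, consistent with the matched-index case discussed below.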

Learning of in situ PSF models
In situ PSF learning extracts the PSF model directly from raw SMLM data, i.e., camera frames containing the patterns of single blinking emitters. Here our method is based on the vectorial PSF model.

Calculation of initial z values
As the PSF pattern varies with the emitter's axial position, learning of the in situ PSF model depends on the initial estimates of the emitters' axial positions. Here we localize all emitters using an initial PSF model. The localization method is based on MLE with a spline-interpolated PSF model (sections 2.5 and 7). We provide three options for estimating the initial PSF model.
The first option is to learn the initial PSF model from bead data collected on the same imaging system. The resulting file can be set as an input parameter for the subsequent in situ PSF learning. This method gives an accurate initial z estimation; however, it requires collection of bead data.
The second option is to generate the initial PSF model from a set of user-defined Zernike polynomials. Here the user needs to select the Zernike polynomials that represent the dominant aberrations of the imaging system and set approximate amplitudes for the selected Zernike polynomials. For example, most 3D-SMLM systems have an astigmatism aberration, so the user can set the Zernike polynomial to 5 (Noll index) and its coefficient to 0.5 (radians; its absolute value quantifies the RMS wavefront error). Although this method requires no bead measurement, the user needs to know the dominant aberration of the imaging system. We found that the learning algorithm is quite robust to a wrong initial aberration, as long as the trend of the initial z positions of the single molecules agrees with the real situation; however, the convergence might be somewhat slower.
The third option requires minimal knowledge of the imaging system from the user: it searches through a set of Zernike polynomials to determine the dominant aberration. The user only needs to define whether the search range covers the lower-order (5-21) or the higher-order (22-45) Zernike polynomials. During the search, a PSF model for each Zernike polynomial with an amplitude of 0.5 or -0.5 is generated and used to localize all emitters. The median of the log-likelihood ratios 7 (LLR) of all localizations is used as a quality metric of the PSF model. The PSF model that gives the highest quality metric is used as the initial PSF model. The search typically takes 2-3 minutes.
In general, the second option is recommended over the third if the dominant aberration is known.
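The model-selection logic of the third option, ranking candidate initial models by the median log-likelihood over all emitters, can be illustrated with a 1D toy example in which localization is replaced by direct likelihood evaluation (the Gaussian candidate models, photon numbers and sizes are all illustrative, not the actual search):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(-5, 6)

def model(sigma, photons=2000, bg=5):
    # 1D stand-in for a candidate PSF model with one "aberration" parameter
    p = np.exp(-x**2 / (2 * sigma**2))
    return photons * p / p.sum() + bg

# Poisson "blinking" data generated from the true model (sigma = 1.5)
data = rng.poisson(model(1.5), size=(200, x.size))

def median_ll(sigma):
    # score a candidate model by the median Poisson log-likelihood
    mu = model(sigma)
    ll = (data * np.log(mu) - mu).sum(axis=1)  # up to a data-only constant
    return np.median(ll)

candidates = [0.8, 1.5, 2.5]
best = max(candidates, key=median_ll)
```

The median (rather than the mean) makes the metric robust to the outlier emitters that are unavoidable in raw SMLM data.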

Partitioning of data
Due to overlapping emitters and emitters with low photon counts, many detected emitters are not useful for learning the PSF model. Here we select emitters by partitioning them into small axial segments and keeping, within each segment, a small number of emitters with the highest log-likelihood ratios from the initial localization results. The number of axial segments and the number of emitters per segment are user defined; here we use 21-31 and 100-200, respectively. After partitioning, 2000-3000 emitters are selected and used for PSF learning.
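A minimal sketch of this partitioning step (the function name and defaults are ours, not the toolbox API):

```python
import numpy as np

def partition_emitters(z, llr, n_seg=25, n_per_seg=150):
    """Divide emitters into n_seg axial segments and keep, in each
    segment, up to n_per_seg emitters with the highest LLR."""
    edges = np.linspace(z.min(), z.max() + 1e-9, n_seg + 1)
    keep = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = np.where((z >= lo) & (z < hi))[0]
        order = np.argsort(llr[idx])[::-1]      # best LLR first
        keep.extend(idx[order[:n_per_seg]])
    return np.asarray(keep)
```

Segmenting in z before ranking prevents the selection from collapsing onto in-focus emitters, which would leave the defocused parts of the PSF unconstrained.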

Calculation of the forward model
In the forward model, the electric field from dipole radiation is defined in Eq. (3.13), n_med and k_z are defined in Eq. (3.31), and x_e, y_e, z_e define the position of each emitter in the sample medium; in particular, z_e defines the distance of the emitter to the coverslip and is constrained to be positive. z_s is the stage position defined in section 3.7; z_s is positive when the objective is focused on emitters above the coverslip. Since no emitter is below the coverslip during SMLM data acquisition, we constrain z_s to be positive.
In the case of index matching, or with a water-dipping objective, n_med = n_imm; changes in z_e and z_s are then equivalent, so z_s is fixed to a user-defined value. Independent of index matching, the initial value of z_s should be large enough to ensure that z_e remains positive.
The pupil function h(k_x, k_y) can be expressed either as 2D matrices or in terms of Zernike polynomials. For common aberrations of the imaging system, the Zernike-based pupil function is sufficient, while for engineered PSFs where the applied phase variation is not smooth, such as Tetrapod and double-helix PSFs, learning image-based pupil functions is necessary. The user can also choose to set the magnitude part of the pupil function to a unit circle, so that the learning process only learns the phase part of the pupil function. This is a good approximation for most imaging systems and can greatly improve convergence, especially for complex PSF models. Therefore, for most learning results, we fix the magnitude part of the pupil function.
Similar to learning from bead data, the PSF model is blurred with a 2D Gaussian kernel. Note that here we use a 2D Fourier transform, since for in situ PSF learning the model is a 2D matrix instead of a 3D array as in Eq. (3.4). We found that the estimated blur widths σ_x and σ_y are slightly smaller than those from bead data: although in situ PSF learning is expected to extract the PSF model from single fluorophores, which are much smaller than fluorescent beads, the extra blurring effect still exists. Finally, including the photon count and background of each emitter completes the forward model. The variables for in situ PSF learning are x_e, y_e, z_e, photon and background for each emitter; z_s; the blur widths σ_x and σ_y; and the pupil magnitude and phase, A and Φ, for learning image-based pupil functions, or the Zernike coefficients of the pupil magnitude and phase, c_{A,j} and c_{Φ,j}, for learning Zernike-based pupil functions.
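Blurring with a 2D Gaussian kernel is conveniently applied in the Fourier domain, where it is a multiplication of the OTF by a Gaussian. A minimal sketch (pixel-unit kernel widths; the helper name is ours):

```python
import numpy as np

def blur_psf_2d(psf, sigma_x, sigma_y):
    """Blur a 2D PSF by multiplying its OTF with a Gaussian,
    exp(-2*pi^2*(sigma_x^2*kx^2 + sigma_y^2*ky^2)), sigmas in pixels."""
    ny, nx = psf.shape
    kx = np.fft.fftfreq(nx)
    ky = np.fft.fftfreq(ny)
    KX, KY = np.meshgrid(kx, ky)
    # Fourier transform of a unit-integral Gaussian kernel
    otf_gauss = np.exp(-2 * np.pi**2 * (sigma_x**2 * KX**2 + sigma_y**2 * KY**2))
    return np.real(np.fft.ifft2(np.fft.fft2(psf) * otf_gauss))
```

Because the Gaussian OTF equals 1 at zero frequency, the total intensity of the PSF is preserved by the blur.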

Calculation of the loss function
In the loss function, L_sm smooths the pupil function and is the same as the one for bead data. However, for learning a pupil function with discontinuous phase variations, the smoothing term is set to zero.

Iterative learning
Similar to learning the PSF from bead data, one iteration of in situ PSF learning consists of initial learning, outlier removal and re-learning. However, for in situ PSF learning, after one learning iteration the learned PSF model is biased towards the initial PSF model. Furthermore, when the PSF pattern is complex and the photon counts of the emitters are low, one learning iteration is not sufficient to converge to the correct PSF model. Therefore, for in situ PSF learning, multiple learning iterations are used: for the first iteration, the initial PSF model is determined as described in section 4.1.1; for subsequent iterations, the initial PSF model is the learned PSF model from the previous iteration. The minimum photon count used in data partitioning, defined as a quantile of all photon counts, is set to a user-defined value, e.g. 0.4, and decreases by 0.1 in each subsequent iteration until it reaches 0.2. For most tested data, 2 iterations were sufficient, but for the Tetrapod PSF generated from a phase mask, 4-6 iterations were required to converge (SI Fig. 16).
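The decreasing photon-count quantile can be written as a simple schedule; this is a sketch of the rule described above, not toolbox code:

```python
def photon_quantile_schedule(n_iter, q_start=0.4, step=0.1, q_min=0.2):
    """Minimum-photon-count quantile used for data partitioning in each
    learning iteration: starts at q_start, drops by `step` per
    iteration, and is floored at q_min."""
    return [max(q_start - i * step, q_min) for i in range(n_iter)]
```

Relaxing the threshold over iterations lets early iterations rely on bright, well-constrained emitters, while later iterations add dimmer emitters once the model is close to convergence.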

In situ PSF learning for multi-channel systems
For in situ PSF learning of a multi-channel system, a single-channel learning as described above is first performed for each channel. However, in this single-channel learning no data partitioning is applied, in order to maintain a sufficient number of emitters.
After single-channel learning, an initial transformation matrix from each target channel to the reference channel is calculated. As the emitters selected by each single-channel learning might differ between channels, a coordinate pairing process is performed again (section 1.3). With the transformation matrices and the learned PSF model from each channel, a multi-channel localization is performed. Based on the localization results, outlier emitters are removed and the remaining emitters are partitioned as described above. The partitioned emitters are used for the subsequent multi-channel PSF learning, which is similar to the multi-channel learning from bead data.

In situ PSF learning of 4Pi-SMLM systems
Learning a 4Pi-PSF model from SMLM data is more complicated than learning one from bead data: in SMLM data, there is no known sampling of the PSF in z and phase positions. To decouple the phase and z positions of a 4Pi-PSF, the forward model of the 4Pi-PSF is slightly modified.

Single-channel learning of in situ 4Pi-PSF
Calculation of the forward model. The 4Pi-PSF is still calculated as the sum of the incoherent and coherent PSFs (section 3.5). The coherent PSF is calculated with a modified phase term; the remaining terms are the same as the ones for learning from bead data (section 3.5).

Multi-channel learning of the in situ 4Pi-PSF
First, the 4Pi-PSF model for each channel is generated based on the initial Zernike coefficients. A 4Pi localization is performed using the generated model, where the x and y localizations between the four channels are not linked; this is because the transformation matrix is unknown and initially set to an identity matrix. Outlier emitters are removed based on the localization results. A single-channel 4Pi-PSF learning is then performed on the remaining emitters in each channel, where the localization results for x_e, y_e, photon and background are used as initial values. After single-channel learning, an initial transformation matrix from each target channel to the reference channel is calculated. Note that for in situ 4Pi-PSF learning, there is no outlier removal step during the single-channel learning. After all single-channel learnings, the emitters are partitioned (section 4.1.2) and a multi-channel learning similar to the one for bead data is performed on the partitioned emitters.

In situ PSF learning of field-dependent PSF
The in situ PSF learning of a single-channel system can be easily extended to learn FD PSFs. The forward model is the same as Eq. (4.1), except that the Zernike coefficients of each emitter are calculated from a linear interpolation of a 2D aberration map (Eq. (3.27)). To ensure that emitters are evenly sampled across the FOV, the data partitioning described in section 4.1.2 is extended to 3D, where the FOV is also divided into N_x × N_y lateral segments, together with N_z axial segments. The number of segments in each dimension is user defined. All emitters are partitioned into the 3D segments based on their log-likelihood values as described previously.
The loss function is L = LL + β_sm L_sm + (β_bmin L_bmin + β_smin L_smin + β_zmin L_zmin), (4.12) where the smoothness constraint β_sm L_sm is defined in Eq. (3.29). The remaining process is the same as for in situ PSF learning of a single-channel system.

Calibration of the deformable mirror
Based on the work described by Antonello et al. 8, we employed a rigorous methodology to accurately determine the influence function of each actuator in the DM (Boston Micromachines, M140A-35-P01). This ensures reliable and precise measurements of each individual actuator's impact on the overall performance of the DM. To begin with, we constructed a Twyman-Green interferometer, a variant of the widely used Michelson interferometer specifically designed for precise optical testing. This interferometer allowed us to reliably measure the phase induced by the DM. To extract the phase information, we employed Fourier-based fringe analysis 9, a powerful technique that enables precise determination of phase shifts in interferometric measurements. By carefully analyzing the fringes produced by the Twyman-Green interferometer, we accurately quantified the influence of each individual actuator in the DM. In this work, we assumed that the responses of the DM actuators are linear, a commonly accepted approximation in similar studies. This assumption allowed us to simplify the analysis and establish a linear model for the relationship between the actuator inputs and the resulting output phase ψ:

ψ = Σ_m v_m ψ_m,   (5.1)

where ψ_m represents the influence function of the m-th actuator and v_m is the control voltage applied to the m-th actuator. A vectorized version of Eq. (5.1) is Ψ_d = ΦV, where Ψ_d is the discrete measured phase. The control signal V is a vector of size N_a, whereas Ψ_d is a vector of size N_ψ. The corresponding influence matrix Φ then has size N_ψ × N_a, and each column of Φ is a vectorized influence function. By sequentially applying different amplitudes to each actuator and capturing the resulting interference images, we derive a series of output phases [Ψ_1, Ψ_2 … Ψ_K] corresponding to their control signals. Once the influence matrix Φ is established, we can efficiently determine the control signals required to achieve a desired wavefront shape by matrix calculations. This direct wavefront design approach, utilizing the DM influence functions, offers improved accuracy compared to the conventional method of projecting a theoretical Zernike phase onto the DM. To facilitate Zernike-based DM control, it is necessary to assume Ψ ≈ ℤz, where ℤ is a matrix whose columns are Zernike polynomials sampled over the phase measurement grid and z is a Zernike coefficient vector with N_z components. Since the influence functions of the DM do not resemble Zernike polynomials, more Zernike polynomials (N_z > N_a) are necessary to describe the influence functions and avoid large approximation errors. Zernike control requires calculating the mapping between the control signals V and the Zernike coefficients z, which can be solved by linear regression.
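The influence-matrix control step can be illustrated with a toy example: synthetic Gaussian influence functions stand in for the measured ones, and the control voltages for a defocus-like target wavefront are obtained by least squares (all grid sizes, actuator counts and widths below are illustrative):

```python
import numpy as np

# toy influence matrix: each actuator produces a Gaussian bump on a
# 32 x 32 phase grid (on a real DM these are measured interferometrically)
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 32),
                            np.linspace(-1, 1, 32)), -1)
centers = np.stack(np.meshgrid(np.linspace(-0.8, 0.8, 6),
                               np.linspace(-0.8, 0.8, 6)), -1).reshape(-1, 2)
F = np.stack([np.exp(-((grid - c) ** 2).sum(-1) / 0.2).ravel()
              for c in centers], axis=1)        # shape (1024, 36)

# desired wavefront: defocus-like r^2 phase
target = (grid ** 2).sum(-1).ravel()

# least-squares control signals: V = argmin ||F V - target||^2
V, *_ = np.linalg.lstsq(F, target, rcond=None)
residual = np.linalg.norm(F @ V - target) / np.linalg.norm(target)
```

The same least-squares machinery gives the V-to-Zernike mapping for Zernike control, by regressing the columns of the influence matrix onto the Zernike basis ℤ.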

CRLB calculation and analytical gradients
The CRLB is calculated from the inverse of the Fisher information matrix F(Θ), which measures the amount of information that an observation (PSF) carries about the estimated parameters Θ. For Poisson-distributed pixel counts, the Fisher information matrix is defined as 10 F_ij = Σ_k (1/μ_k)(∂μ_k/∂θ_i)(∂μ_k/∂θ_j), where μ_k is the expected photon count in pixel k, and the CRLB of parameter θ_i is given by the i-th diagonal element of F⁻¹.
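Assuming Poisson pixel statistics, the Fisher information and CRLB can be computed numerically from any forward model. The sketch below uses finite-difference derivatives on a toy 1D Gaussian spot; the actual implementation uses the analytical gradients of the forward model:

```python
import numpy as np

def fisher_poisson(mu_fn, theta, eps=1e-6):
    """Fisher information for a Poisson image model mu(theta):
    F_ij = sum_k (1/mu_k) dmu_k/dtheta_i dmu_k/dtheta_j,
    with derivatives taken by central finite differences."""
    theta = np.asarray(theta, float)
    mu0 = mu_fn(theta)
    J = np.empty((mu0.size, theta.size))
    for i in range(theta.size):
        d = np.zeros_like(theta); d[i] = eps
        J[:, i] = (mu_fn(theta + d) - mu_fn(theta - d)).ravel() / (2 * eps)
    return J.T @ (J / mu0.ravel()[:, None])

# toy 1D Gaussian spot, parameters (x0, photons); background keeps mu > 0
x = np.arange(-8, 9, dtype=float)
def mu_fn(t):
    x0, N = t
    p = np.exp(-(x - x0) ** 2 / 2); p /= p.sum()
    return N * p + 2.0

F = fisher_poisson(mu_fn, [0.0, 1000.0])
crlb = np.diag(np.linalg.inv(F))   # variance lower bounds for (x0, N)
```

The square roots of the diagonal of F⁻¹ are the precision limits plotted as √CRLB in the localization tests.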

Localization methods
For the localization step during PSF learning, the data were analyzed with the cubic spline fitting method 1,5. A cubic spline interpolation of the given PSF model is built 4,11 and the parameters are estimated by minimizing the merit function χ²_MLE = 2Σ_k[μ_k − d_k − d_k ln(μ_k/d_k)], where μ_k and d_k are the expected and measured photon numbers in pixel k, respectively. We used a modified Levenberg-Marquardt (L-M) algorithm 11 to minimize χ²_MLE for the parameter estimation.
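The Poisson MLE merit function minimized by the L-M fitter can be written as follows; this is the standard form consistent with the definitions of μ_k and d_k above (implementation details in the toolbox may differ):

```python
import numpy as np

def mle_merit(mu, d):
    """Poisson MLE merit chi2 = 2*sum(mu - d - d*ln(mu/d));
    non-negative, and zero exactly when mu == d."""
    mu = np.asarray(mu, float)
    d = np.asarray(d, float)
    merit = 2.0 * np.sum(mu - d)
    pos = d > 0                        # d*ln(mu/d) -> 0 as d -> 0
    merit -= 2.0 * np.sum(d[pos] * np.log(mu[pos] / d[pos]))
    return merit
```

Unlike a least-squares merit, this form correctly weights low-count pixels, which matters for the dim, high-background pixels at the PSF periphery.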

SI Fig 3.
Localization test of estimated voxel and Zernike based PSF models on simulated data of a single-channel system with astigmatism aberration. (a) Localization of simulated data which were not used during the inverse modelling process. The bead size used in the simulation is 50 nm. The data were simulated at z positions from -500 nm to 500 nm, with a step size of 50 nm and 40 images per z position. (b) Ratio of the standard deviation of the localized positions and the average theoretical estimation precision (√CRLB) over 40 repeats per z position. A value of ~1 indicates that we reach the CRLB. (c) Comparison of estimated PSF models with (+) or without (-) incorporating the bead size in the forward model, and their axial localization biases. PSF models were estimated from simulated data with a bead size of 100 nm. Localization was performed on simulated data with a bead size of 40 nm. The residue shows the difference between the PSF model and the simulated data with a bead size of 40 nm. (d) Same as c, except that the PSF models were estimated from simulated data with a bead size of 200 nm. Scale bars: 1 µm.

SI Fig 4.
Localization test of estimated voxel and Zernike based PSF models on simulated data of a 4Pi-SMLM system. (a) Localization of simulated data which were not used during the inverse modelling process. The bead size used in the simulation is 50 nm. The data were simulated at z positions from -500 nm to 500 nm, with a step size of 50 nm and 40 images per z position. (b) Ratio of the standard deviation of the localized positions and the average theoretical estimation precision (√CRLB) over 40 repeats per z position. (c) Comparison of estimated PSF models with (+) or without (-) incorporating the bead size in the forward model, and their axial localization biases. PSF models were estimated from simulated data with a bead size of 100 nm. Localization was performed on simulated data with a bead size of 40 nm. The residue shows the difference between the PSF model and the simulated data with a bead size of 40 nm. (d) Same as c, except that the PSF models were estimated from simulated data with a bead size of 200 nm. Scale bars: 500 nm.

SI Fig 5. Estimation accuracy of voxel and Zernike based PSF models on simulated data of a single-channel system with astigmatism aberration. (a) Comparison of the ground truth and estimated values of the x, y, z, photon and background of each bead stack. (b) Comparison of the ground truth and the estimated PSF models. Scale bars: 1 µm.

SI Fig 6. Estimation accuracy of voxel and Zernike based PSF models on simulated data with large photon fluctuations in z. (a) Comparison of the ground truth and estimated values of the x, y, z, average photon count over z and background of each bead stack. (b) Comparison of the ground truth and the estimated PSF models. The last column shows an example comparison of one bead's data and its corresponding forward model. (c) Example comparison of the intensity at the center pixel over z between the data and its forward model. Scale bars: 1 µm.

SI Fig 7.
Estimation accuracy of voxel and Zernike based PSF models on simulated data of a 4Pi-SMLM system. (a) Comparison of the ground truth and estimated values of the x, y, z, phase, photon and background of each bead stack. (b) Comparison of the ground truth and the estimated PSF models. Scale bar: 1 µm.

SI Fig 8. Experimental localization test of estimated voxel and Zernike-vector based PSF models from a single-channel system. (a) Localization of bead data which were not used during the inverse modelling process. The bead size is 40 nm. The data were collected at z positions from -500 nm to 500 nm, with a step size of 50 nm and 40 frames per z position. (b) Ratio of the standard deviation of the localized positions and the average theoretical estimation precision (√CRLB) over 40 repeats per z position. (c) Comparison of estimated PSF models with (+) or without (-) incorporating the bead size in the forward model, and their axial localization biases. Voxel-based PSF models were estimated from 100 nm bead data. Localization was performed on 40 nm bead data. The residue shows the difference between the PSF model and the 40 nm bead data. (d) Same as c, except that the PSF models were estimated from 200 nm bead data. Scale bars: 1 µm.

SI Fig 9.
Experimental localization test of estimated voxel and Zernike based PSF models from a 4Pi-SMLM system. (a) Localization of bead data which were not used during the inverse modelling process. The bead size is 40 nm. The data were collected at z positions from -500 nm to 500 nm, with a step size of 50 nm and 40 frames per z position. (b) Ratio of the standard deviation of the localized positions and the average theoretical estimation precision (√CRLB) over 40 repeats per z position. (c) Comparison of estimated PSF models with (+) or without (-) incorporating the bead size in the forward model, and their axial localization biases. Voxel-based PSF models were estimated from 100 nm bead data. Localization was performed on 40 nm bead data. The residue shows the difference between the PSF model and the 40 nm bead data. (d) Same as c, except that the PSF models were estimated from 200 nm bead data. Scale bar: 1 µm.

SI Fig 10. Estimation of a Zernike-based PSF model of a 4Pi-SMLM system from experimental bead data. Bead data were collected by imaging 40 nm red beads at z positions from -500 nm to 500 nm, with a step size of 50 nm; three phase positions at -2π/3, 0, 2π/3 were collected at each z position. (a) Localization on bead data which were used for inverse modelling. Here the z position is converted from the estimate of the phase. (b) Estimated Zernike coefficients of each channel from the upper emission path and the corresponding pupil function. (c) Estimated Zernike coefficients of each channel from the lower emission path and the corresponding pupil function.

SI Fig 11. Comparison of voxel-based PSF modelling with uiPSF and the traditional cross-correlation based method.
SI Fig 13. Comparison of different apodization terms in the presence of index mismatch aberration. The Zernike vectorial PSF model was used. Bead data were collected by imaging 40 nm red beads at z positions from -1000 nm to 1000 nm, with a step size of 50 nm. Localization was performed on the bead data used for inverse modelling. All tested apodization terms are identical if the refractive indices are matched.

SI Fig 14. Effect of considering lateral drift in the forward model. Data were collected from a 4Pi-SMLM system. The voxel-based PSF modelling method was used. (a) Localization of the data used for inverse modelling. The data contain bead stacks with large drift in x. The inverse modelling includes the estimation of the lateral drift. (b) Localization of a different dataset that contains small lateral drifts. The inverse modelling included the estimation of the lateral drifts, but from the data in a. (c) Localization of the same data as in b. The inverse modelling assumed no lateral drifts and used the data in a. The localization bias in x is larger than the one in b, which indicates that without estimating the lateral drifts during inverse modelling, the estimated model will be affected by the lateral drifts in the data. (d) The estimated lateral drifts from inverse modelling using the data in a.

SI Fig 15. Reconstruction of Nup96-mMaple in U2OS cells from 4Pi-SMLM using the PSF models estimated from bead data and single-molecule blinking data. (a) A subregion of the reconstructed Nup96 using the bead PSF model. Same as Fig. 2h. (b) XY view of the selected region in a from the bead and in situ PSF models. (c) XZ view of the selected region in a from the bead and in situ PSF models. Here, the refractive index of the sample medium is matched with the silicone oil immersion medium. For the in situ model, the PSF is directly estimated from the single-molecule blinking data without the need for an additional PSF calibration. Scale bars: 2 µm (a), 100 nm (b, c).

SI Fig. 16. Pupil-image based in situ PSF estimation of Tetrapod blinking patterns generated from a phase plate. Estimated PSF models and pupil images from iterations 1 to 6 are shown. The PSF model converges after four iterations. The initial pupil function was generated from the 13th Zernike polynomial (Noll index, diagonal 2nd astigmatism) with an amplitude of -2. The PSF model and pupil from the 6th iteration are shown in Fig. 3c.

SI Fig. 17. Performance of in situ PSF modelling estimated from single-molecule blinking data. (a) Comparison of the ground truth and the estimated PSF models from simulated single molecules located in the axial range from -600 nm to +600 nm. The simulated data consist of 41 single molecules equally spaced at z positions between -600 nm and 600 nm, with a total of 5000 photons and a background level of 10. (b) Comparison of the amplitudes of the 21 Zernike modes (blue diamonds) returned from in situ PSF modelling and the ground truth Zernike coefficients (red circles). (c) and (d) have the same parameters as (a) and (b). 1000 repeated fitting calculations were performed to determine the localization precision. (c) Localization precision of the Zernike aberrations. (d) Localization precision of the 3D positions. CRLB is the Cramér-Rao lower bound.

SI Fig 18. Effect of the number and distribution of single molecules along the z-axis on the fitting accuracy. (a) Uniform distribution of single molecules with varying total numbers along the z-axis. (b) Comparison of the fitting accuracy of Zernike coefficients under the distribution described in (a). Total Nmol is the total number of single molecules used in fitting. (c) Uneven distribution of 250 single molecules along the z-axis. (d) Comparison of the fitting accuracy of Zernike coefficients under the distribution described in (c). Nor index is a measure of the level of aggregation of single molecules near z = 0. (a) and (b) have the same simulation parameters as SI Fig. 17. The figure shows that single molecules at more defocused positions result in better estimation accuracy of the aberrations.

SI Fig. 19. Comparison of pupil-image and Zernike based learning of double-helix PSFs from bead data. (a) Comparison of estimated PSF models and pupil functions with one example bead dataset and the SLM pattern used to generate the DH-PSFs. The estimated pupil functions show a slight rotation compared to the SLM pattern; this could be caused by misalignment of the SLM and aberrations in the optical system. N represents the number of Zernike polynomials used in learning. (b) Comparison of localization biases on the bead data using the different PSF models in a. Zernike-based PSF models result in large biases; even with 153 Zernike polynomials, the Zernike-expanded pupil function still fails to model the correct pupil for DH-PSFs. The data were collected by imaging 200 nm crimson beads (F8806, Thermo Fisher) dried on a coverslip. The beads were imaged by moving the sample stage from -1 to 1 µm, with a step size of 100 nm. A silicone oil objective (NA = 1.35) was used. The DH-PSFs were generated from the phase pattern in a using a spatial light modulator (Meadowlark).

SI Fig 20. Example SMLM frames from different imaging systems. Example raw camera frames used for in situ PSF modelling of various imaging systems, corresponding to the results in Fig. 3. Scale bars: 10 µm.

SI Fig. 21. Comparison of Zernike-based FD PSF modelling using uiPSF and FD-DeepLoc. (a) FD-DeepLoc individually fits each bead, where each bead corresponds to a set of Zernike coefficients. These coefficients are interpolated to generate an FD map. However, the presence of clustered or unevenly distributed beads during the fitting process often leads to singular values on the FD map. In contrast, uiPSF performs a global fit on all beads, optimizing the FD map globally and resulting in a relatively smoother FD map. (b) Comparison of the impact of fixing versus fitting the amplitude aberration on localization accuracy. uiPSF obtained PSF models by fitting and by fixing the amplitude aberration parameters, and these models were then used to localize FD bead data. Comparing the results, relaxing the amplitude aberration parameters resulted in a smaller localization bias, indicating that the fitted model better represents the true PSF.

The cubic spline interpolation of a given PSF model is

f(x, y, z) = Σ_{i,j,k=0}^{3} c_{i,j,k,m,n,p} ((x − x_m)/Δx)^i ((y − y_n)/Δy)^j ((z − z_p)/Δz)^k,

where Δx and Δy are the x and y pixel sizes, Δz is the axial step size of the PSF model, c_{i,j,k,m,n,p} are the spline coefficients, and (x_m, y_n, z_p) is the start position of the voxel (m, n, p). After building the spline PSF model, maximum likelihood estimation (MLE) with Poisson statistics was used to localize beads or single molecules. For outlier removal, the measured and fitted z positions are vectors of N_bead × 1 elements, and the second criterion is based on the MSE of the axial localization bias calculated above.

The forward model for each emitter results in a 2D image and is based on the vectorial PSF model, in which the pupil function h(k_x, k_y) is multiplied by the index mismatch phase term e^{i2π(k_z^med z_e − k_z^imm z_s)} and the lateral shift phase e^{i2π(k_x x_e + k_y y_e)} before the chirp-z Fourier transform ℱ_czt. The loss function for in situ PSF learning with a Zernike-based pupil function is

L = LL + (β_bmin L_bmin + β_smin L_smin + β_zmin L_zmin),   (4.4)

where the definitions of LL, L_bmin and L_smin are the same as the ones for bead data. The last term constrains z_e and z_s to be positive,

L_zmin = min(z_s, 0)² + Σ min(z_e, 0)².

The loss function for in situ PSF learning with an image-based pupil function is

L = LL + β_sm L_sm + (β_bmin L_bmin + β_smin L_smin + β_zmin L_zmin).

For in situ 4Pi-PSF learning, h_ch measures the thickness of the sample chamber, and n_imm and n_med are the refractive indices of the immersion oil and the sample medium. The modified phase term accounts for the index mismatch aberration; however, its calculation differs from the one defined in section 3.7 by the subtraction of n_med k_z z_s, where n_med and k_z are defined in Eq. (3.31). The term n_med k_z z_s varies only with z_s and can be absorbed into the phase variable, which therefore becomes a slowly varying function of z_s. This modification decouples the emitter z position from the phase: any phase drift during data collection will not affect the learning. In other words, emitters collected at large separations in time can be used together for learning, and the learned PSF model can be used for the entire SMLM dataset. A further term accounts for the index mismatch aberration in the upper emission path when the upper objective is focused at the bottom coverslip: when the emitter is at the bottom coverslip, z_e = z_s = 0, and the only index mismatch aberration comes from the upper emission path. When the refractive indices are matched, this term equals zero. The loss function is

L = LL + (β_bmin L_bmin + β_smin L_smin + β_zmin L_zmin + β_αmin L_αmin),   (4.10)

where L_zmin = min(z_s, 0)² + min(h_ch, 0)² + Σ min(z_e, 0)².

For the CRLB calculation, Θ is the set of parameters being estimated and k denotes the pixel index. The parameters Θ for in situ PSF learning are x_e, y_e, z_e, photon and background for each emitter; z_s; the blur widths; and the pupil magnitude and phase, A and Φ, for learning a pixel-wise pupil function, or the Zernike coefficients of the pupil magnitude and phase, c_{A,j} and c_{Φ,j}, for learning a Zernike-expanded pupil function. μ is the forward model as defined in Eq. (3.12). The first derivatives of the forward model μ with respect to the parameters Θ are obtained analytically by differentiating the argument of ℱ_czt; for example, the derivative with respect to z_e multiplies the pupil function by the factor i2πk_z^med before the transform.