Serial Crystallography with Multi-stage Merging of 1000s of Images

KAMO and Blend provide particularly effective tools to manage automatically the merging of large numbers of datasets from serial crystallography. The requirement for manual intervention in the process can be reduced by extending Blend to support additional clustering options such as use of more accurate cell distance metrics and use of reflection-intensity correlation coefficients to infer “distances” among sets of reflec- tions. This increases the sensitivity to differences in unit cell parameters and allows for clustering to assemble nearly complete datasets on the basis of intensity or ampli- tude differences. If datasets are already sufficiently complete to permit it, one applies KAMO once and clusters the data using intensities only. If starting from incomplete datasets, one applies KAMO twice, first using cell parameters. In this step we use either the simple cell vector distance of the original Blend, or we use the more sensi- tive NCDist. This step tends to find clusters of sufficient size so that, when merged, each cluster is sufficiently complete to allow reflection intensities or amplitudes to be compared. One then uses KAMO again using the correlation between the reflections having a common hkl to merge clusters in a way sensitive to structural differences that may not have perturbed the cell parameters sufficiently to make meaningful clusters. Many groups have developed effective clustering algorithms that use a measurable physical parameter from each diffraction still or wedge to cluster the data into cate- gories which then can be merged, one hopes, to yield the electron density from a single protein form. Since these physical parameters are often largely independent from one another, it should be possible to greatly improve the efficacy of data clustering software by using a multi-stage partitioning strategy. Here, we have demonstrated one possible approach to multi-stage data clustering. Our strategy is to use unit-cell clustering until merged data is sufficiently complete then to use intensity-based clustering. We have demonstrated that, using this strategy, we are able to accurately cluster datasets from crystals that have subtle differences.

KAMO once and clusters the data using intensities only. If starting from incomplete datasets, one applies KAMO twice, first using cell parameters. In this step we use either the simple cell vector distance of the original Blend, or we use the more sensitive NCDist. This step tends to find clusters of sufficient size so that, when merged, each cluster is sufficiently complete to allow reflection intensities or amplitudes to be compared. One then uses KAMO again using the correlation between the reflections having a common hkl to merge clusters in a way sensitive to structural differences that may not have perturbed the cell parameters sufficiently to make meaningful clusters.
Many groups have developed effective clustering algorithms that use a measurable physical parameter from each diffraction still or wedge to cluster the data into categories which then can be merged, one hopes, to yield the electron density from a single protein form. Since these physical parameters are often largely independent from one another, it should be possible to greatly improve the efficacy of data clustering software by using a multi-stage partitioning strategy. Here, we have demonstrated one possible approach to multi-stage data clustering. Our strategy is to use unit-cell clustering until merged data is sufficiently complete then to use intensity-based clustering. We have demonstrated that, using this strategy, we are able to accurately cluster datasets from crystals that have subtle differences.

Introduction
KAMO (Yamashita, 2017) (Yamashita et al., 2018) and Blend (Foadi et al., 2013) provide particularly effective tools to automatically manage the merging of large numbers of datasets from serial crystallography.
The requirement for manual intervention in the process can be reduced by extending Blend to support additional clustering options thereby increasing the sensitivity to differences in unit cell parameters and allowing for clustering to assemble nearly complete datasets on the basis of intensity or amplitude differences. KAMO provides the necessary process-flow management infrastructure. This process flow is shown in Fig. 1. If datasets are already sufficiently complete to permit it, apply KAMO once and cluster the data using intensities only. If starting from incomplete datasets, one applies KAMO twice, first using cell parameters. In this step either the simple cell vector distance of the original Blend is used, or the more sensitive NCDist, to find clusters to merge to achieve sufficient completeness to allow intensities or amplitudes to be compared. One then uses KAMO again using the correlation between the reflections having common hkls (Assmann et al., 2016) to merge clusters in a way sensitive to structural differences that may not have perturbed the cell parameters sufficiently to make meaningful clusters.
Serial crystallography, an essential technique at x-ray free electron laser light sources, has become an important technique at synchrotrons (Giordano et al., 2012) (Liu & Hendrickson, 2013) (Rossmann, 2014) (Standfuss & Spence, 2017), especially at newer high-intensity synchrotron beamlines (Pearson & Mehrabi, 2020). The data may be organized either as XFEL-like still images or as thousands of wedges of data produced from very large numbers of crystals. The stills and wedges then need to be carefully organized into reasonably homogeneous clusters of data that can be merged for processing. This is going to be one of the common tools to assemble complete data from many partial wedges in MR, SAD, and ligand studies and to sort classes of crystals for studies of dynamics, binding, interactions, etc. KAMO includes cluster analysis based both on cell parameters and on reflection correlation coefficients. The clustering is based on Ward's Method, "Ward's minimum variance criterion minimizes the total within-cluster variance. To implement this method, at each step find the pair of clus-ters that leads to minimum increase in total within-cluster variance after merging. This increase is a weighted squared distance between cluster centers." (Wikipedia, 2022) In this paper we discuss the issues involved in improving the sensitivity of both approaches to clustering, using, as an example, 999 5 • wedges from lysozyme in four forms: • NAG: native with N-acetylglucosamine (NAG) soaked in, • benzamidine: native with benzamidine soaked in, • benzamidine plus NAG: native with both NAG and benzamidine soaked in.
• Native: no ligands As we will see, although the cell parameters are changed sufficiently to allow recognition of the NAG soak, it is difficult to filter the benzamidine soak simply on the basis of cell parameter changes, suggesting the desirability of switching from cell-based clustering to intensity-based clustering as early in the process as possible. Hence, cellbased clustering is more universal, in the sense that we can successfully apply it very early in the structure solution process, but it is less granular in the sense that it cannot see as much detail as intensity-based clustering. Cell-based clusters are less able to discriminate between similar but non-identical forms.
Two-stage clustering can be regarded as a method to reduce the data multiplicity needed to achieve a desired level of data precision. Hence, it is a continuation of previous techniques (White et al., 2012) that reduced the need for multiplicity compared to the Monte Carlo method of integration (Kirian et al., 2010), which makes no assumptions regarding crystal-to-crystal scaling (and hence relies entirely on statistical averaging to achieve data precision from millions of observations).
An alternative to consider, rather than staging the cell-based clustering first and then applying intensity-based clustering, would be to combine the two approaches in a single stage of higher dimensionality. There are two problems that would have to be addressed in order to create such a single-stage combined algorithm. First, it is not possible to apply intensity-based clustering at all without first indexing all reflections in a way that provides a common unambiguous label for each reflection in each image to correctly identify corresponding reflections with the same index in each different image. One option to satisfy this requirement is to index all diffraction images in P1 or some other low symmetry common to all images. This may require more images and more reflections than are available and, worse, because symmetry is being ignored, may bring together for merging images describing very different molecular conformations.
Second, as the number of independent parameters rises, all clustering methods work increasingly poorly due to the "curse of dimensionality" (Bellman, 1956). Combining the very-well-behaved low-dimensionality cell-based clustering with the highdimensionality intensity-based clustering can obscure the results of cell-based clustering. In most cases, unless the information to be gained from cell-based clustering is available a priori, it is more useful to process the cell-based clustering information first and then to move on to intensity-based clustering in a second stage based on the cell-based results.
For cases in which the indexing is too ambiguous to reliably start the process, the approach used in dials.cosym should be considered (Gildea & Winter, 2018), which uses the averages of intensities within images and from multiple images simultaneously with spot positions for indexing, dealing with the issue of the curse of dimensionality by using the averaging to limit the increase in dimension as much as possible (Brehm & Diederichs, 2014) (Diederichs, 2017). This is not the same as simultaneously doing full cell-based and full intensity-based clustering, which is not recommended.

Limits of conventional clustering
Since our goal was to expand the capabilities of existing clustering techniques, we began by applying a conventional clustering strategy to diffraction data from lysozyme micro-crystals containing various combinations of known small molecule binders. Microcrystals were preferred to avoid the conflation of structurally anisotropic data that has been demonstrated in larger crystals; see (Thompson et al., 2018).
Lysozyme micro-crystals suitable for acoustic harvesting (Soares et al., 2011) were grown using batch crystallization by dissolving 120 mg/ml lysozyme in 0.2M sodium acetate pH 4.6 (Hampton Research HR7-110) and combining with equal parts precipitant (10% ethylene glycol + 12% sodium chloride) (Roessler et al., 2016). The resulting slurry of 5 -10 µm crystals was divided into four aliquots. Three of the four aliquots were then equilibrated overnight with an equal volume of 0.5 M solutions of, respectively, benzamidine, NAG, and benzamidine plus NAG. These two small molecules are known to bind tetragonal lysozyme crystals (Yin et al., 2014). The fourth aliquot was diluted with an equal volume of mother liquor but contained no ligands.
The diffusion rate for benzamidine and NAG within lysozyme crystals is approximately 1µm/s (Cole et al., 2014). To prevent cross-contamination of crystals with neighboring forms, crystals could not be mixed with different forms for more than 1s before diffusion was halted by plunge cryo-cooling in LN 2 . To accomplish this, we deposited 5 µL of crystal slurry from each aliquot onto a separate agarose support (Cuttitta et al., 2015). We used acoustic sound pulses to harvest 2.5 nL of crystal slurry from each of the four lysozyme aliquots and separately positioned them on a micro-mesh (MiTeGen M3-L18SP-10) so that none of the droplets was in contact with any other. Crystal-containing droplets were threaded through small apertures to prevent cross-contamination (Foley et al., 2016). We then swept the non-crystalcontaining side of the micro-mesh against a sponge moistened with cryo-protectant (mother liquor + 20% glycerol) and, in one smooth motion, immediately cryo-cooled the micro-mesh in LN 2 . In addition to cryo-protection, this also mixed the crystals together into one contiguous field. The same procedure was repeated for a micro-mesh containing only two lysozyme forms, benzamidine plus NAG and native. Serial diffraction data were then obtained in 5 degree wedges from approximately 100 crystals on each micro-mesh.
The software package KAMO was then used in default configuration (in which data are clustered based only on unit cell similarity) to partition the diffraction data from micro-meshes containing the four lysozyme forms into four clusters, and the diffraction data from micro-meshes containing two lysozyme forms into two clusters. Each cluster of data was merged and then phased using the known structure of lysozyme. Subsequently the atomic model was refined using refmac (Winn et al., 2003), and an omit difference map was examined using coot in the region where the ligands were expected to bind to the protein surface (Emsley & Cowtan, 2004). The omit difference map was contoured at 1.5 σ and displayed using Pymol (DeLano, 2002). The omit maps calculated from the four-way clustering data were not observed to match closely any of the four lysozyme forms known to have been acoustically harvested onto the micro-meshes (data not shown). We concluded from this result that the clustering algorithm was not sufficiently sensitive to differentiate these four classes of very similar crystals using only variations in the observed unit cell parameters. However, the omit maps calculated from the two-way clustering data were a good fit to the expected lysozyme forms (Fig. 2). We concluded from this result that the two-ligand form was sufficiently different from the native form that unit-cell-based clustering could be successful. To do the four-way split, we combined the universality advantage of cell-based clustering ( §3) with the granularity advantage of intensity-based clustering ( §4).

Clustering on Cell Parameters
Stills and wedges of very low completeness are more appropriate for cell parameter clustering, rather than reflection clustering, because pairs of images with very few commensurate reflections may still provide reasonable estimates of unit cells but not provide enough data to compute a meaningful distance between sets of reflections.
The default Blend approach to clustering on cell parameters is to use [a, b, c, α, β, γ] as a six-vector, do a principal component analysis (PCA), drop the components without significant variance, and use the Euclidean distance calculated from the remaining components. This approach does not deal as effectively with the discontinuities produced by experimental error and ambiguities in reduction (e.g. between Type I and Type II cells and near the cubic unit cells) as does the Andrews-Bernstein Niggli-Cone Distance (NCDist) algorithm (Andrews & Bernstein, 2014). NCDist allows slightly larger clusters of truly similar datasets to be formed, working in the space G 6 which uses Niggli reduction in the six-dimensional space consisting of the metric tensor with the last three components doubled, [a 2 , b 2 , c 2 , 2bc cos(α), 2ac cos(β), 2ab cos(γ)].
In our test case of 999 datasets of lysozyme with NAG and benzamidine soaks 998 clusters are found with completeness ranging from 40% to 100%. The top levels of the two dendrograms are shown in Figs. 3 and 4. The clusters are labelled by Linear Cell Variation (LCV), which "measures the maximum linear increase or decrease of the diagonals on the three independent cell faces. Values below 1% in general indicate a good degree of isomorphism among different crystals. Structural differences start to be noticeable with LCV greater than 1.5%. A value in angstroms associated to LCV is provided by the absolute Linear Cell Variation (aLCV)." (Foadi et al., 2013) Note the smaller Ward distances i.e. tighter clusters for the equivalent clusters in the latter, NCDist-based dendrogram as compared to the former.
The dendrograms are qualitatively similar but for this test data the discrimination of the clustering changes. Using the original Blend algorithm, the largest clusters that are 100% native, 100% NAG, 100% benzamidine and 100% benzamidine plus NAG contain 4, 15, 5, and 10 datasets, respectively. Using the NCDist clustering, the largest clusters that are 100% native, 100% NAG, 100% benzamidine and 100% benzamidine plus NAG contain 9, 15, 8, and 7 datasets, respectively. This provides a better base for switching over from cell clustering to reflection clustering; half of the 100% pure clusters are larger using NCDist.

Clustering on Reflections
In a regime of high completeness, say 90%, different datasets can have enough reflections with common hkl's to generate a satisfactory similarity or distance for clustering.
If the data have been scaled, an R-value can be used as a distance, but for unscaled data the preferred approach is to use a Pearson Correlation Coefficient (CC) as a measure of similarity, i.e. having a larger value of the coefficient for sets of reflections that are similar and a smaller value of the coefficient for sets of reflections that are dissimilar. The Pearson Correlation Coefficient is essentially the cosine of the angle between vectors of data. The lack of common scaling is dealt with by subtracting the mean (µ) of each vector from each component and dividing by the norm (||..||) of each to get two unit length vectors. Recall that the dot product (·) of two vectors is equal to the product of the norms of the two vectors times the cosine of the angle between them.
CC(data_set 1 , data_set 2 ) = vec 1 · vec 2 ||vec 1 || ||vec 2 || In order to extend the range of applicability of CC, we convert it to a distance, This choice for SFDist allows the distance to be adjusted to account for the greater uncertainty in cases where a pair of data sets has few common reflections, less than 90%, for example, by adding a penalty to the distance for each unmatched reflection. The impact of using clustering on reflections for larger clusters can be seen by looking at how well-represented reasonably pure clusters are. In Figs. 11 and 12, we have represented the purity of native, NAG, benzamidine, and benzamidine plus NAG species using NCDist and SFDist.

Impact of choices in clustering
The extreme variations in the SFDist results suggest two important lessons.
• It is best to use reflection-based clustering starting from datasets that are small enough to still be likely to be pure species, i.e. use cell-based clustering only just far enough to reach sufficient completeness that the intensity-based clustering can be handled.
• It is not necessarily desirable to continue clustering to the largest of the "best" possible clusters. Smaller clusters of sufficient quality for processing are more likely to be pure species.

Discussion
Because micro-crystals are expected to react quickly and uniformly to changes in their environment, serial crystallography is a desirable tool for examining the plasticity with which protein crystals respond to external perturbations. In some cases the external perturbation can be physical, such as conformational changes induced by light (Young et al., 2016). In other cases proteins are perturbed by chemical means (Fromme, 2015).
Often it is not possible to to draw a sharp boundary between diffraction images from different protein forms without the assistance of some type of clustering software. In response to this, many groups have developed effective clustering algorithms that use a measurable parameter from each diffraction still or wedge to cluster the data into categories which can then be merged to, hopefully, yield the electron density from a single protein iso-form. Examples of measurable parameters that have been used for this purpose include unit cell dimensions (Foadi et al., 2013) (Zeldin et al., 2015), and diffraction intensities (Assmann et al., 2016) (Diederichs, 2017). What is striking about many of these physical parameters is that they are largely independent from one another.
Consequently, it should be possible to greatly improve the efficacy of data clustering software by combining quasi-independent information in a multi-stage parti-tioning strategy (as presented here). An alternative one might consider in some cases would be to combine the same data in a single higher-dimensional (more independent parameters) stage. However, all clustering methods work increasingly poorly as the number of independent parameters rises due to the "curse of dimensionality" (Bellman, 1956). Combining very well-behaved low-dimensionality cell-based clustering with high-dimensionality intensity-based clustering gives up the advantage of the reliability of cell-based clustering. In most cases, unless all the information to be gained from cell-based clustering is available a priori, it is probably best to take advantage of that information first and then move on to intensity-based clustering in a second stage, as we have done here.
We have demonstrated one possible approach to multi-stage data clustering. Our strategy was to use unit-cell-based clustering until merged data was of sufficient completeness to then use intensity-based clustering. We have demonstrated that, using this strategy, we were able to accurately cluster datasets from crystals that had subtle differences. Certainly if one is dealing with a case in which the "correct" symmetry and indexing of all reflections are known for all images, it makes sense to do only intensity-based clustering, but in the general case doing cell-based clustering first make sense.

Availability of Data
The HKL structure factor files and BLEND clustering data files used for the final

A.1. Electron Counting of Ligand Occupancies
The occupancy of benzamidine (BEN) and N-acetylglucosamine (NAG) ligands deduced from the integrated electron density in the volume occupied by each ligand is shown in Fig. A1. The data clusters 3 (Fig. A4), 28 (Fig. A5), 43 (Fig. A6), and 62 (Fig. A7) were obtained using two-stage clustering (BLEND NCDist followed by BLEND SFDist ), as described in §4. Electron density maps were placed on an absolute scale using iterative density modification (Soares & Caspar, 2017)

A.2. Data Reduction and Structure Solving Statistics
The data reduction and structure solving statistics are given in Tables A1, A2 and A3. Because the largest clusters merge structures with significantly different ligands some of the R-factors degrade rather than improve. This is only to be expected. See (Kleywegt & Jones, 1997) for more information on R f ree and R work . Fig. 1. Process flow in use of KAMO and Blend. In the case of the four-way clustering discussed in §3 and §4, a total of 896 data sets were input to the first stage NCDist clustering engine and a total of 107 data sets were input to the second stage SFDist clustering engine (first and second rectangles in Fig. 1).

Fig. 2.
Electron density maps calculated after two-way clustering of diffraction data obtained from micro-meshes that contained a mixture of doubly-bound crystals (benzamidine plus NAG; Fig 2A) and native crystals (no ligands; Fig. 2B). The omit difference maps are contoured at 1.5 σ in the region expected to contain benzamidine (top) and NAG (bottom). The histogram cluster in Fig 2C represents the unit cell dimensions of the cluster of crystal datasets that yielded the omit difference map shown in A. Similarly, the histogram cluster on the right in C represents the unit cell dimensions of the cluster of crystal data shown in B. Clearly the clustering algorithm was able to accurately partition the data for this simple two-way split. See A.1 in Supplementary Materials. Fig. 3. This dendrogram presents the top levels of Blend clustering using the original less-sensitive Blend cell-parameter distance function. The numbers are the LCV and the aLCV, with the aLCV in parentheses. Fig. 4. This dendrogram presents the top levels of Blend clustering using the more sensitive Andrews-Bernstein Niggli-Cone-Distance (NCDist) algorithm. The numbers are the LCV and the aLCV, with the aLCV in parentheses. Clustering is guided by the progressive merging of separate clusters into larger ones, using a measure of cluster proximity known as Ward distance. This is equal to the increase of the distance variance (between each element of a cluster and its centroid) resulting from the merging of two separate clusters (Ward, 1963). Note that the Ward distances are smaller than those for the equivalent clusters in Fig. 3. Fig. 5. Omit difference map of the NAG site in cluster 28 of a two-stage clustering with KAMO using cell parameters and NCDist to reach 10% completeness and then CC clustering with SFDist. Fig. 6. Omit difference map of the NAG site in cluster 43 of a two-stage clustering with KAMO using cell parameters and NCDist to reach 10% completeness and then CC clustering with SFDist. Fig. 7. Omit difference map of the NAG site in cluster 62 of a two-stage clustering with KAMO using cell parameters and NCDist to reach 10% completeness and then CC clustering with SFDist. Fig. 8. Omit difference map of the benzamidine site in cluster 28 of a two-stage clustering with KAMO using cell parameters and NCDist to reach 10% completeness and then CC clustering with SFDist. Fig. 9. Omit difference map of the benzamidine site in cluster 43 of a two-stage clustering with KAMO using cell parameters and NCDist to reach 10% completeness and then CC clustering with SFDist. Fig. 10. Omit difference map of the benzamidine site in cluster 62 of a two-stage clustering with KAMO using cell parameters and NCDist to reach 10% completeness and then CC clustering with SFDist. Fig. 11. Color charts of the 35 largest dataset clusters for the NCDist clustering. From top to bottom the color blocks are the native soak, the benzamidine plus NAG soak, the benzamidine soak and the NAG soak. If one color reaches nearly from the bottom to the top at a given position, that cluster is a nearly pure species.

Synopsis
The effectiveness of clustering to merge datasets from large numbers of crystals in serial crystallography can be improved by combining multiple clustering techniques using cell parameterbased clustering for very incomplete sets and switching to reflection-based clustering once preliminary merging has increased completeness.