In situ single particle classification reveals distinct 60S maturation intermediates in cells

Previously, we showed that high-resolution template matching can localize ribosomes in two-dimensional electron cryo-microscopy (cryo-EM) images of untilted Mycoplasma pneumoniae cells with high precision (Lucas et al., 2021). Here, we show that comparing the signal-to-noise ratio (SNR) observed with 2DTM using different templates relative to the same cellular target can correct for local variation in noise and differentiate related complexes in focused ion beam (FIB)-milled cell sections. We use a maximum likelihood approach to define the probability of each particle belonging to each class, thereby establishing a statistic to describe the confidence of our classification. We apply this method in two contexts to locate and classify related intermediate states of 60S ribosome biogenesis in the Saccharomyces cerevisiae cell nucleus. In the first, we separate the nuclear pre-60S population from the cytoplasmic mature 60S population, using the subcellular localization to validate assignment. In the second, we show that relative 2DTM SNRs can be used to separate mixed populations of nuclear pre-60S that are not visually separable. 2DTM can distinguish related molecular populations without the need to generate 3D reconstructions from the data to be classified, permitting classification even when only a few target particles exist in a cell.

2021). However, the assignment of states is unreliable for similar structures that can only be distinguished using high-resolution detail, and statistical approaches to quantitatively assess classification results are lacking. Machine learning has been employed for particle classification in tomograms, but currently only performs as well as a human operator (Moebel et al., 2021). While machine learning algorithms performed better than 3D template matching at molecule localization in tomograms, classification remained challenging for all algorithms (Gubins et al., 2020). In situ molecule classification, therefore, remains a major challenge.

We recently described an alternate method to locate particles that may improve structural classification in cells. By using 2D cryo-EM images, rather than tomograms, and fine-grained, high-resolution template matching (2DTM), specific particles can be located in cells with high precision using their atomic structures (Lucas et al., 2021; Rickgauer et al., 2020, 2017). 2DTM uses molecular models, from in vitro structure determination or in silico structure prediction (e.g., AlphaFold2 (Jumper et al., 2021)), to generate a 3D density.
This 3D density (hereafter referred to as the template) is then projected in 2D along millions of orientations. A pixel-wise cross-correlation of the 2D projections with a high-resolution 2D cryo-EM image is performed, yielding a 2DTM signal-to-noise ratio (SNR) at every pixel location (Rickgauer et al., 2017). The 2DTM SNR values are subjected to a significance test, which identifies peaks with a desired level of confidence (Lucas et al., 2021; Rickgauer et al., 2017). In the following, we refer to targets passing this test as significant targets (Lucas et al., 2021; Rickgauer et al., 2017).

The 2DTM SNR is proportional to template mass and negatively affected by non-matching elements between template and target (Lucas et al., 2021; Rickgauer et al., 2020, 2017). We have shown that a template generated from a Bacillus subtilis 50S large ribosomal subunit was able to detect 50S in 2D cryo-EM images of Mycoplasma pneumoniae cells, but with a lower average 2DTM SNR compared to a M. pneumoniae 50S template (Lucas et al., 2021). This demonstrated that (1) 2DTM using partially matching templates can be sufficiently sensitive to yield significant targets and (2) the mean 2DTM SNR of detected targets provides a read-out of the relative similarity between different templates and populations of particle species.

In this study, we investigate whether the ratio of 2DTM SNRs obtained using different templates can be used to identify the template that more closely resembles the cellular target, and thereby classify particles in cells.
As a model system, we chose to examine the late stages of 60S ribosomal subunit biogenesis in the yeast Saccharomyces cerevisiae because (1) intermediates are of a similar size and share significant structure with one another, making them difficult to separate at low resolution, (2) molecular models spanning multiple late intermediate states

To evaluate the utility of 2DTM to locate molecules in FIB-milled lamellae, we collected 28 2D cryo-EM images of the nuclear periphery of lamellae generated from actively growing

To assess the specificity of 60S detection, we identified regions of the images corresponding to the cytoplasm, nucleus and vacuole by visual inspection. Consistent with the expected high specificity of 2DTM, we did not observe any significant mature 60S-detected targets in regions of the image corresponding to the vacuole (Figure 1C-D). In contrast, 229 mature 60S-detected targets localized to the nucleus, representing ~5% of all mature 60S identified targets in these images (Figure 1C-D). In regions of the images corresponding to the cytoplasm we observe a median density of ~6500 60S/µm³, which, assuming an average cell volume of ~42 µm³ of which ~65% is cytoplasm, corresponds to a total of ~180,000 60S/cell (Figure 1G). This is consistent with prior estimates of 187,000 ± 56,000 ribosomes per yeast cell based on rRNA concentration (von der Haar, 2008).

Beyond the subcellular distribution of mature 60S-detected targets, we also confirmed that 2DTM identified specific 60S in biologically relevant locations and orientations. The nuclear envelope (NE) is contiguous with the endoplasmic reticulum and a known site for co-translational transport of transmembrane and secretory proteins, while the vacuole is not known to be a site of translation.
We found that mature 60S-detected targets were oriented with their polypeptide exit tunnels facing the cytoplasmic surface of the NE but were depleted from within ~20 nm of the vacuole (Figure 1C,F). This indicates that the orientation of 60S identified by 2DTM is unlikely to be an artefact introduced by features of the membrane in the image. To confirm that the targets identified with the mature 60S template reflect ribosomes, we generated a 3D reconstruction using the locations and orientations of 3991 significant mature 60S-detected targets using standard single particle approaches as described previously (Lucas et al., 2021). In addition to the 60S, the 10 Å-filtered reconstruction showed density consistent with the 40S small ribosomal subunit (Figure 1H). This is consistent with many of the mature 60S-detected targets representing a population of 80S ribosomes. We conclude that 2DTM-identified locations and orientations in 2D cryo-EM images of FIB-milled lamellae reflect biologically relevant locations and orientations of ribosomes in the cell.

Relative 2DTM SNRs enable single particle classification in situ

The nuclear envelope (NE) creates a physical barrier that separates premature 60S in the nucleus from mature 60S in the cytoplasm and is easily distinguishable in many 2D images by its characteristic double membrane and by the more granular appearance of the cytoplasm vs the nucleus (e.g., Figure 1B). Our observation of a substantial population of mature 60S-detected targets in the nucleus, but not in the vacuole (Figure 1C-D), suggests that the nuclear 60S may result from cross-detection of nuclear precursors, which share part of their structure with mature 60S and therefore also produce significant correlations (Figure 2A) (Figure 2A,B), and annotated each target by its subcellular localization.
The LN 60S was chosen because it represents the most mature nuclear intermediate for which there is a structure, and which retains ribosome biogenesis factors (RBFs) that are removed during nuclear and early cytoplasmic processing (Figure 2A). Thus, we expect that (1) the similarities between the mature 60S and LN 60S structures will result in cross-detection of the respective other complex and (2) the cytoplasmic population will more closely resemble the mature 60S and the nuclear population will more closely resemble the LN 60S, resulting in a higher mature 60S / LN 60S 2DTM SNR ratio in the cytoplasm than in the nucleus. In the 28 images of the nucleus and nuclear periphery we located 1651 significant LN 60S-detected targets, of which 1382 (~84%) were cytoplasmic and 268 (~16%) were nuclear (Figure 2-figure supplement 1A). We identified more cytoplasmic than nuclear targets in 2DTM searches with both mature and LN 60S templates because (1) the cytoplasm represented a larger area of our images and (2) the concentration of 60S is expected to be higher in the cytoplasm relative to the nucleus (e.g., Delavoie et al., 2019). Only one of the significant LN 60S-detected targets localized to the vacuole, which is below the expected false positive rate and further indicates the specificity of 2DTM.

As expected from the similarity between the mature and LN 60S templates, the locations of many of the targets identified in the two searches overlap (Figure 2B,C). We aligned the two sets of coordinates using the program align_coordinates (Lucas et al., 2021). Approximately one third of the mature 60S-detected targets overlapped with LN 60S-detected targets while 92% of the LN 60S-detected targets overlapped with mature 60S-detected targets (Figure 2H).
Combining the results of both searches, only 0.5% of the cytoplasmic targets were LN 60S-detected only, compared to 30% of the nuclear targets (Figure 2I).

To gain an understanding of cell biology at molecular resolution it is necessary to be able to confidently assign particle identity to individual targets. We show above that the nuclear and cytoplasmic 60S populations were significantly different with respect to their relative similarity to the LN and mature 60S (Figure 2). We also show that classifying targets by their highest 2DTM SNR effectively separates the nuclear from the cytoplasmic population (Figure 2). However, a single threshold does not fully capture the differences between the nuclear and cytoplasmic populations, and for an individual particle the confidence of classification is unclear.

To assign a confidence in the class assignments of detected particles we developed a maximum likelihood-based approach to infer the probability of a particle deriving from one of a given number of populations. We sought to classify each of the 1531 LN and mature 60S-detected targets by their relative similarity to the LN 60S or mature 60S templates. We restricted our analysis to the targets that were detected by both templates to limit the contribution from noise. We made the initial simplifying assumptions that (1) each 60S identified more closely reflects either LN or mature 60S, i.e., the number of classes needed to describe all detected targets is two; and (2) the nuclear targets more closely resemble the LN 60S and the cytoplasmic targets more closely resemble the mature 60S.
We therefore define the prior probability that a randomly selected detected target belongs to a specific population according to the number of targets detected in the nucleus and cytoplasm, respectively (Figure

We used a maximum likelihood-based approach to model the log2(mature / LN 60S 2DTM SNR) values as a mixture of two Gaussians (Figure 3A, R² = 0.993). The fit suggests a major population that more closely reflected the mature 60S and a smaller population that more closely reflected the LN 60S (Figure 3A). Using the Gaussian distribution model (see Materials and Methods), we calculate the probability that a LN and mature 60S-detected target with a given log2(mature / LN 60S SNR) value more closely resembles the LN 60S than the mature 60S via Bayes' rule (Figure 3B-C). This analysis could easily be extended to cases where more than two templates are used in the search (see Materials and Methods). A confidence threshold of 95% assigns 27% of the nuclear targets and only ~0.2% of the cytoplasmic targets to the LN 60S population (Figure 3C). Defining a threshold at 50% classifies ~75% of the nuclear targets as LN 60S and 92% of the cytoplasmic targets as mature 60S (Figure 3C). The relative probability of each detected 60S belonging either to the LN or mature 60S population can be readily visualized (Figure 3D). This shows that the 2DTM SNR ratio can effectively delineate populations of related particles in cells with a specified confidence for each particle assignment.

Additionally, several rRNA helices on the intersubunit interface are in different conformations, specifically the L1 stalk, helix 38 and helix 89, which undergo conformational changes during maturation (Figure 4C). To identify which of these features distinguish nuclear from cytoplasmic 60S, we investigated the relative dependence of the 2DTM SNRs on the rRNA and proteins of the LN 60S template.
We generated truncated LN 60S templates containing either rRNA or protein only and calculated the change in the 2DTM SNR for each template at each target relative to the full-length template (Figure 4D). The rRNA contributed 1.5 and 1.8-fold more to the 2DTM SNR of the nuclear and cytoplasmic targets, respectively, despite comprising only 1.25-fold more of the template mass (1004 and 800 kDa, respectively), than the proteins (Figure 4D). Indeed, 60% of the cytoplasmic targets and 34% of the nuclear targets were no longer significant when searching with the proteins alone. Comparing the nuclear and cytoplasmic populations shows that the 2DTM SNR of the LN 60S-detected cytoplasmic targets is less affected by the removal of the LN 60S proteins and more strongly affected by the removal of the rRNA (Figure 4D). This shows that the LN 60S proteins contribute more to the SNR of the nuclear targets than the cytoplasmic targets and are therefore more effective at differentiating the nuclear from the cytoplasmic 60S population.

Since the LN 60S represents a late intermediate of 60S maturation in which the rRNA is almost fully folded, RBF proteins on the LN 60S account for most of the difference with the mature 60S by mass (Figure 4A-D). To confirm that the SNR difference of nuclear LN 60S-detected targets and cytoplasmic mature 60S-detected targets is primarily due to the RBF proteins, we removed the RBFs from the LN 60S template and recalculated the SNR for each target. The removal increased the 2DTM SNR ratio of the cytoplasmic targets, while decreasing the 2DTM SNR of the nuclear targets (Figure 4E), making the SNR values more similar. This is consistent with the nuclear population having these RBFs and the cytoplasmic population lacking the RBFs.
We conclude that the differentiation of detected targets using the observed 2DTM SNRs reflects biologically relevant differences between them.
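The comparison between SNR contribution and template mass made for the rRNA-only and protein-only templates can be checked with simple arithmetic. The sketch below uses only the figures quoted in the text (1004 kDa of rRNA, 800 kDa of protein, and the 1.5- and 1.8-fold SNR contributions). The variable names are illustrative, and the per-mass ratio is a derived quantity that the text does not itself report.

```python
# Masses of the two truncated LN 60S templates, as quoted in the text (kDa).
rrna_mass_kda = 1004.0
protein_mass_kda = 800.0

# Fold-excess of the rRNA contribution to the 2DTM SNR over the protein
# contribution, as quoted in the text for each population.
snr_fold = {"nuclear": 1.5, "cytoplasmic": 1.8}

# rRNA mass relative to protein mass (about 1.25-fold).
mass_fold = rrna_mass_kda / protein_mass_kda

# Derived quantity (an assumption of this sketch, not a reported result):
# the rRNA-to-protein SNR contribution after dividing out the mass difference.
per_mass_fold = {name: fold / mass_fold for name, fold in snr_fold.items()}

for population, value in per_mass_fold.items():
    print(f"{population}: rRNA contributes {value:.2f}x more SNR per unit mass")
```

In both populations the value stays above 1, which matches the statement that the rRNA contributes more to the SNR than its share of the template mass alone would suggest.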

Nog2 lacking intermediates accumulate after inhibition of nuclear export
The two largest RBFs on the LN 60S are Nog1 and Nog2, together accounting for ~50% of the RBF mass (Figure 4F,G). During 60S maturation, Nog2 removal is required to permit binding of the nuclear export adaptor Nmd3 and Crm1-dependent export, and therefore Nog2

Cross-detection of targets by different templates can be used to detect heterogeneity in target populations. When examining the SNR ratios of targets identified by both EN and LN 60S, the cytoplasmic targets display a distribution that is consistent with a single population that more closely resembles the LN 60S template (Figure 5-figure supplement 1B, red). The distribution of nuclear targets, however, was consistent with at least two populations (Figure 5-figure supplement 1B, blue), each of which is distinct from the cytoplasmic population. This indicated the presence of at least two nuclear populations that differ with respect to their relative similarity to the EN and LN 60S templates.

We next sought to classify the EN, LN and mature 60S-detected targets based on their relative similarity to the three 60S templates. For each target we calculated the log2(mature 60S / LN 60S SNR) and log2(EN 60S / LN 60S SNR) values. We used these values to classify each target based on the relative similarity to the three templates using the maximum-likelihood approach discussed above (Figure 5C). We found that, consistent with their expected subcellular distributions, targets assigned to the mature 60S population represented 315 (85%) of the cytoplasmic targets and only 1 (<1%) of the nuclear targets detected by all three templates (Figure 5D). In contrast, the EN 60S population represents 83 (70%) of the nuclear population and only 4 (~1%) of the cytoplasmic population detected with all three templates (Figure 5D).
The LN 60S population was roughly evenly distributed between the nucleus and the cytoplasm, consistent with this structure representing a late maturation intermediate (Figure 5D).

The NE provides a convenient visual control for the classification of targets as LN / EN 60S or mature 60S (e.g., Figure 1). However, there are no clear features in the nucleoplasm that would enable visual separation of different populations of nuclear intermediates and thereby confirm their classification. To validate our classification of the nuclear pre-60S populations, we identified conditions wherein the relative occupancy of the two states would be expected to change. We show above that inhibiting Crm1-mediated export results in accumulation of nuclear intermediates that lack Nog2 (Figure 4). In cells with active Crm1, 57% of the nuclear 60S targets are assigned to the EN 60S population (Figure 5E). After inhibition of Crm1-mediated export, the EN 60S population is mostly depleted, and >90% of targets are assigned to the LN 60S population (Figure 5E). This confirms that 2DTM SNR ratios can be used to effectively classify mixed populations of particles in cells.
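The two ratios used for the three-template classification can be computed per target as in the following minimal sketch. It assumes the SNR of each target under each template is already available as plain numbers. The function name `log2_snr_ratios`, the dictionary layout, and the example SNR values are hypothetical and are not taken from the study. The sketch also illustrates why a log ratio is convenient: swapping the order of comparison only flips the sign.

```python
import math


def log2_snr_ratios(snr):
    """Return (log2(mature/LN), log2(EN/LN)) for one target.

    `snr` maps a template name to the 2DTM SNR obtained with that template
    at the target's location and orientation (hypothetical layout).
    """
    return (
        math.log2(snr["mature"] / snr["LN"]),
        math.log2(snr["EN"] / snr["LN"]),
    )


# Made-up SNR values for a single target, for illustration only.
target = {"mature": 12.0, "LN": 10.0, "EN": 9.0}
mature_vs_ln, en_vs_ln = log2_snr_ratios(target)

# Reversing the order of comparison mirrors the value around 0.
ln_vs_mature = math.log2(target["LN"] / target["mature"])
assert math.isclose(ln_vs_mature, -mature_vs_ln)

print(f"log2(mature/LN) = {mature_vs_ln:+.3f}")
print(f"log2(EN/LN)     = {en_vs_ln:+.3f}")
```

A positive value means the first template in the ratio scored higher at that target, and a negative value means the LN 60S template scored higher.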

Discussion:
The immense potential for cryo-EM to reveal the molecular detail of biological processes in cells is currently largely unrealized. One of the major bottlenecks is the lack of reliable, quantitative methods to locate and characterize molecules in cells. Here we describe the application of 2DTM to in situ particle classification. By considering the relative 2DTM SNRs of alternate templates at a single location and orientation, we separate 60S precursors in the nucleus from mature 60S in the cytoplasm. We also show that a maximum likelihood approach effectively classifies a mixed population of nuclear pre-60S into at least two maturation states with a specified confidence for each particle. We show that 2DTM can be used to probe the composition of complexes in situ by modifying 2DTM templates. In this study we extend the utility of 2DTM beyond a binary indicator of detection to provide a quantitative assessment of particle identity.

In many images, 60S subunits detected by 2DTM also generate low-resolution contrast in the cytoplasm that is readily visible (Figure 1B, yellow arrows). In the nucleoplasm, the similar density of RNA and DNA impedes the visual identification of all but a few pre-60S (Figure 1B, blue arrows). However, the reduced low-resolution contrast does not preclude effective detection of pre-60S with 2DTM. This is in contrast to particle localization in tomograms, wherein detection depends more strongly on low-resolution contrast and recognizable shapes. The ability to distinguish particles in crowded molecular environments is a major advantage of 2DTM relative to cryo-ET, which currently suffers from strong attenuation of high-resolution signal

orientation can be used to calculate the relative probabilities of a target belonging to a specific particle population.
Of the nuclear targets identified with the mature 60S, ~50% were also detected with the EN 60S, all of which were also detected with the LN 60S (Figure 5B). When calculating the relative similarity to the three 60S templates, the EN 60S and mature 60S populations were clearly distinct, with mean 2DTM SNR ratios more than three standard deviations apart (Figure 5C). The maximum likelihood estimation of Gaussian distributions enables quantitative classification even when particle populations are less distinct, by yielding relative probabilities for each detected target belonging to one of a given number of populations (e.g., Figures 3 and 5).

In this study, we effectively classify at least three populations of 60S maturation states from a population of <500 molecules (Figure 5). This means that given sufficient abundance of the target, it will be possible to distinguish populations based on data from a single image (Figure 2-figure supplement 1D). This contrasts with more traditional (reference-free) methods used to classify subtomograms and single particles, which require hundreds to thousands of particles to generate the class averages needed for particle assignment. 2DTM allows single molecule classification from fewer images, and therefore enables more information to be extracted from images collected from cells and purified samples (single-particle cryo-EM).

Confidence metric for single particle classification in situ
Calculating the confidence in class assignment of individual particles will aid interpretation of the results of 2DTM in situ. One major difference between in situ cryo-EM and single-particle cryo-EM is the type of biological information that is obtained. In single-particle cryo-EM, the goal is to generate high-resolution maps and establish the arrangement of atoms within a complex in different functional states, and to use this information to discern its molecular mechanism. In this case, B-factors and other metrics can be used to indicate uncertainty about an atomic coordinate, which aids interpretation of the model built into the map. In the cell, each individual instance of a complex may be in a different context relative to other similar molecules. For example, particles might be in different subcellular compartments such as the nucleus or cytoplasm or, as a more extreme example, a single particle within a nuclear pore exists in a very different context than particles in the nucleoplasm. For structural cell biology applications, therefore, it is useful to define a metric to establish the confidence of single particle classification. In this study, we show that a maximum likelihood approach using Gaussian fits to log2 2DTM SNR ratios of alternate templates at a specific subcellular location and orientation can be used to calculate the relative probability of a single particle deriving from one of a given number of classes. This provides a quantitative metric to establish confidence in the assignment of single particles that will aid in the biological interpretation of cellular cryo-EM maps.
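As an illustration of the kind of calculation described in this section, the sketch below applies Bayes' rule to a two-component Gaussian model of the log2(mature / LN 60S SNR) values. The means, standard deviations, and priors are placeholder numbers chosen for the example, not the fitted values from Figure 3, and the function names are hypothetical. The fitting step itself is not shown.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def class_posteriors(x, classes):
    """Posterior probability of each class for an observed log2 SNR ratio x.

    `classes` maps a class name to (prior, mean, sd). Bayes' rule gives
    P(class | x) = prior * pdf(x | class) / sum of that product over classes.
    """
    joint = {
        name: prior * gaussian_pdf(x, mean, sd)
        for name, (prior, mean, sd) in classes.items()
    }
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


# Placeholder parameters (assumed for illustration, not from the paper).
classes = {
    "mature": (0.85, 0.30, 0.10),
    "LN": (0.15, -0.10, 0.12),
}

for x in (-0.20, 0.05, 0.40):
    post = class_posteriors(x, classes)
    label = max(post, key=post.get)
    # 95% is one of the confidence thresholds discussed in the text.
    note = "" if post[label] >= 0.95 else " (below 95% confidence)"
    print(f"x={x:+.2f}  P(LN)={post['LN']:.3f}  "
          f"P(mature)={post['mature']:.3f}  -> {label}{note}")
```

Because `class_posteriors` normalizes over whatever classes it is given, adding a third entry to `classes` covers a third population. With more than one ratio per target, a multivariate density would take the place of `gaussian_pdf`.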

2DTM templates as computational molecular probes
A major challenge in biological cryo-EM is the retrieval of detailed structural information of inherently flexible and heterogeneous macromolecules from noisy images collected at low dose to limit radiation damage. In single particle cryo-EM, this problem is addressed by averaging images of thousands of purified molecules to identify different structural states at high resolution. By averaging images of many identical copies of a particle, novel structures can be discovered, and this is a clear strength of this approach. However, since most complexes have a low abundance in the cell, the utility of this approach for in situ structural biology is limited to only the most abundant complexes.

2DTM presents an alternate approach to using the signal in noisy images to gain insight into the structural states of molecules. In this approach, a noise-free template represents a hypothesis that a particle of a given conformational and compositional state is present in the image, and this hypothesis can be tested by searching the image with the template, independent of how many particles the image contains. We demonstrate that by generating modified templates representing different hypotheses, we can directly assess the compositional and conformational states of ribosomal subunits in cells.

Provided the templates have similar molecular mass and shape and are aligned with each other, probing with multiple templates requires only a single initial exhaustive search with one of the templates. This can be followed by a simple evaluation of the cross-correlation coefficient for each additional template at locations and orientations of the detected targets in the initial search (Figure 4), thereby avoiding time-consuming searches for all templates. In future studies, this approach could be extended to assess the relative similarity of a target with respect to a library of alternate structures.
Alternate templates could be generated in multiple ways, depending on the biological hypothesis being tested. To reveal compositional heterogeneity in situ, alternate structures could be generated that lack specific subunits of interest as shown in Figure 4.

In the present study, we only considered three alternate 60S templates. We note that the Gaussian fits to the 2DTM SNRs of mature 60S and LN 60S-detected nuclear targets are imperfect, potentially indicating additional pre-60S populations (Figure 2-figure supplement 1C). Further examination of the observed 2DTM SNR ratios revealed the presence of at least one additional pre-60S population (Figure 5). We also observed a small population of cytoplasmic 60S targets with higher SNR values against the LN 60S template than against the mature 60S

We used the program prepare_stack_matchtemplate (Lucas et al., 2021) to generate a particle stack using the locations and orientations of the significant mature 60S-detected targets after refinement as described above. We then used cisTEM to generate a 3D reconstruction from 3991 mature 60S targets detected in 28 images of the nuclear periphery, only including targets with a 2DTM SNR of >8. The reconstruction had a nominal resolution of 3.5 Å using a Fourier Shell Correlation (FSC) threshold of 0.143 (Figure 1-figure supplement 1D) (Rosenthal and Henderson, 2003) that is expected to overestimate the resolution due to overfitting (Lucas et al., 2021). To limit the noise due to overfitting, we low-pass filtered the reconstruction to 10 Å, representing an FSC of 0.9.

Calculating 2DTM SNR values and ratios of SNR values
Targets identified in two or more searches with aligned templates were identified using the program align_coordinates (Lucas et al., 2021). The 2DTM SNRs of targets identified in two or more searches were compared by taking the log2 of the SNR ratio. The log2 was used in place of the direct ratio because the shape of the distribution is independent of the order of comparison, except for a mirror around 0, while the distribution of the direct ratios shows more complicated