cryoSPHERE: Single-particle heterogeneous reconstruction from cryo EM

The three-dimensional structure of a protein plays a key role in determining its function. Methods like AlphaFold have revolutionized protein structure prediction based only on the amino-acid sequence. However, proteins often appear in multiple different conformations, and it is highly relevant to resolve the full conformational distribution. Single-particle cryo-electron microscopy (cryo EM) is a powerful tool for capturing a large number of images of a given protein, frequently in different conformations (referred to as particles). The images are, however, very noisy projections of the protein, and traditional methods for cryo EM reconstruction are limited to recovering a single, or a few, conformations. In this paper, we introduce cryoSPHERE, a deep learning method that takes as input a nominal protein structure, e.g. from AlphaFold, learns how to divide it into segments, and how to move these as approximately rigid bodies to fit the different conformations present in the cryo EM dataset. This formulation is shown to provide enough constraints to recover meaningful reconstructions of single protein structures. This is illustrated in three examples where we show consistent improvements over the current state-of-the-art for heterogeneous reconstruction.


Introduction
Single-particle cryo-electron microscopy (cryo EM) is a powerful technique for determining the three-dimensional structure of biological macromolecules, including proteins. In a cryo EM experiment, millions of copies of the same protein are first frozen in a thin layer of vitreous ice and then imaged using an electron microscope. This yields a micrograph: a noisy image containing 2D projections of individual proteins. The protein projections are then located on this micrograph and cut out, so that an experiment typically yields 10^4 to 10^7 images of size N_pix × N_pix of individual proteins, referred to as particles. Our goal is to reconstruct the possible structures of the proteins given these images.

1 Linköping University 2 Uppsala University. Correspondence to: Gabriel Ducrocq <gabriel.ducrocq@liu.se>. Preliminary work.
Frequently, proteins are conformationally heterogeneous and each copy represents a different structure. Conventionally, this information has been discarded, and all of the sampled structures were assumed to be in only one or a few conformations (homogeneous reconstruction). Here, we would like to recover all of the structures in a heterogeneous reconstruction.
Structure reconstruction from cryo EM presents a number of challenges. Firstly, each image shows a particle in a different, unknown orientation. Secondly, because of the way the electrons interact with the protein, the spectrum of the images is flipped and reduced. Mathematically, this corresponds to a convolution of each individual image with the Point Spread Function (PSF). Thirdly, the images typically have a very low signal-to-noise ratio (SNR), usually around 0.1, see e.g. Baxter et al. (2009).
For all these reasons, it is very challenging to perform de novo cryo EM reconstruction. Standard methods produce electron densities averaged over many, if not all, conformations (Scheres, 2012; Punjani et al., 2017), performing discrete heterogeneous reconstruction. More recent methods attempt to extract continuous conformational heterogeneity, e.g. by imposing constraints on the problem through an underlying structure deformed to fit the different conformations present in the dataset, see e.g. Rosenbaum et al. (2021); Zhong et al. (2021b); Li et al. (2023). AlphaFold (Jumper et al., 2021) and RosettaFold (Baek et al., 2021) can provide such a structure based on the primary sequence of the protein only. In spite of this strong prior, it is still difficult to recover meaningful conformations. The amount of noise and the fact that we observe only 2D projections create local minima that are difficult to escape (Zhong et al., 2021b; Rosenbaum et al., 2021), leading to unrealistic conformations.
To remedy this, we root our method in the observation that different conformations can often be explained by large-scale movements of domains of the protein (Mardt et al., 2022). Specifically, we develop a variational auto-encoder (VAE) (Kingma & Welling, 2014) that, from a nominal structure and a set of cryo EM images:

• Learns how to divide the amino-acid chain into segments, given a user-defined maximum number of segments; see Figure 2. The nominal structure can for instance be obtained by AlphaFold (Jumper et al., 2021).

• For each image, learns approximately rigid transformations of the identified segments of the nominal structure, which effectively allows us to recover different conformations on an image-by-image (single-particle) basis.

These two steps happen concurrently, and the model is end-to-end differentiable. The model is illustrated in Figure 1.
Note that what we call a segment is conceptually different from a domain in the structural biology sense. The domains of a protein play a pivotal role in diverse functions, engaging in interactions with other proteins, DNA/RNA, or ligands, while also serving as catalytic sites that contribute significantly to the overall functionality of the protein, see e.g. Schulz & Schirmer (1979); Nelson et al. (2017). By comparison, the segments we learn do not necessarily have a biological function. However, although this is not strictly necessary for the method to work, the experiments in Section 4 show that our VAE often recovers the actual domains involved in the conformational changes.
The implementation of the model is available on GitHub¹.

Notations and problem formulation
In what follows, S = {r_i}_{i=1,...,R_res} denotes a protein made of a number R_res ∈ N* of residues r_i. Each residue is itself a set of atoms: r_i = {a_k}_{k=1,...,A_i}, where A_i ∈ N* is the number of atoms present in residue i. We can equivalently define the structure S = {a_k}_{k=1,...,A_total} in terms of its atoms, where A_total ∈ N* is the total number of atoms in the structure. For each atom a_k, k ∈ {1, ..., A_total}, we keep track of its coordinates in R^3 and its atom type. In this paper, we limit ourselves to the backbone of the protein: we consider only the C_α, C_β and N atoms.
The electron density map of a structure S, also called a volume, is a function V_S : R^3 → R, where V_S(x) represents the probability density function of an electron of S being present, evaluated at x ∈ R^3. That is, if B ⊆ R^3, the probability that the structure S has an electron in B is given by ∫_B V_S(x) dx.
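Concretely, evaluating such a density amounts to summing one Gaussian contribution per atom, as made explicit in the image formation model later in the paper. A minimal NumPy sketch of this evaluation (function name and normalization are illustrative, not taken from the cryoSPHERE code; here the density is unnormalized, i.e. it integrates to the number of atoms):

```python
import numpy as np

def density(x, atom_coords, sigma=1.0):
    """Evaluate V_S(x) as a sum of isotropic 3D Gaussian p.d.f.s,
    one centered on each atom of the structure S."""
    d2 = np.sum((np.asarray(atom_coords) - np.asarray(x)) ** 2, axis=-1)
    return float(np.sum(np.exp(-d2 / (2 * sigma**2))) / (2 * np.pi * sigma**2) ** 1.5)
```

Dividing by the number of atoms would turn this into a proper probability density, matching the interpretation above.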
Assume we have a set of 2D images {I_i}_{i=1,...,N} of size N_pix × N_pix, representing 2D projections of different copies of the same protein in different conformations. Traditionally, the goal of cryo EM heterogeneous reconstruction has been to recover, for each image i, the electron density map V_i corresponding to the underlying conformation present in image i. See Subsection 2.2 for a review of these methods. However, following recent works, e.g., Rosenbaum et al. (2021); Zhong et al. (2021b), we aim at recovering, for each image i, the underlying structure S_i explaining the image. That is, we try to recover the precise position in R^3 of each residue.
Since the different conformations of large proteins can often be explained by large-scale movements of their domains (Mardt et al., 2022), our method learns to decompose the amino-acid chain of the protein into segments and, for each image I_i, to rigidly move the learnt segments of a base structure S_0 to match the conformation present in that image, as explained in Section 1.
AlphaFold (Jumper et al., 2021) and RosettaFold (Baek et al., 2021) can provide such a structure S_0, based on the amino-acid sequence of the protein at hand. In Section 4, we further fit the AlphaFold-predicted structure into a volume recovered by a custom backprojection algorithm provided by Zhong et al. (2020).

Related work
Two of the most popular methods for cryo EM reconstruction, which are not based on deep learning, are RELION (Scheres, 2012) and cryoSPARC (Punjani et al., 2017). Both methods perform volume reconstruction: they hypothesize that k conformations are present in the dataset and perform maximum a posteriori estimation over the k density maps, thus performing discrete heterogeneous reconstruction. Both of these algorithms operate in Fourier space using an expectation-maximization algorithm (Dempster et al., 1977) and are non-amortized: the poses are learnt for each image, and adding new images to the dataset would require a new run of the algorithm. Other approaches perform continuous heterogeneous reconstruction. For example, 3DVA (Punjani & Fleet, 2021b) uses a probabilistic principal component analysis model to learn a latent space, while E2GMM (Chen & Ludtke, 2021) parameterizes the volume as a mixture of N Gaussian p.d.f.s and learns their means conditionally on an input image.
Another class of methods involves deep learning and typically performs continuous heterogeneous reconstruction using a VAE architecture. Some of them attempt to reconstruct a density map, while others target the atomic structures, or a backbone, directly. For example, cryoDRGN (Zhong et al., 2020; 2021a) and CryoAI (Levy et al., 2022) use a VAE acting on Fourier space to learn a latent space and a mapping that associates a 3D density map with each latent variable. They perform non-amortized and amortized inference over the poses, respectively. Other methods are defined in the image space, e.g. 3DFlex (Punjani & Fleet, 2021a) and cryoPoseNet (Nashed et al., 2021); both perform non-amortized inference over the poses. These methods either learn, for a given image I_i, the values {V_i(x_k)} at a set of N_pix^3 fixed 3D coordinates {x_k}, representing the volume on a grid (explicit parameterization), or an actual function V_i : R^3 → R in the form of a neural network that can be queried at chosen coordinates (implicit parameterization). The implicit parameterization requires smaller networks but needs expensive N_pix^2 and N_pix^3 evaluations for image and volume generation, respectively.
Other deep learning methods attempt to directly reconstruct structures instead of volumes and share a common process: starting from a plausible base structure, obtained with e.g. AlphaFold (Jumper et al., 2021), for each image they move each residue of the base structure to fit the conformation present in that specific image. These methods differ in how they parameterize the structure and in the prior they impose on the deformed structure or on the motion of the residues. For example, AtomVAE (Rosenbaum et al., 2021) considers only residues and penalizes distances between two subsequent residues that deviate too much from an expected value. CryoFold (Zhong et al., 2021b) considers the residue centers and their side-chains, and also imposes a loss on the distances between subsequent residues and on the distances between the residue centers and their side-chains. Unfortunately, due to the high level of noise and the fact that we observe only projections of the structures, these "per-residue transformation" methods tend to get stuck in local minima, yielding unrealistic conformations unless the base structure is taken from the distribution of conformations present in the images (Zhong et al., 2021b), limiting their applicability to real datasets. Even though AtomVAE (Rosenbaum et al., 2021) could roughly approximate the distribution of states of the protein, it was not able to recover the conformation given a specific image. In addition, to our knowledge, the code for these two methods is not available, and they are non-trivial to reimplement. This prevents a comparison of our results with AtomVAE and CryoFold. To reduce the bias that the base structure brings, Schwab et al. (2023) recently proposed to fit pseudo-atoms in a consensus map with a neural network directly. Similar to our work, several other methods constrain the atomic model to rigid-body motions. For example, Chen et al. (2023), starting from a base structure S_big fitted in the consensus map (called the big GMM), learn a smaller representation structure S_small of this consensus map using the method described in Schwab et al. (2023). They subsequently learn a translation for each residue of S_small; these translations contribute to translating the residues of S_big, weighted by how close they are to the residues of S_small. This can be thought of as our segmentation GMM, except that it takes place in 3D space and only the centers of the Gaussian modes can be learnt, while ours is fitted on the residue indices and all of the GMM parameters are learnable. In addition, they do not use this segmentation to perform rigid-body motions. Instead, in a subsequent step, the user is in charge of identifying the rigidly moving segments of the protein by applying one or several masks to the protein. This is in contrast to our method, where the segmentation used to perform rigid-body motion is learnt by the network trying to move the domains to fit the images. Finally, their method involves a sequence of optimization problems, with potentially different architectures, while ours is end-to-end differentiable. Using the method in Schwab et al. (2023), Chen et al. (2024) develop a focused refinement on patches of the GMM representation of the protein. These patches are learnt using k-means on the locations of the residues and do not depend on the different conformations in the dataset. This is in contrast to cryoSPHERE, which learns segments that are tightly linked to the conformational changes. Concurrently to our work, Li et al. (2023) developed cryoSTAR, which acts on each residue and enforces local rigidity of the motion of the protein by imposing a similarity loss between the base structure and the deformed structure. The interested reader can see Donnat et al. (2022) for an in-depth review of deep learning methods for cryo EM reconstruction.
Reconstruction methods relying on an atomic model, such as cryoSTAR, DynaMight or cryoSPHERE, offer the user the possibility to provide prior information via this atomic model. They also offer the possibility of deforming the protein according to chemical force fields. This is not the case for methods performing volume reconstruction without such an atomic model.

Method -cryoSPHERE
In this section, we present our method for single-particle heterogeneous reconstruction, denoted cryoSPHERE. The method focuses on structure instead of volume reconstruction. It differs from the previous works along this line in the way the movements of the residues are constrained: instead of deforming the base structure on a residue level and then imposing a loss on the reconstructed structure, it moves parts of the protein approximately rigidly by construction.
We use a typical VAE architecture; see Figure 1 for a depiction of how the information flows in our network. We map each image to a latent variable by a stochastic encoder. The latent variable is then passed to a decoder, which outputs a transformation (rotation and translation) per segment. Based on these transformations and the segment decomposition, the underlying structure S_0 is deformed and then turned into a volume. The volume is projected into a 2D image, according to its pose, which can be estimated using existing software, e.g. (Scheres, 2012; Punjani et al., 2017). Finally, this image is convolved with the PSF to give the reconstructed image that is compared to the input image. This completes the forward pass. In the backward pass, the parameters of the encoder, decoder and Gaussian mixture are updated with gradient descent based on the evidence lower bound (see Section 3.2). In the following, we describe the details of our network: the encoder and decoder, the segment decomposition, how to turn a structure into a 2D projection of its electron density, and an efficient way to compare the true and predicted images.

Image formation model
To compute the 2D projection of the protein structure S, we first estimate its 3D electron density map V:

V(r) = Σ_{k=1}^{A_total} φ_3(r | c_k, σ² I_3),    (1)

where r ∈ R^3, c_k ∈ R^3 is the position of atom a_k, φ_3(· | c, σ² I_3) is the three-dimensional Gaussian p.d.f. with mean c and covariance σ² I_3, and σ is a positive real number. In this definition, the protein's electron density is approximated as the sum of one Gaussian p.d.f. centered on each atom. We use a fixed value of σ = 1, following Rosenbaum et al. (2021).
From these density maps, we then compute an image projection I ∈ R^{N_pix × N_pix} as

I(r_x, r_y) = ∫_R V_{RS+t}(r_x, r_y, r_z) dr_z,    (2)

where (r_x, r_y) ∈ R^2 are the coordinates of a pixel, r_z ∈ R is the coordinate along the z axis, R ∈ SO(3) is a rotation matrix and t ∈ R^3 is a translation vector. The abuse of notation RS + t means that every atom of S is rotated according to R and then translated according to t. The image is finally convolved with the point spread function (PSF) g, which in Fourier space corresponds to the contrast transfer function (CTF), see Vulović et al. (2013).
In practice, we avoid numerically approximating the integral in Equation (2). Since the density map is a sum of Gaussian p.d.f.s, the integral can be computed exactly by marginalizing out the last coordinate: the projection of an isotropic 3D Gaussian onto the image plane is a 2D Gaussian with the same standard deviation. This significantly reduces the computing time.
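This analytic projection is easy to sketch in NumPy: integrating N(c, σ²I_3) over z leaves the 2D Gaussian N((c_x, c_y), σ²I_2), so the image is a sum of 2D Gaussians evaluated on the pixel grid. Function name and grid conventions below are our own, not from the cryoSPHERE code:

```python
import numpy as np

def project_density(coords, sigma, n_pix, extent):
    """Analytic 2D projection of a sum of isotropic 3D Gaussians.

    `coords` holds the (already rotated and translated) atom positions,
    shape (A, 3). No numerical quadrature over z is needed: each atom
    contributes a 2D Gaussian centered at its (x, y) coordinates.
    """
    grid = np.linspace(-extent, extent, n_pix)
    # Separable evaluation: unnormalized 1D Gaussians along x and y.
    gx = np.exp(-((grid[None, :] - coords[:, 0:1]) ** 2) / (2 * sigma**2))
    gy = np.exp(-((grid[None, :] - coords[:, 1:2]) ** 2) / (2 * sigma**2))
    norm = 1.0 / (2 * np.pi * sigma**2)
    # image[i, j] = sum over atoms k of norm * gy[k, i] * gx[k, j]
    return norm * np.einsum("ki,kj->ij", gy, gx)
```

The separable `einsum` keeps the cost at O(A · N_pix) Gaussian evaluations plus one contraction, instead of O(A · N_pix²) direct evaluations.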

Maximum likelihood with variational inference
To learn a distribution of the different conformations, we hypothesize that the conformation seen in image I_i depends on a latent variable z_i ∈ R^L, with prior p(z_i). Let f_θ(S_0, z) be a function which, for a given base structure S_0 and latent variable z, outputs a new transformed structure S. This function depends on a set of learnable parameters θ. Then, the conditional likelihood of an image I* ∈ R^{N_pix × N_pix} with a pose given by a rotation matrix R and a translation vector t is modeled as a Gaussian centered on the rendered projection,

p_θ(I* | z) = N(I* | g * Π_{R,t}(f_θ(S_0, z)), σ²_noise Id),

where Π_{R,t} denotes the projection of Equation (2) with pose (R, t) and σ²_noise is the variance of the observation noise. The marginal likelihood is thus given by

p_θ(I*) = ∫ p_θ(I* | z) p(z) dz.    (3)

In practice, the pose (R, t) of a given image is unknown. However, following similar works (Zhong et al., 2021b; Li et al., 2023), we suppose that we can estimate R and t to sufficient accuracy using off-the-shelf methods (Scheres, 2012; Punjani et al., 2017).
Directly maximizing the likelihood (3) is infeasible because one needs to marginalize over the latent variable. For this reason, we adopt the VAE framework, conducting variational inference on p_θ(z | I*) ∝ p_θ(I* | z) p(z), and simultaneously performing maximum likelihood estimation of the parameters θ.
Let q_ψ(z | I*) denote an approximate posterior distribution over the latent variables. We can then maximize the evidence lower bound (ELBO)

ELBO(θ, ψ) = E_{q_ψ(z|I*)}[log p_θ(I* | z)] − D_KL(q_ψ(z | I*) ∥ p(z)),    (4)

which lower bounds the log-likelihood log p_θ(I*). Here D_KL denotes the Kullback-Leibler (KL) divergence. In this framework, f_θ is called the decoder and q_ψ(z | I*) the encoder.

Segment decomposition
We fix a maximum number of segments N_segm ∈ {1, ..., R_res} and represent the decomposition of the protein by a stochastic matrix G ∈ R^{R_res × N_segm}. Row i of G represents how much residue i belongs to each segment, and our objective is to ensure that each residue mostly belongs to one segment, that is:

for each residue i, G_im ≈ 1 for one segment m and G_im' ≈ 0 for all m' ≠ m.    (5)

We also aim for the segments to respect the sequential structure of the amino-acid chain, and for the model to be end-to-end differentiable. Without end-to-end differentiability, we could not apply the reparameterization trick and we would have to resort to Monte Carlo estimation of the gradient of the segments, which has a higher variance, see e.g. Mohamed et al. (2019).
To meet these criteria, we fit a Gaussian mixture model (GMM) with N_segm components on the real line supporting the residue indices, that is, on R. Each component m has a mean μ_m, a standard deviation σ_m and an unnormalized weight α_m.
The {α_m} are passed through a softmax to obtain the weights {π_m} of the GMM, ensuring that they are positive and sum to one. We define the probability that residue i belongs to segment m as the probability that index i belongs to mode m, annealed by a temperature τ ∈ R^{+*}:

G_im = [π_m φ(i | μ_m, σ_m²)]^τ / Σ_{l=1}^{N_segm} [π_l φ(i | μ_l, σ_l²)]^τ,

where φ(x | μ, σ²) is the unidimensional Gaussian probability density function with mean μ and variance σ², and τ is a fixed hyperparameter. If τ is sufficiently large, we can expect condition (5) to be verified. See Figure 2 for an example of a segment decomposition using a Gaussian mixture. In this "soft" decomposition of the protein, each residue can belong to more than one segment, allowing for smooth deformations. In addition, the differentiable architecture is amenable to gradient descent methods, and a well-chosen τ can approximate a "hard" decomposition of the protein.
We set τ = 20 in the experiment section. In our experience, this segmentation procedure is very robust to different initializations and converges in only a few epochs.
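The construction above can be sketched directly: softmax the weights, compute annealed GMM responsibilities over the residue indices, and normalize each row. The exact annealed form is our reading of the text, not code from the released implementation:

```python
import numpy as np

def segmentation_matrix(mu, sigma, alpha, n_res, tau=20.0):
    """Soft segmentation matrix G (n_res x n_segm) from a 1D GMM
    over residue indices, annealed with temperature tau so that a
    large tau approaches a hard one-segment-per-residue assignment."""
    mu, sigma, alpha = map(np.asarray, (mu, sigma, alpha))
    pi = np.exp(alpha - alpha.max())
    pi /= pi.sum()                                   # softmax weights
    idx = np.arange(n_res)[:, None]                  # residue indices
    # log of pi_m * phi(i | mu_m, sigma_m^2), in log space for stability
    log_resp = (np.log(pi)[None, :]
                - 0.5 * np.log(2 * np.pi * sigma**2)[None, :]
                - (idx - mu[None, :]) ** 2 / (2 * sigma[None, :] ** 2))
    log_resp *= tau                                  # temperature annealing
    log_resp -= log_resp.max(axis=1, keepdims=True)
    G = np.exp(log_resp)
    return G / G.sum(axis=1, keepdims=True)          # rows sum to one
```

With well-separated component means and τ = 20, each row of G is essentially one-hot, which is exactly condition (5).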

Decoder architecture
The decoder describes the distribution of the images given the latent variables. The latent variables include:

1. One latent variable z_i ∈ R^L per image, parameterizing the conformation.

2. The parameters {μ_m, σ_m, α_m} of the Gaussian mixture defining the segment decomposition, shared across all images.
Given these latent variables and a base structure S_0, we parameterize the decoder f_θ in three steps. First, the latent variable z is decoded, through a neural network, into one transformation per segment: a rotation with axis ⃗ϕ_m and angle δ_m, and a translation u_m, for m ∈ {1, ..., N_segm}. Second, given the parameters of the GMM, we compute the matrix G. Finally, for each residue i of S_0, we update the coordinates of all its atoms {a_ik}_{k=1,...,A_i}:

1. a_ik is successively rotated around the axis ⃗ϕ_m with an angle G_im δ_m for m ∈ {1, ..., N_segm}, to obtain updated coordinates a'_ik.

2. a'_ik is then translated according to

a''_ik = a'_ik + Σ_{m=1}^{N_segm} G_im u_m.

This way, the rotation angles and translations for a residue incorporate contributions from all segments, proportionally to how much the residue belongs to each segment. If condition (5) is met, a roughly rigid motion of each segment can be expected.
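The deformation step can be sketched with Rodrigues' rotation formula: each coordinate is rotated successively about every segment axis by its G-weighted angle, then shifted by the G-weighted sum of segment translations. Function and argument names below are ours, not from the released code:

```python
import numpy as np

def deform(coords, G, axes, angles, trans):
    """Segment-weighted rigid deformation of a structure.

    coords: (n, 3) atom positions; G: (n, n_segm) segmentation matrix;
    axes: (n_segm, 3) rotation axes; angles: (n_segm,) rotation angles;
    trans: (n_segm, 3) segment translations.
    """
    out = np.asarray(coords, dtype=float).copy()
    for m in range(G.shape[1]):
        axis = axes[m] / np.linalg.norm(axes[m])
        ang = G[:, m] * angles[m]                    # per-atom weighted angle
        c, s = np.cos(ang)[:, None], np.sin(ang)[:, None]
        proj = np.sum(out * axis, axis=-1, keepdims=True)
        # Rodrigues rotation of each row of `out` by its own angle
        out = out * c + np.cross(axis, out) * s + axis * proj * (1 - c)
    return out + G @ trans
```

Atoms with G[i, m] ≈ 0 for a segment are left untouched by that segment's transformation, so a near-one-hot G yields one approximately rigid motion per segment.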

Encoder and priors
We follow the classical variational-autoencoder framework.
The distribution q_ψ(z | I*) is given by a normal distribution N(μ(I*), diag(σ²(I*))), where μ ∈ R^L and σ ∈ R^L_+ are generated by a neural network with parameters ψ, taking an image I* as input. Additionally, the approximate posterior distribution over the parameters of the GMM is chosen to be Gaussian and independent of the input image:

q(μ_m) = N(ν_{μ_m}, β²_{μ_m}), q(σ_m) = N(ν_{σ_m}, β²_{σ_m}), q(α_m) = N(ν_{α_m}, β²_{α_m}),

where {ν_{μ_m}, β_{μ_m}, ν_{σ_m}, β_{σ_m}, ν_{α_m}, β_{α_m}}_{m=1,...,N_segm} are parameters that are directly optimized. In practice, we feed σ_m through an ELU+1 layer in order to avoid a negative or null standard deviation.
Note that applying the reparameterization trick (Kingma & Welling, 2014) is straightforward for a Gaussian distribution. Calculating the KL divergence between two Gaussian distributions, as in Equation (4), is also straightforward.
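Both ingredients have simple closed forms for diagonal Gaussians; a small sketch of the standard VAE identities (generic, not cryoSPHERE-specific code):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparam_sample(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through mu and log_var."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * np.asarray(log_var)) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    mu, log_var = np.asarray(mu), np.asarray(log_var)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

The same closed-form KL applies between any two diagonal Gaussians, which is what the ELBO in Equation (4) requires when the prior is not standard normal.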

Experiments
In this section, we test cryoSPHERE on different datasets and compare the results to cryoDRGN (Zhong et al., 2020). CryoDRGN is a state-of-the-art method for continuous heterogeneous reconstruction, in which the refinement occurs at the level of electron densities. To synthesize data for all experiments, we applied CTF corruption to the images and added Gaussian noise to achieve SNR ≈ 0.1. See Appendix A for a detailed description of the CTF. Furthermore, we make the assumption that we know the exact poses. However, since we use a structure S_0 which is different from the structures used to generate the datasets, these poses can only be an approximation. This is not the case for cryoDRGN, for which these poses are indeed exact, as this method does not use a base structure. Consequently, in this context, the comparison may introduce a bias in favour of cryoDRGN.

Toy dataset: a continuous motion.
For an initial assessment of cryoSPHERE, we examine a simulated dataset based on a bacterial phytochrome with UniProt (The UniProt Consortium, 2021) entry Q9RZA4. This protein is a dimer, with 755 residues per chain. To simulate a continuous heterogeneous set of images, we predicted the structure with AlphaFold multimer (Evans et al., 2021) and divided the protein into two domains. The first one consists of the full chain A and 598 residues of chain B, 1353 residues in total. The second domain comprises the last 157 residues of chain B, see Figure 3. We then sample 10^4 rotation angles from a two-component Gaussian mixture. Each conformation is generated by rotating the second domain around the axis (0, 1, 0) by one of the sampled rotation angles. This yields 10^4 different conformations. Finally, we uniformly sample 15 different poses for each structure, yielding a total of 150k images. Appendix A.1 provides more details, and Figure 3 shows examples of structures.
To test the segment decomposition, we then run cryoSPHERE requesting a division into N_segm = 4 segments. The program learnt a first and a third segment with 0 residues, a second segment with 1353 residues and a fourth segment with 157 residues (Figure 3). Thus, cryoSPHERE learnt segments according to the ground truth.
Moreover, Figure 3 shows that most of the predicted angles of rotation for the fourth segment are in excellent agreement with the ground-truth structural changes. However, approximately 12% of the dataset is predicted as belonging to the wrong mode. Figure 8 in the appendix shows that the norms of the predicted translations for both segments are negligible, also indicating that cryoSPHERE has recovered the ground truth. The same figure shows that we recover the right axis of rotation. In Appendix A.1, we plot the predicted angles against the means of the latent variable distributions output by the encoder. We observe that cryoSPHERE has recovered the circular motion: the greater the latent mean, the lower the predicted angle, with a quick transition from one mode to the other. This indicates that cryoSPHERE is able to recover the structure of single particles from cryo EM data.

A discrete conformation dataset: the SARS-CoV-2 spike protein
We now test cryoSPHERE on a dataset comprising two discrete conformations of the SARS-CoV-2 spike protein: a closed conformation with PDB entry 6ZP5 and an open conformation with PDB entry 6ZP7, see Melero et al. (2020).
We generate 150k images of size N_pix = 300 with uniformly sampled orientations and zero translations, with seventy percent corresponding to the closed conformation and thirty percent to the open one.
Prior to running cryoSPHERE, we obtain a base structure by first running the multimer version of AlphaFold (Evans et al., 2021) on the amino-acid sequence of the spike protein, selecting the best structure according to AlphaFold. Second, we utilize a custom backprojection algorithm provided by Zhong et al. (2020) to get a homogeneous reconstruction. This yields a volume with low resolution in the mobile parts but decent resolution in the less mobile ones. To reduce the potential bias introduced by using an AlphaFold structure, we subsequently fit this structure into the volume, initially performing rigid-body fitting in ChimeraX (Meng et al., 2023) and then refining with ISOLDE (Croll, 2018). Subsequently, we run cryoSPHERE with this fitted structure as the base structure for 48 hours on a single GPU, totalling 174 epochs.
We define the two conformations in terms of the angle between two domains, see Figure 4.The same figure demonstrates that we recover the angles within roughly 1 degree.
After clustering the structures predicted by cryoSPHERE based on the angle defined in Figure 4, only 4 of them are predicted to be in the wrong conformation. Additionally, by turning the structures predicted by cryoSPHERE into volumes using our operator (1), we observe excellent agreement with the ground-truth volumes, as shown by the Fourier shell correlation (FSC) curves (Klaholz, 2015). This illustrates that cryoSPHERE can effectively recover discrete conformational disorder.
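The FSC used here follows its standard definition: the correlation of the Fourier coefficients of two volumes within spherical frequency shells of increasing radius. A minimal NumPy sketch (binning shells by integer radius is a common convention, not a detail from the paper):

```python
import numpy as np

def fourier_shell_correlation(vol_a, vol_b, n_shells=None):
    """FSC between two cubic volumes of equal shape (n, n, n)."""
    n = vol_a.shape[0]
    fa = np.fft.fftshift(np.fft.fftn(vol_a))
    fb = np.fft.fftshift(np.fft.fftn(vol_b))
    # Radial frequency index of every voxel, measured from the DC component.
    coords = np.arange(n) - n // 2
    grids = np.meshgrid(coords, coords, coords, indexing="ij")
    shells = np.sqrt(sum(g**2 for g in grids)).astype(int)
    n_shells = n_shells or n // 2
    fsc = np.empty(n_shells)
    for s in range(n_shells):
        mask = shells == s
        num = np.sum(fa[mask] * np.conj(fb[mask]))
        den = np.sqrt(np.sum(np.abs(fa[mask]) ** 2) * np.sum(np.abs(fb[mask]) ** 2))
        fsc[s] = (num / den).real if den > 0 else 0.0
    return fsc
```

The 0.5 and 0.143 cutoffs discussed later are read off these curves: the first shell at which the correlation drops below the cutoff defines the reported resolution.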

Molecular dynamics dataset: bacterial phytochrome.
We now test cryoSPHERE on a more difficult dataset. This time, we simulate a continuous motion of a bacterial phytochrome, with PDB entry 4Q0J (Burgie et al., 2014). This dimer has 503 residues per chain. We define two upper domains corresponding to residues 321-502 of both chains A and B. We then run a Molecular Dynamics (MD) simulation to simulate a dissociation of these two domains, as depicted in Figure 9 in Appendix A.2. We take 10^4 structures along the trajectory and sample 15 poses uniformly for each structure. Refer to Appendix A.2 for more details.
Analogously to Section 4.2, we obtain an AlphaFold structure that we fit into a volume reconstructed by backprojection. Further details can be found in Appendix A.2. Finally, we run our method with the fitted AlphaFold structure as our base structure and N_segm = 20.
Since cryoSPHERE outputs structures, we can compute the predicted distances between the above-mentioned domains.
We plot the true and recovered histograms of the distances between the two domains in Figure 5. Additionally, we plot the predicted distances against the true distances in the same figure.These two plots show that the predicted structures are in excellent agreement with the ground truth structures.This is even true for the most dissociated structures, even though the majority of the structures are in a closed conformation.
Figure 6 shows the recovered segments as well as a selection of predicted structures and the corresponding ground truths.
Our model has learnt to separate the two mobile top domains from the fixed bottom one.
For comparison, we train cryoDRGN for the same duration using the same single GPU.We use the default settings of cryoDRGN, as detailed in Appendix A.2.
We plot the mean of the FSC curves between the predicted volumes and the corresponding ground-truth volumes in Figure 6, for both cryoSPHERE and cryoDRGN. This plot shows that we perform better than cryoDRGN at both the 0.5 and 0.143 cutoffs. There are two reasons behind this.
First, we fit our base structure into a volume reconstructed from a homogeneous reconstruction method. This aligns the non-moving parts of the protein with the right places, as given by the dataset. Without this step, it is likely that some medium-scale parts of the protein would have been misplaced, potentially reducing the performance of our algorithm at the 0.5 cutoff; see Appendix A.2 for evidence of this. It is important to note that this step significantly improves our results, and it comes at almost no extra cost since it relies on the backprojection algorithm.
Secondly, deforming a good AlphaFold structure offers finer details than cryoDRGN. This accounts for the better performance of cryoSPHERE over cryoDRGN at the 0.143 cutoff. Refer to Appendix A.2 for examples of reconstructed structures and volumes.
Moreover, our method demonstrates robustness to a limited number of images.Even with only one pose per structure (totalling 10 4 images), cryoSPHERE maintains similar performance, while cryoDRGN suffers a significant drop in the FSC curves.Further details can be found in Appendix A.2.

Discussion
CryoSPHERE presents several advantages compared to other methods for volume and structure reconstruction. Efficiency in Deformation: Deforming a base structure into a density map avoids the computationally expensive N_pix^2 evaluation required by a decoder neural network in methods implicitly parameterizing the grid, such as Zhong et al. (2021a); Levy et al. (2022). The neural network size is also vastly reduced compared to state-of-the-art methods which explicitly parameterize the density grid of size N_pix^3, see e.g. Punjani & Fleet (2023). Furthermore, deforming a structure directly avoids the need for subsequent fitting into the recovered density map.
Reduced Dimensionality and Noise Resilience: Learning one rigid transformation per segment, where the number of segments is much smaller than the number of residues, reduces the dimensionality of the problem. This results in a smaller neural network compared to approaches acting on each residue, such as Rosenbaum et al. (2021). Rigidly moving large portions of the protein corresponds to low-frequency movements, which are less prone to noise pollution than the high-frequency movements associated with moving each residue independently. In addition, since our goal is to learn one rotation and one translation per segment, a latent variable of dimension 6 × N_segm is, in principle, a sufficiently flexible choice to model any transformation of the base structure. Choosing the latent dimension is more difficult for volume reconstruction methods such as Zhong et al. (2021a).
Interpretability: CryoSPHERE outputs segments along with one rotation and one translation per segment, providing valuable and interpretable information. Practitioners can easily interpret how different parts are moving based on the transformations the network outputs. This interpretability is often challenging for deep learning models such as Zhong et al. (2021a); Rosenbaum et al. (2021). Section 4 demonstrates cryoSPHERE's capability to recover continuous and discrete heterogeneity while performing structure reconstruction. It also demonstrates that even though the choice of N_segm impacts the FSC to the ground truth, cryoSPHERE is able to recover the correct motion over the entire range of N_segm values. It is also able to keep only the minimum necessary number of domains when the user sets an unnecessarily high N_segm, as seen in Section 4.1.
While using a base structure S_0 provides detailed information missed by cryoDRGN, as evidenced by our better 0.143 cutoffs, it can introduce bias in medium-scale elements, especially if N_segm is too small. We alleviate this problem by fitting S_0 into a volume reconstructed using a simple backprojection algorithm. This, combined with cryoSPHERE, achieves better 0.5 cutoffs than cryoDRGN, showcasing its effectiveness in debiasing the results. If such a volume is unavailable, simply increasing N_segm can reduce the bias.
Our study paves the way for significant advancements in predicting protein ensembles and dynamics, critically important for unraveling the complexity of biological systems. By predicting all-atom structures from cryo EM datasets through more realistic deformations, our work lays the foundation for extracting direct insights into thermodynamic and kinetic properties. In the future, we anticipate the ability to predict rare and high-energy intermediate states, along with their kinetics, a feat beyond the reach of conventional methods such as molecular dynamics simulations.
It would be interesting to assess how much our segmentation correlates with bottom-up segmentation into domains conducted at the "omics" scale, see e.g. Lau et al. (2023). To achieve this quantitatively, we would need many examples of moving segments from cryo EM investigations to match the millions of segments from the "omics" studies. We therefore leave this investigation to later work.
Finally, this work is an important milestone in showing that one can learn, in an end-to-end fashion, a segmentation of the protein that is intimately linked to the change of conformation of the underlying protein. This is in contrast to other works that perform the segmentation based only on residue distances of the underlying atomic model, usually not in an end-to-end way. The application of cryoSPHERE to real datasets requires a number of careful choices that we leave to future work.

A. Experiments
In this appendix, we provide more details on the experiments of Section 4.
In all the experiments of Section 4, each posed structure is converted into a volume, which is then projected into 2D images according to our image formation model in (2), with σ = 1. The same CTF corruption is applied with parameters specified in Table 1, and finally noise is added to achieve an SNR equal to 0.1. Here, the SNR is defined as the ratio of the variance of the images to the variance of the noise.
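This noise convention can be sketched as follows (a minimal numpy example with hypothetical names; the authors' actual pipeline is not specified at this level of detail):

```python
import numpy as np

def add_noise(images, snr=0.1, rng=None):
    """Add white Gaussian noise so that var(images) / var(noise) equals snr."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(images.var() / snr)  # noise std implied by the target SNR
    return images + rng.normal(scale=sigma, size=images.shape)

rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 64, 64))  # placeholder "clean" projections
noisy = add_noise(clean, snr=0.1, rng=rng)
```

An SNR of 0.1 means the noise variance is ten times the signal variance, i.e. the individual particle images are dominated by noise.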

A.1. Toy dataset
For this experiment, we predict the phytochrome structure using AlphaFold-Multimer (Evans et al., 2021) from its amino acid sequence, with the UniProt (The UniProt Consortium, 2021) entry Q9RZA4. This protein forms a dimer with 755 residues in each chain. We define two domains for simulation purposes. The first domain comprises the first chain and the first 598 residues of the second chain. The second domain consists of the remaining 157 residues of the second chain. In Figure 3, we present the base structure, the decomposition into the two domains, as well as the structures corresponding to the deformed base. These deformations represent the mean rotation of each mode. We rotate the second domain around the (0, 1, 0) axis, sampling 10^4 rotation angles from a two-mode Gaussian mixture. This gives 10^4 structures. For each structure, we uniformly sample 15 rotation poses and 15 translation poses on [−10, 10]^2. The structure undergoes rotation and translation, and is then turned into an image according to the image formation model (2). Subsequently, the images undergo CTF corruption, and noise is added to achieve an SNR ≈ 0.1. This process generates a total of 150k images, each of size N_pix = 220.
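The sampling step above can be sketched as follows. Only the two-mode structure of the mixture and the (0, 1, 0) rotation axis come from the text; the mixture means and standard deviations, as well as the domain coordinates, are placeholders.

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)

# Sample 10^4 rotation angles from a two-mode Gaussian mixture
# (means/stds below are placeholders, not the values used in the paper).
n = 10_000
modes = rng.integers(0, 2, size=n)
means, stds = np.array([-0.5, 0.5]), np.array([0.1, 0.1])
angles = rng.normal(means[modes], stds[modes])

# Rotate the mobile 157-residue domain around the (0, 1, 0) axis.
domain = rng.normal(size=(157, 3))                   # placeholder coordinates
axis = np.array([0.0, 1.0, 0.0])
mats = Rotation.from_rotvec(angles[:, None] * axis).as_matrix()
structures = np.einsum('nij,rj->nri', mats, domain)  # one structure per angle
```

Each of the resulting 10^4 structures is then posed, projected, CTF-corrupted and noised to yield the images.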
We run cryoSPHERE with N_segm = 4 for 48 hours on a single NVIDIA A100 GPU, equivalent to 779 epochs.
Due to computational constraints, the plots in Subsection 4.1 are based on only 10000 images, one per conformation.
Figure 7 illustrates the predicted angles against the latent means, demonstrating that the model effectively learns rotational motion.
Additionally, in Figure 8, we plot the distribution of the predicted translation norms for the two segments, the predicted angles of rotation for segment 2, and the distribution of the dot product between the predicted axis of rotation for segment 4 and the true axis of rotation (0, 1, 0). Our model predicts no translation for either segment and no rotation for the second segment.
The axis of rotation is almost exactly recovered.

A.2. Molecular dynamics dataset
We take the structure of a phytochrome with PDB ID 4Q0J (Burgie et al., 2014) and define two domains: residues 321 to 502 of the first chain and residues 321 to 502 of the second chain. To simulate the dissociation process of the two upper domains, we perform metadynamics (Barducci et al., 2011) simulations in GROMACS (Abraham et al., 2015; Pronk et al., 2013) with the PLUMED 2 implementation (Tribello et al., 2014). The collective variable chosen is the distance between the self-defined centers of mass (COMs) of the upper domains (residues 321-502 of chains A and B). A 100 ns simulation is conducted in the NpT ensemble, maintaining pressure control through the Parrinello-Rahman barostat. Gaussian deposition occurs every 5000 steps, with a height of 0.1 kJ/mol and a width of 0.05 nm. Afterwards, we extract 10^4 structures along the dry trajectory. See Figure 9 for examples of structures. From this, two datasets are generated.
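The metadynamics setup described above corresponds to a PLUMED input along the following lines. This is a sketch only: the atom ranges are placeholders, and only the collective variable, the deposition pace, and the Gaussian height and width are taken from the text.

```
# Centers of mass of the two upper domains (atom ranges are placeholders)
com_a: COM ATOMS=1-2900
com_b: COM ATOMS=2901-5800
d: DISTANCE ATOMS=com_a,com_b

# Gaussians deposited every 5000 steps, height 0.1 kJ/mol, width 0.05 nm
metad: METAD ARG=d PACE=5000 HEIGHT=0.1 SIGMA=0.05
PRINT ARG=d,metad.bias FILE=COLVAR
```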
The first dataset consists of 15 rotation poses uniformly sampled per structure. Translation poses are neglected in this dataset, resulting in a total of 150k images.
The same process is followed to generate the second dataset, with the exception that only one rotation pose per structure is used. This results in 10k images.
For both datasets, we use N_pix = 190, and each pixel is of size 1 Å. The encoder is a 4-hidden-layer neural network with fully connected hidden layers of dimensions 2048, 1024, 512 and 512. The decoder is a 2-hidden-layer neural network with fully connected hidden layers of dimension 350 each. We set the batch size to 100 and use the Adam optimizer (Kingma & Ba, 2017) with a learning rate of 0.0003, unless otherwise stated. The latent dimension is set to 40 with N_segm = 6 for all runs, unless stated otherwise. For the 150k dataset, we train for 24 hours on a single NVIDIA A100 GPU, corresponding to 642 epochs. The 10k dataset is trained for 12 hours on the same single GPU, corresponding to 4114 epochs.
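The shapes of the encoder and decoder can be illustrated with a plain numpy forward pass. This is a sketch only: the activation function and weight initialisation are assumptions, and the VAE sampling head is omitted.

```python
import numpy as np

def mlp(x, dims, rng):
    """Forward pass through fully connected layers with ReLU on hidden layers."""
    for i, (d_in, d_out) in enumerate(zip(dims[:-1], dims[1:])):
        w = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_in, d_out))
        x = x @ w
        if i < len(dims) - 2:      # no activation on the output layer
            x = np.maximum(x, 0.0)  # ReLU (assumed activation)
    return x

rng = np.random.default_rng(0)
n_pix, latent_dim, n_segm = 190, 40, 6

# Encoder: flattened image -> hidden layers 2048, 1024, 512, 512 -> latent
z = mlp(rng.normal(size=(1, n_pix * n_pix)),
        [n_pix * n_pix, 2048, 1024, 512, 512, latent_dim], rng)

# Decoder: latent -> hidden layers 350, 350 -> 6 parameters per segment
transforms = mlp(z, [latent_dim, 350, 350, 6 * n_segm], rng)
```

Note that the decoder output dimension, 6 × N_segm = 36, matches the one-rotation-plus-one-translation-per-segment parameterisation discussed above.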
To compare cryoSPHERE to cryoDRGN, we train cryoDRGN using default settings (a batch size of 8 and a latent dimension of 8) for the same duration, on the same single GPU used for our algorithm. To maintain consistency when comparing volumes generated from cryoDRGN, our method, and the ground truth, we convert the structures (both the ground truth and those predicted by cryoSPHERE) into volumes using the same image formation model employed to generate the dataset.
For computational efficiency, the FSC plots and distance plots are not based on all structures. In the case of the 150k dataset, only one image per structure is used to compute the distances. All images in the 10k dataset are used. Consequently, the distance plots are based on 10k images in both cases. For the computation of the FSC curves, we select only 1000 images evenly distributed among these 10,000 structures.
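The FSC curves referred to throughout can be computed along the following lines. This is the standard textbook definition in numpy, not the authors' exact implementation; binning shells by integer radius is a simplification.

```python
import numpy as np

def fsc(vol_a, vol_b):
    """Fourier shell correlation between two cubic volumes of equal size."""
    n = vol_a.shape[0]
    fa = np.fft.fftshift(np.fft.fftn(vol_a))
    fb = np.fft.fftshift(np.fft.fftn(vol_b))
    grid = np.indices(vol_a.shape) - n // 2
    radius = np.sqrt((grid ** 2).sum(axis=0)).astype(int)
    curve = []
    for shell in range(1, n // 2):
        mask = radius == shell
        num = (fa[mask] * np.conj(fb[mask])).sum()
        den = np.sqrt((np.abs(fa[mask]) ** 2).sum() * (np.abs(fb[mask]) ** 2).sum())
        curve.append((num / den).real)
    return np.array(curve)

vol = np.random.default_rng(0).normal(size=(32, 32, 32))
identical = fsc(vol, vol)  # identical volumes give FSC = 1 at every shell
```

The 0.5 and 0.143 cutoffs are then read off as the spatial frequencies at which this curve first drops below those values.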
A.2.1. 150K DATASET
Now, we test the robustness of our method to the choice of the base structure.
First, we perform two runs with N_segm = 6. For the first run, we use the same AlphaFold structure as in Section 4.3, before it was fitted into the volume obtained by backprojection. In the second run, we use the same AlphaFold structure, but fitted as described in Section 4.3.
Figure 10 displays the results for the unfitted run. Despite cryoSPHERE showing excellent agreement with the ground truth in terms of distances compared to cryoDRGN, it misses some medium-scale elements, while maintaining a better level of small-scale detail. We observe underperformance at the 0.5 cutoff but outperformance at the 0.143 cutoff. For example, considering the ground-truth volume of the first MD structure and the corresponding reconstructed volumes from cryoSPHERE and cryoDRGN in Figure 13, we note that when only using an AlphaFold structure without postprocessing, our method can be slightly off on some medium-scale elements, especially at the bottom of the protein. Since N_segm = 6 is not sufficient to break these elements into separate segments, cryoSPHERE does not move them, retaining the features of the AlphaFold structure, potentially misplaced. In contrast, cryoDRGN seems to be closer to the ground truth on these elements. However, upon closer inspection of the reconstructed volumes, we observe that cryoSPHERE gives much finer details throughout the volume, a characteristic not replicated by cryoDRGN.
Figure 11 showcases the results for the fitted run. CryoSPHERE now outperforms cryoDRGN at both the 0.5 and 0.143 cutoffs while maintaining excellent agreement with the ground truth. The fitting of the AlphaFold structure has a significantly positive impact on the performance of our method.
For both runs, we compare the reconstructed volumes of the same image from our method to the ground truth in Figure 14. We observe that, throughout the volume, the parts that were off compared to the ground truth with the unfitted AlphaFold structure are much better placed with the fitted AlphaFold structure.
Next, we conduct two additional runs with the 150k dataset: one using the first structure from the MD simulation as the base structure and the other using the last. Figure 16 displays the results for the first MD structure as base structure, showcasing excellent agreement with the ground truth. In addition, the FSC curves experience a drop compared to the fitted AlphaFold structure, but they are similar to the unfitted AlphaFold structure. This is likely because using a base structure belonging to the dataset introduces less bias than using an AlphaFold-generated structure with no fitting. Figure 17 presents the same plots, starting from the last MD structure. The conclusions are the same, although we tend to overestimate the distances to some extent. However, the FSC curves are significantly improved and now match the fitted AlphaFold structure run.
Figure 18 shows the segment decompositions recovered by cryoSPHERE for all these runs, revealing their overall similarity. Notably, two small segments are identified in the helices linking the upper part to the lower one. This is not problematic, since our decomposition is soft, allowing for smooth transitions between no motion and the motion of the upper domains.
As expected, the quality of the results depends on the quality of the base structure used. Fortunately, AlphaFold can provide reasonable structures that we can further fit into density maps obtained with traditional methods. This approach combines the strengths of both worlds: the more static parts of a protein can exhibit good resolution with traditional tools, and we can fit in the corresponding part of an AlphaFold structure, thereby debiasing the method to some extent. On the more mobile parts, where traditional methods may have lower resolution, AlphaFold can provide improved resolution.
In a final experiment, we increase the number of segments to N_segm = 16, using the unfitted AlphaFold structure as the base structure. We run our algorithm for 24 hours on the same single GPU used in the previous experiments. Figure 12 demonstrates excellent agreement with the ground truth in terms of distances. The FSC curves indicate that increasing the number of segments actually improves the results, approaching the performance of the run with the fitted AlphaFold structure, despite using the unfitted one. Figure 15 visually confirms this improvement. As expected, more segments allow the model to adjust better to the data, reducing the bias introduced by the AlphaFold structure. However, this improvement comes at a cost: in 24 hours, cryoSPHERE with the unfitted AlphaFold structure and 6 segments completes 642 epochs, whereas the run with 16 segments completes only 421 epochs.

A.2.2. 10K DATASET
To assess the performance under limited-data conditions, we run our method with only 10^4 images, N_segm = 6, and the same unfitted AlphaFold base structure and settings as in Subsection A.2.1. We train our network and cryoDRGN for 12 hours only.
Figure 19 demonstrates that cryoSPHERE maintains excellent agreement with the ground truth and is not significantly impacted by the scarcity of data. On the contrary, the FSC curves indicate that the performance of cryoDRGN is affected. This is further confirmed by Figure 20, where cryoSPHERE has roughly the same FSC curves for the 10k and 150k datasets, while cryoDRGN experiences a significant drop in performance. This is likely because the AlphaFold structure provides strong prior knowledge.
Additionally, we plot the recovered segments in Figure 18. Despite having fewer images, our method successfully divides the protein into meaningful parts for the given task.

Figure 1. Flow chart of our network. The learnable parts of the model are the encoder, the decoder and the Gaussian mixture. Note that even though the transformations predicted by the decoder are on a per-image basis, this is not the case for the Gaussian mixture, which is shared across all particles.

Figure 2. Example of segments recovered with a Gaussian mixture of 6 components.

Figure 3. Toy dataset. Leftmost: Histograms of the predicted and true angles of rotation in radians. The true angles are in green; the recovered angles are in blue. Left: Predicted against true angles in radians. The black line represents x = y. Middle: Base structure with the two domains, also used to generate the images. The domain in blue is rotated around the axis (0, 1, 0) with angles of rotation sampled from the green distribution in the leftmost panel. Note that the segments predicted by cryoSPHERE exactly match the two domains. The fourth segment is in blue and corresponds to the last 157 residues of chain B, matching the ground-truth domain exactly. Right: First mode structure. Rightmost: Second mode structure.

Figure 4. Spike dataset. Left: Predicted angles. Middle left: FSC curves for the open and closed conformations. Middle: Closed conformation. Middle right: Open conformation. Right: Definition of the angles.

Figure 5. MD dataset. Left: Histograms of the distances between the two upper domains. The true distances are in green; the recovered distances are in blue. Right: Predicted against true distances in Ångström. The black line represents x = y.

Figure 6. MD dataset. Left: Recovered segments on the fitted AlphaFold structure. The colors denote different contiguous segments; since colors are reused, two segments sharing the same color does not mean their residues belong to the same segment. Middle left to middle right: cryoSPHERE predicted structures in green, corresponding ground truth in blue. Right: Fourier shell correlation curves. The solid blue line is the mean across structures for cryoSPHERE; dotted blue represents the mean ± one standard deviation. The solid red line represents the mean for cryoDRGN across volumes, with the dotted red lines representing the mean ± one standard deviation.

Figure 7. Toy dataset. Predicted angle against latent mean for cryoSPHERE. Note that, for clarity, 0.3 percent of the points were removed.

Figure 8. Toy dataset. Leftmost: Boxplot of the norms of the translations predicted for the fourth segment. Middle left: Boxplot of the norms of the translations predicted for the second segment. Middle right: Boxplot of the predicted angles of rotation for the second segment, in radians. Rightmost: Boxplot of the dot product between the predicted axis of rotation for the fourth segment and the true axis of rotation (0, 1, 0). CryoSPHERE recovers the right axis of rotation almost perfectly.

Figure 9. From left to right: 1/ The closed starting conformation of the MD simulation. 2/ A medium-open structure arising from the MD simulation. 3/ An open conformation towards the end of the MD simulation. 4/ The AlphaFold structure used as the base structure.

Figure 10. MD 150k dataset, unfitted AlphaFold structure. Left: Histograms of the distances between the two upper domains. The true distances are in green; the recovered distances are in blue. Middle: Predicted against true distances in Ångström. The black line corresponds to x = y. Right: Fourier shell correlation curves. The solid blue line is the mean across structures for cryoSPHERE; dotted blue represents the mean ± one standard deviation. The solid red line represents the mean of cryoDRGN across volumes, with the dotted red lines representing the mean ± one standard deviation.

Figure 11. MD 150k dataset, fitted AlphaFold structure. Left: Histograms of the distances between the two upper domains. The true distances are in green; the recovered distances are in blue. Middle: Predicted against true distances in Ångström. The black line corresponds to x = y. Right: Fourier shell correlation curves. The solid blue line is the mean across structures; dotted blue represents the mean ± one standard deviation. The solid red line represents the mean of cryoDRGN across volumes, with the dotted red lines representing the mean ± one standard deviation.

Figure 12. 150k dataset with the unfitted AlphaFold structure and 16 segments. Left: Histograms of the distances between the two upper domains. The true distances are in green; the recovered distances are in blue. Middle: Predicted against true distances in Ångström. The black line corresponds to x = y. Right: Fourier shell correlation curves for cryoSPHERE. The mean of the unfitted AlphaFold run with 6 segments is in blue; dotted blue represents the mean ± one standard deviation. The solid red line represents the mean of the unfitted AlphaFold run with 16 segments, with the dotted red lines representing the mean ± one standard deviation. The solid green line represents the mean of the run with the fitted AlphaFold structure and 6 segments.

Figure 13. Left: Comparison of a ground-truth volume and the corresponding reconstructed volume from cryoSPHERE; gray is cryoSPHERE, purple is the ground truth. Middle left: Comparison of a ground-truth volume and the corresponding reconstructed volume from cryoDRGN; pink is cryoDRGN, purple is the ground truth. Middle right: cryoSPHERE. Right: cryoDRGN.

Figure 14. Left: Comparison of a ground-truth volume and the corresponding reconstructed volume from cryoSPHERE with the unfitted AlphaFold structure; gray is cryoSPHERE, purple is the ground truth. Middle left: Comparison of a ground-truth volume and the corresponding reconstructed volume from cryoSPHERE with the fitted AlphaFold structure; blue is fitted, purple is the ground truth. Middle right: cryoSPHERE without fitting. Right: cryoSPHERE with fitting.

Figure 15. Comparison of volumes for an image of the MD 150k dataset. Gray is cryoSPHERE with the unfitted AlphaFold structure and 6 segments; tan is the same with 16 segments; blue is the ground truth. Having 16 segments leads to a volume matching the ground truth much better than 6 segments. This holds for medium-scale elements throughout the volume, even at the bottom of the protein, which tends to stay still. Note the coil in the middle right of the figure: with 16 segments the method can break this coil into its own segment and adjust its position, which is not the case with only 6 segments.

Figure 16. Plots for the MD 150k dataset starting from the first MD structure. Left: Histograms of the true and predicted distances. Middle: Predicted against true distances. Right: Mean FSC curves for the AlphaFold base structure and the first MD structure as a base structure.

Figure 17. Plots for the MD 150k dataset with the last MD structure. Left: Histograms of the true and predicted distances. Middle: Predicted against true distances. Right: Mean FSC curves for the AlphaFold base structure and the last MD structure as a base structure.

Figure 18. Examples of recovered segments. From left to right: 10k dataset, 6 segments; 150k dataset, unfitted AlphaFold structure, 6 segments; 150k dataset with the first MD structure as the base structure, 6 segments; 150k dataset with the last MD structure as the base structure, 6 segments; 150k dataset, unfitted AlphaFold structure, 16 segments.

Figure 19. MD 10k dataset. Left: Histograms of the true and predicted distances. Middle: Predicted against true distances. Right: Mean FSC curves for cryoSPHERE with the same base AlphaFold structure as for the 150k dataset, and for cryoDRGN run on this 10k dataset. The blue lines are for cryoSPHERE, the red lines for cryoDRGN.

Table 1. Parameters used for the CTF corruption of the generated images. The same values were used for all images of all datasets.