Abstract
Cryogenic electron microscopy (cryo-EM) is now widely used for determining the structures of multi-chain protein complexes. However, modeling a complex structure is challenging, particularly when the map resolution is low, typically in the intermediate resolution range of 5 to 10 Å. Within this resolution range, even accurate structure fitting is difficult, let alone de novo modeling. To address this challenge, here we present DiffModeler, a fully automated method for modeling protein complex structures. DiffModeler employs a diffusion model for backbone tracing and integrates AlphaFold2-predicted single-chain structures for structure fitting. Extensive testing on cryo-EM maps at intermediate resolutions demonstrates the high accuracy of DiffModeler in structure modeling, achieving an average TM-Score of 0.92 and significantly surpassing existing methods. Notably, DiffModeler successfully modeled a protein complex composed of 47 chains and 13,462 residues, achieving a high TM-Score of 0.94. Further benchmarking at low resolutions (10-20 Å) confirms its versatility, demonstrating plausible performance. Moreover, when coupled with CryoREAD, DiffModeler excels in constructing protein-DNA/RNA complex structures for near-atomic resolution maps (0-5 Å), showcasing state-of-the-art performance with average TM-Scores of 0.88 and 0.91 across two datasets.
Introduction
Proteins are fundamental molecules that carry out numerous functions in living organisms, including enzyme catalysis, cell signaling, and transport of molecules. Cryogenic electron microscopy (cryo-EM) has gained significant popularity among experimental protein structure determination techniques1–3. This technique is increasingly favored due to several advantages, notably its superior capacity to determine the three-dimensional (3D) structures of large macromolecular complexes.
While reported map resolutions in the literature have generally shown steady improvement over recent years, it remains common to encounter intermediate resolutions (∼5-10 Å) in real-life lab scenarios, posing challenges for structure modeling. When the map resolution is better than 5 Å, direct tracing of the main chains of proteins4–8 and nucleic acids9 has now become feasible due to recent modeling methods that leverage deep learning to detect atom positions within the map. However, for maps within the intermediate resolution range (5-10 Å), de novo modeling is generally not viable because the identification of amino acid residues and atoms remains elusive even with deep learning techniques. Hence, a practical approach involves conducting structure fitting using methods such as Phenix10, Flex-EM11, Assembline12, MultiFit13, Chimera14, MarkovFit15 and VESPER16, or employing manual fitting with known structures from PDB17 or predicted structure models18. Secondary structure detection methods for EM maps19–21 can aid in protein structure fitting. Although many structures are determined through structure fitting, accurately orienting molecules within a noisy map in this resolution range remains challenging, especially for complexes comprising multiple subunits. The successful development of an automatic and precise structure fitting method for EM maps at intermediate resolutions would significantly support structural biologists.
Here, we developed DiffModeler, a fully automated structure fitting method for fitting large protein complex structures into cryo-EM maps with resolutions ranging from 5 to 10 Å. DiffModeler uses a diffusion model22–24 to enhance the map, which aids in finding precise fitting poses for these structures. The diffusion model is a parameterized Markov chain trained using variational inference to generate samples that match the underlying data after a finite time. Notably, the diffusion model has demonstrated considerable success in various areas of image processing, such as image generation23–26, segmentation27,28, and translation29,30. Additionally, it has found applications in bioinformatics, including protein docking31 and protein design32,33. Building upon these successful applications, DiffModeler integrates the diffusion model to enhance the extraction of structural information, facilitating accurate structure modeling for cryo-EM maps at intermediate resolutions.
To the best of our knowledge, this is the first fully automated and accurate method for modeling protein complex structures in maps at intermediate resolutions. DiffModeler initiates the process by tracing protein backbones within a cryo-EM map, employing a diffusion model designed to capture the distinctive local density patterns representing protein backbones. Simultaneously, we use AlphaFold2 (AF2)34, the cutting-edge protein structure prediction method, to generate high-quality single-chain structures. Subsequently, the structure models from AF2 are fitted into the traced backbone map, producing many candidate poses through the VESPER16 structure fitting program. Ultimately, the complete protein complex structure is assembled by combining candidate poses of the constituent subunits.
A benchmark conducted on 19 EM maps ranging from 5.0 Å to 10.0 Å resolution, depicting large protein complexes of 3 to 47 subunits, demonstrated that modeling with DiffModeler substantially outperformed conventional methods10,16. On average, the models by DiffModeler had a TM-score35 of 0.92, a significant contrast to the conventional methods, which yielded an average TM-score of 0.41. Extending our evaluation, we further benchmarked DiffModeler on 6 experimental maps at a low resolution of 10 to 20 Å, where DiffModeler modeled the structures with TM-Scores of 0.27 to 0.97. Additionally, we integrated DiffModeler with CryoREAD, our DNA/RNA structure modeling method9, to build protein-nucleic acid complex structures in two datasets comprising 61 and 28 maps at a resolution of up to 5 Å. This combined protocol showcased state-of-the-art performance, delivering average TM-Scores of 0.88 and 0.91, respectively.
Results
Overview of DiffModeler
We begin by explaining the DiffModeler algorithm depicted in Fig. 1. DiffModeler comprises four major steps. First, it detects the protein backbone positions in the input cryo-EM map after enhancing the map using a trained diffusion model. Second, it models individual protein structures using AF2. Third, the structure models are fitted to the enhanced map using VESPER. Lastly, it selects and combines fitted single-chain poses to build the complete protein complex structure within the map. Below, we provide more information on each step.
Backbone Tracing via Diffusion Model
Achieving accurate structure fitting for maps of intermediate resolution is nontrivial. To achieve higher accuracy, the main innovation of DiffModeler is to use a diffusion model to accentuate the density that belongs to the protein backbone. The input map is scanned with a 64³ Å³ box with a stride of 32 Å along the map grid, which has a 1 Å interval. Given a box of cryo-EM density, the encoder of the conditional diffusion model computes an embedding of the input density box. Subsequently, the decoder starts with random Gaussian noise as the initial density distribution and iteratively refines its estimate to bring it closer to the ground-truth traced backbone, conditioned on the embedding from the encoder and the initial density input. The diffusion process is illustrated in Fig. 1b. This traced backbone provides clearer information for structure fitting than the original map. The diffusion model is trained via the denoising diffusion implicit model (DDIM) framework. During training, the main objective of the model is to perform conditional denoising of a noisy density of the traced backbone to recover the ground-truth traced protein backbone density in the map. The overall framework is optimized via the Dice loss36, which considers the agreement between the identified and ground-truth backbone positions. The training and inference frameworks are presented in Extended Data 1 and Extended Data 3, respectively, and further details regarding training and inference can be found in Methods.
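The box scan and the Dice objective described above can be sketched as follows. This is a minimal illustration, not the DiffModeler implementation: the function names are hypothetical, the handling of map edges (an extra box placed flush with the far edge) is our assumption, and the Dice loss is the generic soft form with a small smoothing constant that the text does not specify.

```python
def box_origins(dim, box=64, stride=32):
    """Start indices of boxes covering [0, dim) along one axis (1 A grid)."""
    if dim <= box:
        # Assumption: a map smaller than the box is covered by one (padded) box.
        return [0]
    starts = list(range(0, dim - box + 1, stride))
    if starts[-1] != dim - box:
        # Assumption: add one box flush with the far edge so no voxel is missed.
        starts.append(dim - box)
    return starts


def scan_boxes(shape, box=64, stride=32):
    """Yield the (x, y, z) origin of every box used to tile a 3D map."""
    for x in box_origins(shape[0], box, stride):
        for y in box_origins(shape[1], box, stride):
            for z in box_origins(shape[2], box, stride):
                yield (x, y, z)


def dice_loss(pred, target, eps=1e-6):
    """Generic soft Dice loss over flattened voxel values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)
```

With a stride of half the box size, interior voxels are covered by several boxes; how overlapping predictions are merged is described in Methods and is not reproduced here.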
Structure Prediction by AF2
In DiffModeler, we fit structures predicted by AF234 into the diffused map, in which the main chain is accentuated. While there are instances where AF2 models do not align with the proteins’ conformations in particular cryo-EM maps37, ample cases exist38–40 where AF2 models demonstrated sufficient accuracy to be effectively integrated into EM maps. Specifically, for maps with a resolution worse than 5 Å - aligning with DiffModeler’s objective, where de novo main-chain tracing becomes highly challenging - it would be pragmatic to consider AF2 models for structure modeling. Instead of generating new AF2 models, users can also use precomputed models available in the AlphaFold database18, which we employed in this work.
Structure Model Fitting with VESPER
The predicted structure models are fit to the diffused map using VESPER16, a structure and map fitting method developed in our group. Unlike conventional methods that rely solely on the correlation of map densities, VESPER takes into account the local density gradient within maps. This approach has demonstrated superior performance over existing methods16. The predicted structure models are converted into simulated maps at a 1 Å resolution. Subsequently, both the simulated maps derived from the models and the diffused main-chain map are transformed into local dense points (LDPs) using the mean-shift algorithm4,41. LDPs encapsulate the local salient features of the density and are more precise for alignment than the unprocessed maps alone. Using VESPER, each subunit is aligned with the diffused main-chain map, and the top 100 candidate poses are retained.
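To illustrate how a mean-shift procedure condenses grid density into local dense points, a minimal sketch is given below. It is a generic density-weighted mean shift with a Gaussian kernel, not the code used by VESPER; the function name, bandwidth, tolerance, and merging distance are all illustrative assumptions.

```python
import math


def mean_shift_ldps(points, densities, bandwidth=2.0, max_iter=50,
                    tol=1e-3, merge_dist=1.0):
    """Shift each grid point uphill in density and merge converged points.

    points    : list of (x, y, z) grid coordinates with positive density
    densities : density value at each point (must be > 0)
    """
    def shift(p):
        # Density-weighted Gaussian average of all grid points around p.
        wsum = 0.0
        acc = [0.0, 0.0, 0.0]
        for q, d in zip(points, densities):
            dist2 = sum((a - b) ** 2 for a, b in zip(p, q))
            w = d * math.exp(-dist2 / (2.0 * bandwidth ** 2))
            wsum += w
            for k in range(3):
                acc[k] += w * q[k]
        return tuple(a / wsum for a in acc)

    converged = []
    for p in points:
        for _ in range(max_iter):
            new_p = shift(p)
            moved = math.dist(p, new_p)
            p = new_p
            if moved < tol:
                break
        converged.append(p)

    # Merge points that converged to (nearly) the same location.
    ldps = []
    for p in converged:
        if all(math.dist(p, q) >= merge_dist for q in ldps):
            ldps.append(p)
    return ldps
```

For example, four grid points forming two well-separated pairs collapse into two LDPs, one near the center of each pair.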
Protein complex modeling by a greedy assembling algorithm
This phase is geared towards assembling the complete protein complex structure through an assembly of suitable poses from each subunit. To accomplish this, we have devised a greedy algorithm, which is explained in detail in the Methods section and visually outlined in Extended Data 4. In the preceding step, a collection of 100 poses has been constructed for each subunit, with each pose being evaluated based on a fitness score. From all combinations of subunit-pose pairs, we identify the subunit-pose with the highest score. Subsequently, we mask the local density of the map occupied by this selected subunit-pose pair and select the next best subunit-pose in the pool. This process iterates, systematically selecting the subsequent best subunit-pose pairs until all subunits seamlessly integrate into the emphasized diffused protein backbone map.
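The greedy selection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm given in Methods: each pose is represented by a hypothetical dictionary holding its subunit name, fitness score, and the set of grid voxels it occupies, and "masking" is approximated by rejecting any pose whose voxels overlap already-placed subunits by more than an assumed threshold. The actual method may instead rescore poses against the masked map.

```python
def greedy_assemble(poses, max_overlap=0.1):
    """Pick one pose per subunit, best score first, avoiding masked density.

    poses : list of dicts with keys
            'subunit' (str), 'score' (float), 'voxels' (non-empty set)
    Returns a dict mapping subunit name to its selected pose.
    """
    placed = {}
    masked = set()  # voxels already claimed by selected poses
    for pose in sorted(poses, key=lambda p: p["score"], reverse=True):
        if pose["subunit"] in placed:
            continue  # this subunit already has a pose
        overlap = len(pose["voxels"] & masked) / len(pose["voxels"])
        if overlap > max_overlap:
            continue  # pose sits in density that is already occupied
        placed[pose["subunit"]] = pose
        masked |= pose["voxels"]
    return placed
```

In this sketch, a high-scoring pose of subunit B that collides with the already-placed subunit A is skipped in favor of a lower-scoring pose of B in free density.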
Structure Modeling Performance
We assessed DiffModeler’s performance on an independent dataset comprising 19 maps determined at resolutions between 5.0 Å and 10.0 Å. These structures are nonredundant with respect to the training and validation datasets we used (see Methods). Supplementary Table 1 provides a comprehensive list of the maps included in this dataset. The number of protein residues in these maps ranged from 1,202 to 13,462, while the number of protein chains ranged between 3 and 47. Notably, 12 out of 19 maps include protein complexes with more than 3,000 residues in total, which is larger than the size that the state-of-the-art protein docking method, AlphaFold-Multimer42, was trained on and the size that the AlphaFold Colab notebook can handle.
Fig. 2 summarizes the modeling accuracy on the dataset from various perspectives. In Fig. 2a, we assessed the accuracy of the diffused backbone map generated by the diffusion model in the initial step of DiffModeler (as depicted in the traced backbone panel in Fig. 1). We computed recall and precision of the grid points within the diffused map with reference to the backbone heavy atoms (Cα, C, N) excluding oxygen of proteins in a map. The backbone recall was computed for each residue by determining the fraction of backbone heavy atoms within a 3 Å proximity to any grid points in the diffused map. This was then averaged across all residues in the map. Precision, on the other hand, was computed as the fraction of grid points within a 3 Å proximity to any backbone atoms. As a diffused map is to outline backbone atom positions within an input EM map, the volume of a map was in principle reduced, on average by 53.7%. This modification of maps notably elevated the average precision to 85.1% from 68.8% without significantly compromising the recall, which remained stable at an average of 93.1% from 96.6% (the original maps). Detailed results of individual maps are available in Supplementary Table 2.
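The recall and precision defined above can be written compactly as follows. This is a brute-force sketch for illustration only; the function names are hypothetical, and treating "within 3 Å" as a distance less than or equal to 3 Å is our assumption.

```python
import math


def backbone_recall(residues, grid_points, cutoff=3.0):
    """Per-residue fraction of backbone atoms near the diffused map, averaged.

    residues    : list of residues, each a list of (x, y, z) backbone atoms
                  (Ca, C, N)
    grid_points : list of (x, y, z) grid points of the diffused map
    """
    fractions = []
    for atoms in residues:
        near = sum(
            1 for a in atoms
            if any(math.dist(a, g) <= cutoff for g in grid_points)
        )
        fractions.append(near / len(atoms))
    return sum(fractions) / len(fractions)


def backbone_precision(residues, grid_points, cutoff=3.0):
    """Fraction of diffused-map grid points near any backbone atom."""
    atoms = [a for res in residues for a in res]
    near = sum(
        1 for g in grid_points
        if any(math.dist(g, a) <= cutoff for a in atoms)
    )
    return near / len(grid_points)
```

For real maps, a spatial index such as a k-d tree would replace the nested loops, but the quantities computed are the same.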
Figure 2b illustrates the impact of backbone recall within diffused maps on the subsequent accuracy of structure fitting. To assess the accuracy of modeled protein complexes, we employed MM-align43 to superimpose a modeled complex structure onto the reference structure (the PDB entry of the complex associated with the EMDB entry of the map) and calculated the TM-Score35. The TM-Score is a dimensionless metric used to gauge structural resemblance between two protein structures, with a value of 1 denoting identical protein pairs and values exceeding 0.5 indicative of meaningful similarity. Across the entire spectrum of observed backbone recall within this dataset, the TM-Score of modeled structures was consistently high, exceeding 0.9 for the majority of cases. On average, DiffModeler achieved a high TM-Score of 0.922. There were two instances where the TM-Score fell below 0.8. Notably, in one particular case (EMD-1871), despite a high backbone recall of 0.98 (close to 1.0), the TM-Score remained 0.781 (close to 0.8). This resulted from two incorrect single-chain structure fittings caused by the low backbone precision of 0.64.
In Fig. 2c, we compared the TM-Score of models constructed by DiffModeler with three existing methods: the dock_in_map program in Phenix10, EMBuild44, and structure fitting by the raw VESPER16. For the latter, the original EM maps were used instead of the diffused maps for structure fitting. EMBuild is a recent method for fitting AF2 models within a cryo-EM map, which combines structure fitting, domain-based refinement, and graph-based iterative assembly. For all cases except two (or one) against Phenix, DiffModeler exhibited significantly higher TM-Scores compared to the three methods. The models produced by DiffModeler demonstrated an average TM-Score of 0.922. In contrast, VESPER (raw), Phenix, and EMBuild showed a broad spectrum of model accuracy, with average TM-Scores of 0.407, 0.409, and 0.841, respectively; the averages of VESPER (raw) and Phenix were less than half that of DiffModeler. The notable contrast between DiffModeler and VESPER (raw) highlights the substantial positive impact of using diffused maps.
Figs. 2d to 2g explore the relationship between model accuracy and both map resolution (Fig. 2d) and the size of the complexes (Fig. 2e, 2f and 2g). While the performance of other methods noticeably declined at worse resolutions and for larger structures, DiffModeler consistently maintained stable performance and notably outperformed the other methods in challenging scenarios involving lower resolutions or larger sizes. Fig. 2f compares the sequence identity of different methods relative to the structure size, where sequence identity is the fraction of residues in the reference structure that were successfully modeled with the correct residue type. DiffModeler maintained a stable sequence identity, whereas that of all other methods decreased dramatically for large structures. On average, DiffModeler, VESPER (raw), Phenix, and EMBuild yielded sequence identity of 0.894, 0.742, 0.31 and 0.29, respectively. Fig. 2g investigates model accuracy with respect to complex size using a different metric, the root-mean-square deviation (RMSD) of the aligned residues in the model, defined as the residues that are aligned by MM-align’s superimposition. On average, DiffModeler, VESPER (raw), Phenix, and EMBuild yielded RMSD values of 3.89 Å, 10.09 Å, 10.48 Å, and 4.08 Å, respectively. Supplementary Table 3 provides the results of individual maps for different methods.
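For readers unfamiliar with the TM-Score, the standard formula can be sketched as below for a fixed superposition. This only evaluates the score from per-residue distances that are already given; MM-align additionally searches for the chain assignment and superposition that maximize it, which is not shown. The function names are hypothetical, and the lower bound of 0.5 Å on d0 is a guard we added for very short references.

```python
def tm_d0(ref_length):
    """Length-dependent distance scale d0 of the TM-Score, in angstroms."""
    d0 = 1.24 * (ref_length - 15) ** (1.0 / 3.0) - 1.8 if ref_length > 15 else 0.0
    return max(d0, 0.5)  # assumed guard for very short reference structures


def tm_score(aligned_distances, ref_length):
    """TM-Score for one superposition.

    aligned_distances : distances (angstroms) between aligned residue pairs
    ref_length        : number of residues in the reference structure
    """
    d0 = tm_d0(ref_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in aligned_distances)
    return total / ref_length
```

Because the sum is normalized by the reference length, residues that are missing from the model lower the score: a model that perfectly covers half of a 100-residue reference scores 0.5.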
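The two additional metrics used here, RMSD over aligned residues and sequence identity, can be sketched as follows. The functions assume that the superimposition and residue alignment (performed by MM-align in this work) have already been done; the function names and the input representation are hypothetical.

```python
import math


def rmsd(model_coords, ref_coords):
    """Root-mean-square deviation over aligned, already superimposed atoms."""
    if len(model_coords) != len(ref_coords) or not model_coords:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(m, r) ** 2 for m, r in zip(model_coords, ref_coords))
    return math.sqrt(squared / len(model_coords))


def sequence_identity(aligned_pairs, n_ref_residues):
    """Fraction of reference residues modeled with the correct residue type.

    aligned_pairs  : list of (reference_residue_type, model_residue_type)
    n_ref_residues : total number of residues in the reference structure
    """
    matches = sum(1 for ref, model in aligned_pairs if ref == model)
    return matches / n_ref_residues
```

Note that the RMSD is computed only over aligned residues, whereas sequence identity is normalized by the full reference length, so a method that places few chains can show a moderate RMSD together with a low sequence identity.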
Examples of Protein Complex Structure Models
In this section, we discuss five examples of models constructed by DiffModeler. In Fig. 3, for each example map, five panels are shown: the original experimental map, the diffused backbone map, LDPs of the traced backbone, the structure models, and a structure comparison between the constructed model and the PDB entry. The first example (Fig. 3a) is the state 2 of Mus musculus TRPML1 (EMD-6824, resolution: 7.4 Å), which encompasses four protein chains totaling 1,696 residues45. The resolution of this map was reported as 7.4 Å in the paper45, but it may be even worse because ResMap46, a map resolution estimation program, reported 9.4 Å when we ran it. Modeling the interaction between the transmembrane domain and the peripheral domain was particularly difficult for this map, resulting in low TM-Scores of 0.30 and 0.47 using Phenix and VESPER (raw), respectively. In contrast, DiffModeler traced the backbone well with the diffusion model and thus demonstrated remarkable accuracy in structure modeling, achieving a TM-Score of 0.95 and an align ratio of 1.0, thereby providing a very precise structure for this challenging map.
The next example (Fig. 3b) is the closed conformation of Cx26 Gap junction channels (GJCs) at acidic pH (EMD-20916)47. This complex is difficult to model because it has 12 chains in a map of a relatively low resolution, 7.5 Å. DiffModeler was able to precisely identify helices in the map and correctly fit the 12 chains with a TM-Score of 0.88. In contrast, VESPER (raw) struggled to find correct poses of the chains, resulting in a TM-Score of 0.24.
Fig. 3c is the model for the human peptide-loading complex (PLC) editing module (EMD-3906, resolution: 5.8 Å)48. Modeling the full protein complex is difficult due to the substantial flexibility exhibited by calreticulin (the chain in purple) and the sparseness of the chain assembly. Fitting the structures to the original experimental map was challenging, as indicated by a low TM-Score of 0.5 by VESPER (raw). In contrast, DiffModeler achieved a high TM-Score of 0.95, demonstrating that the diffusion model was effective in capturing structural features in the map.
The next map is an example of a complex with many chains (Fig. 3d). It is the state 2 of a complex of the proteolytic core and the ATPase PAN (proteasome-activating nucleotidase) with 34 chains (EMD-213, resolution: 6.35 Å)49. DiffModeler was able to fit most of the subunits correctly, except for long helical domains located at the top of the complex in the figure, yielding a TM-Score of 0.97. In comparison, with the original map, VESPER (raw) was only able to fill about 20% of the structure, with a TM-Score of 0.20.
The last example (Fig. 3e) is a map from the minor state of T. thermophilus enzyme in complex with NADH (EMD-11237, resolution: 6.10 Å)50, which includes 15 chains. Fitting subunit structures to the original map was difficult as all the chains are α-helical and hard to distinguish as indicated by a low TM-Score of 0.64 by VESPER (raw). On the other hand, with the advantage of the map diffusion, DiffModeler showed accurate backbone structure tracing with backbone recall of 0.98 and a superior structure alignment with a TM-Score of 0.98 and an RMSD of 2.40 Å.
Fig. 4 illustrates the largest protein complex structure built by DiffModeler. This example is a map for proteasome in complex with ADP-AlFx (EMD-6693)51 determined at a 6.30 Å resolution. The complex comprises 47 protein chains totaling 13,462 amino acids. The diffusion model in DiffModeler achieved a 0.92 backbone tracing recall, laying a robust foundation for further protein complex structure modeling. Overall, the modeled complex showed high consistency with the native structure, as evidenced by a TM-Score of 0.94 and a sequence identity of 0.89. When individual chains are considered, 45 chains out of 47 chains were successfully modeled with an average sequence matching of 92.6%. The 45 chains include 17 individual chain structures shown in the figure, which appear in the front view of the complex. The high modeling accuracy is clearly due to the application of the diffusion model to the map, as VESPER (raw) alone only achieved 0.25 TM-Score. In contrast, EMBuild yielded a TM-Score of 0.88 and a sequence identity of only 0.47, which indicates many chains were not placed into the correct map regions.
Structure Modeling on cryo-EM maps at a lower resolution range
We further conducted an additional benchmark of DiffModeler on cryo-EM maps determined at low resolutions (10 to 18 Å). There were four maps in EMDB that are in this resolution range and satisfy the map selection criteria we used, i.e., maps with a corresponding PDB entry that has a cross correlation and an overlap higher than 0.65 with the map, and for which all the chains have an AF2 model with a TM-Score of 0.5 or higher. The modeling results are shown in Fig. 5 and detailed performance metrics are provided in Supplementary Table 4. For these four maps, the average TM-Score of models by DiffModeler was 0.74, while those of EMBuild, Phenix, and VESPER (raw) were 0.32, 0.36 and 0.27, respectively, which are below the cutoff of 0.5 that indicates meaningful structural similarity.
The first example (Fig. 5a) is ATP-Bound States of GroEL (EMD-1042, resolution: 10.3 Å) 52, comprising 14 chains with a total of 7,238 residues. Due to the low resolution of the map, the authors manually determined this structure by fitting individual chain structures while considering symmetry information. In contrast, DiffModeler demonstrated the capability to model the complete atomic structure automatically and accurately, achieving a TM-Score of 0.97 and an RMSD of 3.88 Å. The structural superimposition of the model with the corresponding PDB entry visually confirms the accuracy of the model.
The next one (Fig. 5b) is a 10.3 Å map from the anaerobic fatty acid beta oxidation trifunctional enzyme (anEcTFE) octameric complex (EMD-16134)53. The complex has eight chains, which form a dimer of tetramers, shown as the left and right volumes in the map in the figure. Notably, the original investigators faced challenges in directly resolving the structure at this low resolution. Their strategy involved first elucidating the structure from a tetramer map with a resolution of 3.55 Å. This process used fitting and real-space refinement techniques, incorporating the crystal structure information from the PDB entry 6DV254. Subsequently, they docked the solved structure into the low-resolution map and conducted further refinement to achieve the final structure. In marked contrast, DiffModeler automated the assembly of the entire protein complex based on the low-resolution map, achieving a high TM-Score of 0.87. Structures derived from EMBuild, Phenix and VESPER had TM-Scores of 0.30, 0.30 and 0.19, respectively, emphasizing the distinct advantage offered by DiffModeler.
The third map was determined at an even lower resolution of 16.5 Å (Fig. 5c). The map contains 14 chains of a cofilactin filament inside the microtubule lumen (EMD-16877). The authors determined the structure by manually fitting a cofilactin filament model (PDB: 5YU8) to the density, followed by a local refinement55. The model built by DiffModeler had a TM-Score of 0.60, which was substantially higher than the values of EMBuild, Phenix, and VESPER (raw), 0.17, 0.25 and 0.18, respectively, which failed to capture even the overall fold. The model by DiffModeler captured the overall shape of the complex. However, only 4 out of 14 chains were successfully aligned (sequence identity: 0.99). Some chains, e.g. chains E, H, and K, were placed in the correct region of the map but with an incorrect alignment.
In the last panel (Fig. 5d), we illustrate a case where DiffModeler’s performance was relatively poor. The presented structure is derived from an 11.0 Å resolution map of a 12-chain complex of MecA-ClpC with ATP and Walker B mutations introduced in the D2 ring (EMD-5608). The authors employed a complex manual procedure for structure determination: First, they built an initial model based on another crystal structure of ClpC (PDB: 3PXI) and employed MODELLER56 to fill in missing loops using other related structures as templates (PDB: 1JBK and 1R6B). Subsequently, the structure was manually docked into the cryo-EM maps, followed by flexible fitting using NAMD57. The model generated by DiffModeler had an overall TM-Score of 0.51, a barely significant score for structure modeling. Among the 12 chains, 4 chains (C, D, E, and F) were modeled successfully with an average TM-score of 0.73 and sequence identity of 0.73. The rest of the chains were placed in incorrect regions of the map. The TM-Scores of EMBuild, Phenix, and VESPER (raw) were even worse: 0.26, 0.30, and 0.19, respectively.
To our knowledge, DiffModeler is the first method capable of automatically modeling protein complexes from maps in this low-resolution range. It distinctly demonstrates its advantage over existing methods.
Structure Modeling for Maps at a Higher Resolution
Although the primary focus of DiffModeler is on maps of low resolution, typically up to around 15 Å, where it can demonstrate its unique strengths over other existing structure modeling methods, it also performs effectively with higher resolution maps. To illustrate this versatility, we employed DiffModeler on maps with better than 5 Å resolution. We conducted benchmarking using two distinct datasets: one used in the paper of CryoREAD9 and another employed in the work of ModelAngelo58. These datasets cover a broad spectrum of structures, encompassing protein-DNA/RNA complexes and protein-only configurations. The CryoREAD dataset comprised 61 maps (excluding those containing only nucleic acids), while the ModelAngelo dataset included 28 maps. The datasets varied in the number of protein chains, ranging from 1 to 48 chains, and the number of amino acids, ranging from 447 to 17,947. On these datasets, we used the identical model and pipeline of DiffModeler without any alterations. For maps with protein-DNA/RNA complexes, we first used CryoREAD to construct DNA/RNA structures and segmented the protein regions within the input maps. Subsequently, DiffModeler was employed to build protein complex structures. In this context, we used BLAST59 to search against the AlphaFold2 database and obtain the most similar single-chain AlphaFold2 structures, which were fit to the map. The modeling results of individual maps are provided in Supplementary Table 5.
In Fig. 6a-c, we depict TM-Score, sequence identity, and RMSD (Å) as functions of the protein complex size. For the CryoREAD/ModelAngelo datasets, the average TM-Score, sequence identity, and RMSD were 0.879/0.907, 0.851/0.864, and 3.08/2.79 (Å), respectively, which are comparable to the results obtained for the original dataset of 5.0 to 10.0 Å resolutions as shown in Fig. 2. TM-Score and sequence identity consistently exhibited high values for larger complexes, mirroring the trends observed in the original dataset (Fig. 2). RMSD, on the other hand, showed a slight increase as the structure size increases. However, it remained below 5 Å for the majority (88.5% for the CryoREAD dataset, 89.3% for the ModelAngelo dataset) of the cases.
While DiffModeler exhibited strong performance for the majority of cases, there were instances where it performed poorly, indicated by TM-Score or sequence identity values lower than 0.6. One contributing factor to these cases was the failure in predicting the AF2 chain structure (e.g., EMD-12935, EMD-27705, represented by the lower-left orange dots in Fig. 6a and 6b). There are also two cases with low sequence identity but a high TM-Score (two orange data points in Fig. 6b with sequence identity of around 0.5). These cases are hetero-oligomeric complexes whose chains have different sequences but high structural similarity, so that chains were placed in the equivalent positions of different chains.
While this work primarily focuses on protein structure modeling with DiffModeler, we also extended our modeling efforts to include nucleic acid structures within these maps using CryoREAD9. Backbone and sequence recall (the fraction of sugar and phosphate atoms that were placed within 5 Å and the fraction of nucleotides that were correctly identified in the model) were 0.855 and 0.523 on the CryoREAD dataset and 0.829 and 0.413 on the ModelAngelo dataset. In Fig. 6f, we present a model of a protein-RNA complex structure featuring the RqcH DR variant bound to 50S-peptidyl-tRNA-RqcP RQC (EMD-13017, resolution: 3.2 Å)60. This complex encompasses 3,818 residues and 2,996 nucleotides. Notably, the model demonstrates high accuracy, with a TM-Score of 0.92 for the protein region and a backbone recall of 0.94 for the remaining RNA region, modeled using DiffModeler and CryoREAD, respectively.
Discussion
DiffModeler is a novel structure modeling method, which uniquely targets low resolution cryo-EM maps of 5-15 Å. Within this target resolution range, accurate modeling of protein complex structures presents significant challenges. Not only is de novo structure modeling4,5,37,58 difficult, but fitting known structures to the map is also difficult, as shown in the results with VESPER (raw). The presence of noisy density in cryo-EM maps makes it exceedingly difficult to detect precise atom and amino acid positions as well as main-chain conformations in the map. DiffModeler overcomes these obstacles by sculpting out main-chain conformations from low resolution maps using a diffusion model and by representing the salient points with LDPs, which enables substantially higher accuracy in structure fitting. The benchmark of DiffModeler on higher resolution maps, better than 5 Å, further indicated its generalizability and accuracy. DiffModeler may appear similar to existing deep learning-based methods61,62, which modify or sharpen maps, but it is distinct because it does not simply sharpen map density; rather, it is a multi-step pipeline that outputs a complex structure model as the end product.
Although DiffModeler has demonstrated overall accuracy and effectiveness, it is crucial to address the limitations of the current version. First, in some regions with low local resolution, the backbone tracing of the diffusion model may be inaccurate, leading to incorrect structure fitting. To address this issue, further enhancements can be made to prioritize the fitting of regions with higher local resolution, mitigating the risk of such errors. Secondly, as of now, DiffModeler exclusively supports protein complex structure modeling. To expand its applicability, future developments will aim to extend its capabilities to support protein/DNA/RNA complex structure modeling, enhancing its versatility in addressing a wider range of biological systems. Furthermore, for high-resolution cryo-EM maps (better than 4 Å), it will be essential to develop local structure refinement approaches that leverage the density information to refine predicted structures, further enhancing accuracy and reliability. Addressing these limitations is left for future development.
We firmly believe that DiffModeler will prove to be an indispensable and user-friendly tool for protein complex structure modeling, bridging a crucial gap in the availability of tools suitable for maps at low resolutions. The approach will also be applicable for cryo-electron tomography within the same resolution range, better than 15 Å, which is now increasingly available63,64.
Funding
This work was partly supported by the National Institutes of Health (R01GM133840, 3R01 GM133840-02S1) and the National Science Foundation (DMS2151678, DBI2003635, CMMI1825941, MCB2146026, and MCB1925643). XW is a recipient of the MolSSI graduate fellowship.
Author contributions
DK conceived the study. XW designed and implemented DiffModeler and computed results. HZ and GT optimized the VESPER algorithm and participated in implementing the full pipeline. All the authors analyzed the results. XW drafted the manuscript and DK edited it. All the authors read and approved the manuscript.
Competing interests
The authors declare that there are no competing interests.
Data and materials availability
The source code of DiffModeler is available at https://github.com/kiharalab/DiffModeler. It can be run freely on our webserver, https://em.kiharalab.org/algorithm/DiffModeler, without installing it on a local machine. We also provide a sequence version of DiffModeler on our server, https://em.kiharalab.org/algorithm/DiffModeler(seq), which automatically uses the sequence information to find the most similar single-chain structure from the RCSB and AlphaFold databases and then models the full protein complex structure. The source code of ComplexModeler (including DiffModeler and CryoREAD) for protein-DNA/RNA complex structure modeling is available at https://github.com/kiharalab/ComplexModeler. It is also available on our webserver, https://em.kiharalab.org/algorithm/ComplexModeler.
Methods
Constructing the benchmark dataset
Following the protocols employed in our previous works19,20,37,65, we compiled a dataset of experimental cryo-EM maps for training, validating, and testing DiffModeler. Initially, we sourced cryo-EM maps from EMDB (as of January 26th, 2023) with resolutions between 5 Å and 10 Å that had corresponding deposited structures in PDB with more than 20 residues. We kept only maps that contain only proteins. This initial screening yielded 840 maps.
Subsequently, we assessed the quality of structure-to-map fit by measuring the cross-correlation and overlap between the EM maps and simulated maps generated from their respective structures in PDB17. Maps were discarded if their corresponding structures displayed a cross-correlation and overlap below 0.65. The remaining maps were manually inspected. These steps reduced the number of maps to 337.
To remove redundancy in the data, we applied single-linkage clustering based on the sequence identity of proteins within each map. Two maps were placed in the same group if any pair of protein chains, one from each map, exhibited a global sequence identity of 25% or higher. This clustering procedure resulted in 103 clusters. Out of the 103 clusters, we randomly allocated 68 clusters (230 maps) to the training set, 18 clusters (36 maps) to validation, and 17 clusters (71 maps) to testing (Supplementary Table 1). It is important to note that the training, validation, and testing sets are fully independent of each other. Finally, we removed from the testing set the maps that contained chains whose predicted models in the AlphaFold Database18 were inaccurate, with a TM-score lower than 0.5. This filtering reduced the testing dataset from 71 maps to the final 19 maps.
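The single-linkage grouping described above can be sketched with a union-find structure. This is a minimal illustration rather than the script used in this work: the function name `cluster_maps`, the precomputed dictionary of highest chain-to-chain identities between map pairs, and the toy map identifiers are all assumptions.

```python
def cluster_maps(map_ids, max_identity, threshold=0.25):
    """Single-linkage clustering of maps by sequence identity.

    max_identity maps frozenset({map_a, map_b}) to the highest global
    sequence identity over all chain pairs of the two maps (a hypothetical,
    precomputed input). Maps linked directly or through a chain of
    above-threshold pairs end up in the same cluster.
    """
    parent = {m: m for m in map_ids}

    def find(m):
        # Follow parent links to the cluster root, compressing the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for pair, identity in max_identity.items():
        if identity >= threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for m in map_ids:
        clusters.setdefault(find(m), []).append(m)
    return sorted(sorted(c) for c in clusters.values())


# Toy example with made-up identifiers and identities.
ids = ["map_A", "map_B", "map_C", "map_D"]
identities = {
    frozenset({"map_A", "map_B"}): 0.40,
    frozenset({"map_B", "map_C"}): 0.30,
    frozenset({"map_A", "map_C"}): 0.10,
    frozenset({"map_C", "map_D"}): 0.05,
}
clusters = cluster_maps(ids, identities)
```

In the toy example, map_A and map_C fall into one cluster through map_B even though their direct identity is below 25%. Assigning whole clusters, rather than individual maps, to the training, validation, and testing sets is what keeps the splits independent.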
Pre-processing of map data
If a map had a grid size that is different from 1.0 Å, we interpolated the grid size to 1.0 Å using trilinear interpolation. The density values within a map were normalized to [0.0, 1.0] with a minimum-maximum normalization. Any negative values in a map were set to 0, and 0 was used as the minimum value for normalization. We set the maximum value for normalization as the 98th percentile density value, and any density values above that were capped at 1.0.
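The density normalization can be written in a few lines of NumPy. The sketch below is illustrative only: the function name `normalize_density` is hypothetical, the resampling to a 1.0 Å grid is not shown, and computing the 98th percentile over all grid points after clipping negatives (rather than over positive values only) is an assumption, since the text does not specify it.

```python
import numpy as np


def normalize_density(density, percentile=98.0):
    """Scale a density map to [0, 1].

    Negative values are set to 0, the given percentile is used as the
    maximum for min-max normalization, and larger values are capped at 1.
    """
    density = np.clip(np.asarray(density, dtype=np.float32), 0.0, None)
    upper = np.percentile(density, percentile)
    if upper <= 0:
        # Map without positive density: nothing to scale.
        return np.zeros_like(density)
    return np.minimum(density / upper, 1.0)


# Toy map with random values (made-up data, for illustration only).
rng = np.random.default_rng(0)
toy_map = rng.normal(loc=0.1, scale=1.0, size=(16, 16, 16))
normalized = normalize_density(toy_map)
```

Capping at a high percentile instead of the absolute maximum keeps a few extreme density values from compressing the rest of the map into a narrow range.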
From each map, boxes of size 64³ Å³ were collected by scanning the box across the map along the three axes with a stride of 32 Å. Each grid point within the box was assigned a label indicating whether it belonged to the backbone. If a grid point was within 2.0 Å of any backbone atom, it was labeled as backbone. Otherwise, the point was considered background. A box was excluded from training if less than 0.1% of its grid points were labeled as backbone.
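The box scanning and labeling can be sketched as follows. This is a minimal illustration under stated assumptions: the grid spacing is 1 Å with the map origin at (0, 0, 0), atom coordinates are already in grid units, and the function names `box_origins`, `label_backbone`, and `keep_box` are hypothetical. Handling of map edges that do not align with the stride is not specified in the text and is not covered here.

```python
import numpy as np


def box_origins(shape, box=64, stride=32):
    """Lower corners of the boxes scanned along the three axes."""
    axes = [range(0, max(n - box, 0) + 1, stride) for n in shape]
    return [(i, j, k) for i in axes[0] for j in axes[1] for k in axes[2]]


def label_backbone(shape, backbone_atoms, cutoff=2.0):
    """Binary grid: 1 where a grid point is within `cutoff` of a backbone atom."""
    labels = np.zeros(shape, dtype=np.uint8)
    reach = int(np.ceil(cutoff))
    for atom in np.asarray(backbone_atoms, dtype=float):
        base = np.floor(atom).astype(int)
        lo = np.maximum(base - reach, 0)
        hi = np.minimum(base + reach + 1, shape)
        if np.any(lo >= hi):
            continue  # atom lies outside the grid
        # Only test the small neighborhood of grid points around the atom.
        axes = [np.arange(a, b) for a, b in zip(lo, hi)]
        grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
        near = np.linalg.norm(grid - atom, axis=-1) <= cutoff
        labels[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]][near] = 1
    return labels


def keep_box(label_box, min_fraction=0.001):
    """A box is kept for training if at least 0.1% of its points are backbone."""
    return label_box.mean() >= min_fraction


# Toy example: a 96^3 grid with a single made-up backbone atom.
origins = box_origins((96, 96, 96))
labels = label_backbone((96, 96, 96), [[10.0, 10.0, 10.0]])
first_box = labels[:64, :64, :64]
```

With a stride of half the box size, neighboring boxes overlap by 32 Å, so most grid points are covered by more than one box.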
Training the conditional diffusion model of DiffModeler
Given the density information from cryo-EM maps, the objective of the diffusion model of DiffModeler is to generate the backbone labels in the map. We employed a conditional diffusion model, specifically the denoising diffusion implicit model (DDIM)66, for its superior generation quality and efficiency. Inspired by the Pix2Seq28 framework, we designed an encoder-decoder network architecture (Extended Data 1). The encoder scans the input density map with a box of size 64³ and embeds (outputs) hidden features of the map. The decoder takes three components as input in the conditional diffusion framework: the condition (the starting cryo-EM density map and hidden features), the noised backbone $x_t$ at timestep $t$, and the time $t$ of the current step. From these inputs, the decoder outputs the predicted traced backbone $y_t$. The noised backbone $x_t$ is a mixture of the ground-truth traced backbone density $x_0$ and the Gaussian noise $\epsilon$ determined by the timestep $t$, which will be explained later. The encoder and the decoder are optimized simultaneously by comparing the predicted traced backbone $y_t$ and the ground-truth traced backbone $x_0$. The encoder and decoder neural network architectures are shown in Extended Data 2a and 2b, respectively. The detailed network architecture of each component of the encoder and the decoder is shown in Extended Data 3.
As mentioned in the dataset section above, we allocated 230 maps for training and 36 maps for validation of the conditional diffusion model. For each training batch, we randomly sampled 8 boxes from the 230 maps. In total, around 16,000 and 3,500 boxes were used per epoch for training and validation, respectively. The framework was trained for 30 epochs, and the final model was selected based on validation performance.
The main objective of the model is to perform conditional denoising of a noisy density of the traced backbone to recover the ground-truth traced protein backbone density $x_0$ in the map. For training the model, a series of noisy traced backbone density maps were generated by mixing the ground-truth traced backbone density with Gaussian noise:
$$x_t = \sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon \qquad (1)$$
where $x_t$ is the noised traced backbone at timestep $t$, $\alpha_t$ is a cosine scheduling function shown in Eq. (2), $x_0$ is the ground-truth traced backbone, and $\epsilon$ is a noise variable randomly sampled from the standard Gaussian distribution, $N(0, I)$. The ground-truth density of the traced protein backbone was prepared by assigning the backbone label to each grid point based on the backbone atoms (N, Cα, C) of the corresponding native structure. If a grid point in the map was within 2.0 Å of any backbone atom, it was labeled as backbone. Otherwise, the point was considered background.
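The noising step of Eq. (1) can be sketched in NumPy. Two things are assumed here and labeled as such in the code: the mixing takes the standard variance-preserving form $\sqrt{\alpha_t}\,x_0 + \sqrt{1-\alpha_t}\,\epsilon$, and the cosine schedule is $\alpha_t = \cos^2(\pi t / 2)$, which stands in for Eq. (2) and may differ from the schedule actually used. The function names are hypothetical.

```python
import numpy as np


def cosine_alpha(t):
    """Assumed cosine schedule: alpha = 1 at t = 0 (clean), 0 at t = 1 (noise)."""
    return np.cos(0.5 * np.pi * t) ** 2


def noise_backbone(x0, t, rng):
    """Mix the ground-truth backbone x0 with Gaussian noise at time t in [0, 1].

    Assumes the standard variance-preserving form of the forward process.
    Returns the noised box and the noise that was added.
    """
    alpha = cosine_alpha(t)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * eps, eps


# Toy ground-truth backbone label box (made-up data).
rng = np.random.default_rng(0)
x0 = np.zeros((8, 8, 8))
x0[2:5, 2:5, 2:5] = 1.0
x_clean, _ = noise_backbone(x0, 0.0, rng)    # t = 0 returns x0 unchanged
x_noisy, eps = noise_backbone(x0, 1.0, rng)  # t = 1 is essentially pure noise
x_half, _ = noise_backbone(x0, 0.5, rng)     # equal mix of signal and noise
```

Under this schedule, the backbone signal dominates at small t and is almost fully replaced by noise near t = 1, so a decoder trained over uniformly sampled t sees every noise level.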
During the training process, $t$ was uniformly sampled from [0, 1] for each map in the training set at each iteration so that the framework captures the whole diffusion process. The noised backbone $x_t$ for time $t$ was obtained according to Eq. (1), from which the decoder computes the predicted backbone map $y_t$. The loss of $y_t$ was computed in comparison with the ground-truth backbone $x_0$. The Dice loss36 was used, as defined in Eq. (3), where $L_{Dice}$ represents the Dice loss between a predicted box $P$ of prediction $y_t$ at timestep $t$ and the corresponding ground-truth box $G$ of ground truth $x_0$; $N$ is the total number of grid points inside the box; $p_i \in P$ is the predicted probability of the $i$-th grid point in the predicted box; $g_i \in G$ is the binary ground truth of the $i$-th grid point, where 1 denotes the existence of backbone structure at the grid point and 0 indicates background; $\varepsilon$ is a smoothing factor with a value of 1e-6; $L$ is the overall loss of a batch of $B$ examples; and $L_{Dice}(k)$ represents the Dice loss of the $k$-th example. Here, different samples in the same batch may have different timesteps $t$, since $t$ is uniformly and independently sampled for each example.
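For illustration, a common form of the smoothed Dice loss is sketched below in NumPy. This is an assumption rather than a reproduction of the loss used in training: Dice variants differ (for example, some square the terms in the denominator), and averaging the per-example losses over the batch is also assumed. The function names are hypothetical.

```python
import numpy as np


def dice_loss(pred, truth, smooth=1e-6):
    """Smoothed Dice loss between predicted probabilities and binary labels.

    Assumed form: 1 - (2 * sum(p * g) + s) / (sum(p) + sum(g) + s).
    """
    pred = np.asarray(pred, dtype=float).ravel()
    truth = np.asarray(truth, dtype=float).ravel()
    intersection = np.sum(pred * truth)
    return 1.0 - (2.0 * intersection + smooth) / (pred.sum() + truth.sum() + smooth)


def batch_loss(preds, truths):
    """Overall loss of a batch, assumed to be the mean of per-example losses."""
    return float(np.mean([dice_loss(p, g) for p, g in zip(preds, truths)]))


# Toy boxes (made-up data): a perfect and a completely wrong prediction.
truth = np.zeros((4, 4, 4))
truth[1:3, 1:3, 1:3] = 1.0
perfect = dice_loss(truth, truth)
wrong = dice_loss(1.0 - truth, truth)
mixed = batch_loss([truth, 1.0 - truth], [truth, truth])
```

Because the Dice loss depends on the overlap relative to the total labeled volume, it is less dominated by the large background class than a per-voxel loss would be, which matters when backbone voxels are a small fraction of each box.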
We tested hyperparameter combinations of a learning rate of [1e-3, 1e-4, 1e-5] with a weight decay of [0, 1e-6, 1e-5, 1e-4] using the Adam optimizer67. Among the combinations, the learning rate of 1e-4 without weight decay showed the best grid-wise Intersection-over-Union (IoU), 0.562, on the validation set. Training and validation of the conditional diffusion model took around 5 days. The computations were performed on two NVIDIA RTX A6000 48 GB GPUs running in parallel and connected via NVLink.
Inference of the conditional diffusion model in DiffModeler
With the trained conditional diffusion model, we compute the traced backbone conditioned on the input cryo-EM density. The inference of the conditional diffusion model is presented in Extended Data 4. Given a box of cryo-EM density, the encoder of the conditional diffusion model first embeds the hidden features of the input density box. Subsequently, the decoder starts with random Gaussian noise as the initial distribution $x_T$ and iteratively refines the estimated density from $t = T$ to $t = 0$ to bring it closer to the ground-truth traced backbone $x_0$, conditioned on the hidden features from the encoder and the initial density input.
Because training used uniformly sampled timesteps, we have the flexibility to choose the overall number of inference steps $T$. We chose $T = 100$, as we did not observe significant performance improvement with $T$ larger than 100. The timestep $t$ at inference iteration $i$ is computed from $i$ and the overall number of inference steps $T$ according to Eq. (4).
The first iteration of the inference starts at timestep $T$. The decoder takes the random Gaussian noise $x_T$, the embedding of timestep $T$, and the condition (i.e., the hidden feature embedding and the original cryo-EM map) as input and outputs $y_T$.
In the following iterations with timestep $t = T-1, T-2, \ldots, 0$, the condition inputs are the same and the embedding of timestep $t$ is obtained with Eq. (4). However, the noisy backbone input $x_t$ for the decoder differs from that used in training. During training, $x_t$ is computed following Eq. (1), which uses the ground-truth traced backbone $x_0$. As $x_0$ is not available in the inference stage, the input of the decoder, $x_t$, is computed from the decoder's output $y_{t+1}$ at timestep $t+1$:
$$x_t = \sqrt{\alpha_t}\,y_{t+1} + \sqrt{1-\alpha_t}\,\epsilon \qquad (5)$$
where $x_t$ is the estimated noised backbone at timestep $t$, $y_{t+1}$ is the decoder output at $t+1$, and $\epsilon$ is the Gaussian noise. In this equation, $\epsilon$ is estimated by comparing the decoder's noisy backbone input $x_{t+1}$ and its corresponding backbone estimation output $y_{t+1}$ as follows:
$$\epsilon = \frac{x_{t+1} - \sqrt{\alpha_{t+1}}\,y_{t+1}}{\sqrt{1-\alpha_{t+1}}} \qquad (6)$$
By combining Eq. (5) and Eq. (6), we can obtain the decoder input $x_t$ from the decoder output $y_{t+1}$ at timesteps $t = T-1, T-2, \ldots, 0$. The inference process of the decoder is repeated $T = 100$ times, and $x_0$ at timestep $t = 0$ is our final estimated backbone.
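The iterative inference can be sketched as a short sampling loop. The sketch assumes the standard deterministic DDIM update: the noise is estimated from the decoder's current input and output, then re-mixed with the output at the next, lower noise level. The schedule $\alpha = \cos^2(\pi t / 2)$, the mapping of the 100 iterations to evenly spaced times between 1 and 0, the function names, and the `decoder(x, t, condition)` call signature are all assumptions; the stand-in decoder in the example is not the trained network.

```python
import numpy as np


def cosine_alpha(t):
    """Assumed cosine schedule: alpha = 1 at t = 0 (clean), 0 at t = 1 (noise)."""
    return np.cos(0.5 * np.pi * t) ** 2


def ddim_sample(decoder, condition, shape, steps=100, seed=0):
    """Deterministic DDIM-style sampling with `steps` decoder calls.

    `decoder(x, t, condition)` is a hypothetical callable that returns the
    estimated clean backbone map y for the noisy input x at time t.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)            # start from pure Gaussian noise
    times = np.linspace(1.0, 0.0, steps + 1)  # assumed evenly spaced times
    for t_now, t_next in zip(times[:-1], times[1:]):
        y = decoder(x, t_now, condition)
        a_now, a_next = cosine_alpha(t_now), cosine_alpha(t_next)
        # Noise implied by the current input and the current estimate.
        eps = (x - np.sqrt(a_now) * y) / np.sqrt(max(1.0 - a_now, 1e-12))
        # Re-noise the estimate to the next, lower noise level.
        x = np.sqrt(a_next) * y + np.sqrt(1.0 - a_next) * eps
    return x


# Stand-in decoder that always returns a fixed target (illustration only).
target = np.zeros((8, 8, 8))
target[3:5, 3:5, 3:5] = 1.0
result = ddim_sample(lambda x, t, cond: target, condition=None, shape=target.shape)
```

With a decoder that always predicts the same map, the loop converges to that map at t = 0, which is a convenient sanity check of the update rule before plugging in a real network.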
Single-chain structure fitting using VESPER
We used VESPER16 for fitting AF2 models of individual proteins to the map modified by the diffusion model. AF2 models of the protein chains were taken from the AlphaFold Database18. Supplementary Table 2 provides the TM-scores of the chains; the average TM-score was 0.922. The fitting process involved three main steps. First, AF2 models were transformed into simulated maps at 1 Å resolution using TEMPy68. Second, we simplified both the modified EM map and the simulated maps of the AF2 models into maps of local representative density points. This was achieved using the mean-shift algorithm41, an approach we introduced in our earlier work, MAINMAST4. Finally, VESPER was used to globally align AF2 models in various poses within the representative map, generating a fit score for each pose. The top 100 poses were retained as pose candidates for each subunit.
The mean-shift algorithm is employed to compute maps of local representative density points by clustering density points within an EM map. First, grid points with a density exceeding 0 are identified. Then, the algorithm iteratively updates the coordinates of a grid point $x$ as a weighted average over neighboring grid points according to Eq. (7), where $N(x)$ is the neighborhood of $x$, i.e., the set of grid points neighboring $x$; the weights are given by a Gaussian kernel function with bandwidth $\sigma$, as shown in Eq. (8); and $\varphi(x)$ is the density value of the grid point $x$. The bandwidth $\sigma$ was set to 2. The mean-shift process is continued until convergence, with the convergence threshold $\delta$ set to 0.001.
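A density-weighted mean-shift update of this kind can be sketched in NumPy as follows. Several details are assumptions, since they are not recoverable from the text: the Gaussian kernel is taken as $\exp(-d^2 / 2\sigma^2)$, the neighborhood as all points within $3\sigma$ of the current position, and convergence as a displacement smaller than $\delta$ between iterations. The function name and the one-dimensional toy data are hypothetical.

```python
import numpy as np


def mean_shift(points, density, sigma=2.0, delta=0.001, radius=None, max_iter=500):
    """Shift each point toward a local maximum of the density.

    points  : (n, 3) coordinates of grid points with positive density
    density : (n,) density values of those points
    Assumed: Gaussian kernel exp(-d^2 / (2 sigma^2)), neighborhood radius of
    3 * sigma, and convergence when a point moves less than `delta`.
    """
    points = np.asarray(points, dtype=float)
    density = np.asarray(density, dtype=float)
    radius = 3.0 * sigma if radius is None else radius
    shifted = points.copy()
    for idx in range(len(points)):
        x = points[idx]
        for _ in range(max_iter):
            dist2 = np.sum((points - x) ** 2, axis=1)
            mask = dist2 <= radius ** 2
            weights = np.exp(-dist2[mask] / (2.0 * sigma ** 2)) * density[mask]
            if weights.sum() == 0:
                break  # no weighted neighbors: leave the point where it is
            new_x = (weights[:, None] * points[mask]).sum(axis=0) / weights.sum()
            moved = np.linalg.norm(new_x - x)
            x = new_x
            if moved < delta:
                break
        shifted[idx] = x
    return shifted


# Toy data: points on a line with two density peaks at x = 5 and x = 14.
xs = np.arange(20, dtype=float)
points = np.stack([xs, np.zeros(20), np.zeros(20)], axis=1)
density = np.exp(-(xs - 5.0) ** 2 / 8.0) + np.exp(-(xs - 14.0) ** 2 / 8.0)
shifted = mean_shift(points, density)
```

In the toy example, every point drifts to one of the two density peaks, which is the behavior that lets a dense map be summarized by a small set of representative points.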
Following the completion of the mean-shift process, we merged shifted points that were in close proximity. Points closer than a predefined threshold distance of 2.0 Å were clustered together, and the grid point with the highest density within the cluster was designated as the representative node. This clustering and selection process was iterated until the selected representative nodes converged. The resulting set of points, known as representative points, forms the basis for the representative map (Fig. 4).
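One simple way to realize this merging is a greedy selection in order of decreasing density, sketched below. Note that this single-pass greedy scheme is a stand-in chosen for brevity and is not necessarily the iterative clustering procedure used in this work; the function name and toy coordinates are hypothetical.

```python
import math


def merge_points(points, densities, threshold=2.0):
    """Keep the highest-density point of each group of nearby points.

    Points are visited from highest to lowest density; a point becomes a
    representative only if it is at least `threshold` away from every
    representative chosen so far (greedy stand-in for iterative clustering).
    """
    order = sorted(range(len(points)), key=lambda i: densities[i], reverse=True)
    chosen = []
    for i in order:
        if all(math.dist(points[i], points[j]) >= threshold for j in chosen):
            chosen.append(i)
    return [points[i] for i in chosen]


# Toy example: two groups of shifted points along the x axis (made-up data).
pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
       (5.0, 0.0, 0.0), (5.4, 0.0, 0.0)]
dens = [0.2, 0.9, 0.3, 0.4, 0.8]
representatives = merge_points(pts, dens)
```

Each group collapses to its densest member, so the output has one representative point per local density peak.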
By completing this stage, we acquired two distinct representative maps using the mean-shift algorithm: the subunit representative map (RMsubunit) derived from the simulated map of the AF2 single-chain structure, and the backbone representative map (RMbackbone) obtained from the diffusion-traced backbone map.
The final step involves using VESPER to globally align AF2 single-chain subunits in various poses within the backbone map. Specifically, VESPER aligns each RMsubunit to the RMbackbone obtained in the preceding step. For each subunit representative map RMsubunit, VESPER systematically explores all potential poses to align RMsubunit with RMbackbone. In VESPER's global search, we used a rotation scan interval of 10° and a translation scan interval of 2 Å. The fitness score of RMsubunit $i$ at pose $j$ is defined as:
$$\mathrm{subunit\_fitscore}(i, j) = \frac{P}{N} \qquad (9)$$
where $P$ is the number of Cα positions of subunit $i$ at pose $j$ that have representative points in RMbackbone within 3 Å, and $N$ is the total number of Cα positions of subunit $i$. The top 100 poses were kept for each subunit. The resulting pool comprises M × 100 pose candidates for a protein complex structure with M chains.
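The fitness score is the fraction of Cα positions of a posed subunit that have a representative point of the backbone map within 3 Å. A minimal sketch follows; the function name mirrors the score's name, while the coordinate lists are made-up toy data and the brute-force distance search is used only for clarity.

```python
import math


def subunit_fitscore(ca_coords, rep_points, cutoff=3.0):
    """Fraction of C-alpha positions with a representative point within `cutoff`."""
    if not ca_coords:
        return 0.0
    covered = sum(
        1 for ca in ca_coords
        if any(math.dist(ca, p) <= cutoff for p in rep_points)
    )
    return covered / len(ca_coords)


# Toy pose with four C-alpha atoms and two representative points.
ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
backbone_points = [(0.0, 1.0, 0.0), (7.0, 0.0, 0.0)]
score = subunit_fitscore(ca_atoms, backbone_points)
```

In the toy pose, the first and third Cα atoms are covered and the other two are not, giving a score of 0.5. For real subunits with thousands of atoms, a spatial index would be preferable to the brute-force search used here.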
Assembling subunits to generate the entire protein complex structure
Subunits, fitted to the map with different pose candidates, are then assembled into a complete protein complex structure model. We developed a greedy algorithm that iteratively assembles superimposed subunits within the map. The entire pipeline is depicted in Extended Data 4. As outlined in the preceding section, using VESPER, we generated 100 poses for each subunit in the map. Therefore, the subunit-pose pool for a given protein structure complex comprises M*100 pose candidates, all of which were scored using the subunit_fitscore (Eq. 9).
The initial step in the modeling process involves selecting the subunit-pose with the highest subunit_fitscore among all available poses. Subsequently, a local region within 20 Å of the fitted subunit-pose is masked out in the backbone map RMbackbone, and the subunit pose is further optimized in terms of the subunit_fitscore in that local region, with a rotation scan interval of 5° and a translation scan interval of 1 Å. Then, subunit-poses are removed from the subunit-pose pool if they belong to the subunit that was just selected or if they overlap significantly with the selected subunit-pose. A subunit-pose is considered to overlap if more than 10% of its Cα positions are closer than 3 Å to any Cα position of an already selected subunit-pose.
Following this, the subsequent best subunit-pose is selected iteratively until the subunit-pose pool is exhausted. In most cases, where each subunit assumes a correct pose, all M subunits are successfully fitted into the map. However, there are rare instances where not all subunits are selected due to significant overlap among all 100 poses of a subunit with other already-selected subunit-poses. In such scenarios, where some subunits remain unfitted due to substantial overlap, a new pose set is generated for these remaining subunits. This is achieved by fitting them to the remaining density regions within RMbackbone using VESPER. The same iterative process is then applied until all the subunits are successfully fitted.
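The core of the greedy selection, i.e., picking the best-scoring pose and pruning conflicting candidates, can be sketched as follows. This is a simplified illustration: the local pose refinement after each selection and the re-fitting of subunits left without a valid pose are omitted, and the function names, the tuple layout of the pool, and the toy coordinates are hypothetical.

```python
import math


def overlaps(candidate, selected, cutoff=3.0, max_fraction=0.10):
    """True if more than 10% of the candidate's C-alpha positions lie
    closer than `cutoff` to any C-alpha position of the selected pose."""
    close = sum(
        1 for a in candidate
        if any(math.dist(a, b) < cutoff for b in selected)
    )
    return close > max_fraction * len(candidate)


def greedy_assemble(pool):
    """Greedy assembly from a pool of (subunit_id, fit_score, ca_coords).

    Repeatedly takes the best-scoring remaining pose, then drops all other
    poses of the same subunit and all poses that overlap the chosen one.
    Local refinement and re-fitting of unplaced subunits are not shown.
    """
    remaining = sorted(pool, key=lambda c: c[1], reverse=True)
    selected = {}
    while remaining:
        subunit, _, coords = remaining.pop(0)
        selected[subunit] = coords
        remaining = [
            c for c in remaining
            if c[0] != subunit and not overlaps(c[2], coords)
        ]
    return selected


# Toy pool: two subunits with two candidate poses each (made-up data).
a1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
a2 = [(50.0, 0.0, 0.0), (53.8, 0.0, 0.0)]
b1 = [(0.5, 0.0, 0.0), (4.0, 0.0, 0.0)]   # clashes with a1
b2 = [(30.0, 0.0, 0.0), (33.8, 0.0, 0.0)]
pool = [("A", 0.90, a1), ("A", 0.80, a2), ("B", 0.85, b1), ("B", 0.60, b2)]
model = greedy_assemble(pool)
```

In the toy pool, the best pose of subunit B clashes with the already selected pose of subunit A, so B falls back to its lower-scoring but non-overlapping pose.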
Acknowledgments
The authors thank Jacob C. Verburgt, Anika Jain, and Charles Christoffer for their help with literature search, discussion, and proofreading. The authors also thank Jessica A. Nash, Sam Ellis, and Jing Chen for their suggestions on optimizing the released software.