De Novo Atomic Protein Structure Modeling for Cryo-EM Density Maps Using 3D Transformer and Hidden Markov Model

Accurately building three-dimensional (3D) atomic structures from 3D cryo-electron microscopy (cryo-EM) density maps is a crucial step in the cryo-EM-based determination of the structures of protein complexes. Despite improvements in the resolution of 3D cryo-EM density maps, the de novo conversion of density maps into 3D atomic structures remains a significant challenge for protein complexes that lack accurate homologous or predicted structures to use as templates. Here, we introduce Cryo2Struct, a fully automated ab initio cryo-EM structure modeling method that first uses a 3D transformer to identify atoms and amino acid types in cryo-EM density maps and then employs a novel Hidden Markov Model (HMM) to connect the predicted atoms into protein backbone structures. Tested on a standard test dataset of 128 cryo-EM density maps with varying resolutions (2.1 – 5.6 Å) and different numbers of residues (730 – 8,416), Cryo2Struct built substantially more accurate and complete protein structural models than the widely used ab initio method Phenix in terms of multiple evaluation metrics. Moreover, on a new test dataset of 500 recently released density maps with varying resolutions (1.9 – 4.0 Å) and different numbers of residues (234 – 8,828), it built more accurate models than on the standard dataset. Its performance is also robust to changes in the resolution of the density maps and the size of the protein structures.


Introduction
Determining the three-dimensional (3D) atomic structures of macromolecules, such as protein complexes and assemblies [1][2][3], is fundamental in structural biology. The 3D arrangement of atoms provides essential insights into the mechanistic understanding of the molecular function of proteins [4]. In recent years, cryo-electron microscopy (cryo-EM) [5] has emerged as a key technology for experimentally determining the structures of large protein complexes and assemblies. However, modeling atomic protein structures from high-resolution cryo-EM density maps, which constitute a significant portion of the maps deposited in the EMDB [6], is both time-consuming and challenging, especially in the de novo setting when accurate homologous or predicted structures for target proteins or their units (chains) are unavailable [7,8]. Modeling atomic protein structures from cryo-EM maps faces the challenges of identifying atoms of proteins in density maps, tracing the atoms into chains to form the backbone structures, and registering amino acid sequences with them [9].
Despite the importance of the problem, only a small number of methods have been developed for determining atomic structures from cryo-EM maps, such as Phenix [10], DeepMainmast [9], DeepTracer [11], and ModelAngelo [12]. Phenix is the most widely used standard tool for building atomic protein structures from cryo-EM density maps using classic molecular optimization. DeepTracer provides a web-based deep learning tool for users to predict atomic structures from density maps. ModelAngelo combines information from cryo-EM map data, amino acid sequences, and prior knowledge about protein geometries to refine the geometry of the protein chain and assign amino acid types. DeepMainmast, a recently developed method, integrates AlphaFold2 [13] with a density tracing protocol to determine atomic models from cryo-EM maps. Incorporating accurate AlphaFold-predicted structures into the modeling has significantly improved the quality of the structures reconstructed from cryo-EM density maps [9].
However, modeling multi-chain protein structures from cryo-EM density maps remains a challenging task for the existing methods, particularly when the predicted structures available as templates for the target protein complexes or their chains are inaccurate. The de novo modeling of protein structures from density maps alone, without using templates, is not only practically relevant in this situation but can also help answer an important question: how much structural information can be extracted from cryo-EM density maps alone? In the de novo modeling context, we introduce Cryo2Struct (i.e., cryo-EM to structure), a fully automated, ab initio modeling method that does not require predicted or homologous structures as input to generate 3D atomic structures from cryo-EM maps alone. Cryo2Struct first uses a Transformer-based deep learning model with an attention mechanism [14], which captures long-range atom-atom interactions, to identify atoms and their amino acid types in cryo-EM density maps. It then uses a novel Hidden Markov Model (HMM) [15] and a tailored Viterbi algorithm [16] to align protein sequences with the predicted atoms and amino acid types to generate atomic backbone structures. Cryo2Struct is rigorously tested on 628 density maps in the stringent ab initio modeling setting, in which no homologous or predicted structure is used as a template, and yields substantially improved modeling accuracy.

Atomic structure modeling workflow
Cryo2Struct takes a 3D cryo-EM density map and the corresponding amino acid sequence of a protein as input and automatically generates a 3D atomic protein structure as output (Fig. 1a-e). As in [17], we divide the problem of atomic structure determination from a cryo-EM density map into an atom classification (recognition) task and a sequence-atom alignment task. The two tasks are performed by a Deep Learning (DL) block based on a transformer (Fig. 1b) and an alignment block based on an HMM (Fig. 1d), respectively. The DL block classifies each voxel (3D pixel) within the cryo-EM density map into different types of backbone atoms (e.g., Cα) or a non-backbone voxel and predicts its amino acid type. The alignment block constructs an HMM [15] from the predicted Cα atoms (corresponding to the hidden states in the HMM) and aligns the amino acid sequences with them using a customized Viterbi algorithm, resulting in a sequence of Cα atoms connected as protein chains to form the atomic backbone structure of the protein. Additional details are available in the Methods Section.

Predicting backbone atom and amino acid types using 3D transformer
The first step of the atomic structure modeling is to detect the voxels in the cryo-EM density map that contain backbone atoms and predict their amino acid types. We designed and trained a 3D transformer-based model to classify each voxel of the cryo-EM density map into one of four classes representing three backbone atoms (Cα, C, and N) and the absence of any backbone atom. Another 3D transformer-based model was designed and trained to classify each voxel of the cryo-EM density map into one of twenty-one amino acid classes representing the twenty standard amino acids and the absence of an amino acid or an unknown amino acid. The models were trained as sequence-to-sequence predictors, utilizing a Transformer-Encoder [14] to capture long-range voxel-voxel dependencies and a skip-connected decoder to combine the features extracted at different encoder layers to classify each voxel. The models were trained using the large Cryo2StructData dataset [18], a comprehensive labeled dataset of cryo-EM density maps curated specifically for deep learning-based atomic structure modeling in cryo-EM. The models were first trained and validated on the entire dataset, comprising 6,652 cryo-EM maps for training and 740 cryo-EM maps for validation, and were then blindly tested on two test datasets. Because some predicted Cα voxels are spatially very close and likely correspond to the same Cα atom, Cryo2Struct employs a clustering strategy to group predicted Cα voxels within a 2 Å radius into clusters and select the centrally located Cα voxel in each cluster as the final predicted Cα atom (see the Methods Section for details).

Aligning protein sequence with predicted Cα atoms
The goal of this step is to connect the predicted, disjoint Cα atoms into peptide chains and assign amino acid types to them (sequence registering). To achieve this goal, the alignment block constructs an HMM from the predicted Cα atoms and their predicted amino acid type probabilities, in which each predicted Cα atom is represented by one hidden state. The transition probability between two hidden states is assigned according to the spatial distance between their corresponding Cα atoms, and the emission probability of each hidden state for generating the 20 different amino acids is assigned according to the predicted probabilities of the 20 amino acid types for its Cα atom (see more details in the Methods Section). The sequence of each chain of the protein is aligned to the HMM by a customized Viterbi algorithm to generate the most probable path of hidden states (Cα atoms). The path for a chain represents the connected Cα atoms of its backbone structure. The paths for the multiple chains of a protein, together with the sequences aligned with them, form the final atomic backbone structure of the protein. Fig. 1f illustrates a high-quality structure modeled by Cryo2Struct, while Fig. 1g and 1h provide a detailed view of the structure. In Fig. 1g, the predicted Cα atoms are depicted and connected by the alignment block. Fig. 1h reveals the amino acid type assignment for each Cα atom.
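The alignment idea can be illustrated with a minimal sketch of standard Viterbi decoding over Cα hidden states. This is not the customized algorithm of Cryo2Struct: the Gaussian transition score, its spread `CA_CA_STD`, the uniform start distribution, and all function names are assumptions made for illustration. The 3.8 Å mean is the typical distance between consecutive Cα atoms. Unlike a tailored variant, this plain version does not prevent a state from being visited more than once.

```python
import math

CA_CA_MEAN = 3.8  # typical distance between consecutive C-alpha atoms (angstroms)
CA_CA_STD = 0.5   # hypothetical spread, chosen only for this illustration


def log_transition(p, q):
    """Log-score for moving between two C-alpha states, peaked near 3.8 angstroms."""
    d = math.dist(p, q)
    return -0.5 * ((d - CA_CA_MEAN) / CA_CA_STD) ** 2


def viterbi(coords, emissions, sequence):
    """Return the most probable path of state indices for `sequence`.

    coords    -- list of (x, y, z) positions, one per hidden state (needs >= 2)
    emissions -- list of dicts mapping amino acid letter -> probability
    sequence  -- amino acid string to align
    """
    n = len(coords)
    tiny = 1e-12  # avoids log(0) for amino acids with no predicted probability
    # Uniform start: score of each state emitting the first residue.
    score = [math.log(emissions[s].get(sequence[0], 0.0) + tiny) for s in range(n)]
    back = []
    for aa in sequence[1:]:
        new_score, pointers = [], []
        for s in range(n):
            # Best predecessor of state s; self-transitions are excluded.
            best_prev = max(
                (r for r in range(n) if r != s),
                key=lambda r: score[r] + log_transition(coords[r], coords[s]),
            )
            emit = math.log(emissions[s].get(aa, 0.0) + tiny)
            new_score.append(
                score[best_prev] + log_transition(coords[best_prev], coords[s]) + emit
            )
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

For three collinear states spaced 3.8 Å apart whose emission probabilities favor A, G, and L respectively, aligning the sequence "AGL" recovers the states in spatial order.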

Structure modeling performance
Comparing Cryo2Struct with Phenix on a standard dataset
After Cryo2Struct was trained and validated, we first compared the modeling performance of Cryo2Struct and Phenix [10] on a standard test dataset that was used to benchmark Phenix's map-to-model tool [19]. Most density maps in the dataset are for multi-chain protein complexes, while some are associated with single-chain proteins. Their resolution ranges from 2.1 Å to 5.6 Å, with an average of 3.68 Å. The number of amino acid residues included in the maps varies from 730 to 8,416. These test maps were not present in the training and validation dataset used to train the Cryo2Struct DL model. We chose Phenix as a reference here because it built the structures from the density maps in the same ab initio mode that Cryo2Struct is designed for, without using homologous or predicted protein structures as input. The structures built by Phenix were downloaded from its website [19]. The structural models built by Cryo2Struct and Phenix for the 128 test cryo-EM maps were compared with the true structures in the Protein Data Bank (PDB) to evaluate their quality. The evaluation results in terms of six metrics are presented in Fig. 2.
Fig. 2a plots the recall of Cα atoms of the model built by Cryo2Struct for each of the 128 density maps against that of Phenix. The recall (sensitivity) represents the fraction of actual Cα atoms in the true structure that are correctly identified by a model. Cryo2Struct achieves an average recall score of 65%, much higher than the 40% of Phenix, indicating that Cryo2Struct correctly recovers a much higher percentage of Cα atoms than Phenix. On 126 out of 128 density maps, Cryo2Struct has a higher recall than Phenix.
Fig. 2b plots the F1 score of Cα atoms of Cryo2Struct against that of Phenix. The F1 score is the harmonic mean of the precision and recall of Cα atoms. The precision (specificity) is the percentage of predicted Cα atoms that are correct. The F1 score is a balanced measure because it considers both the specificity and sensitivity of predicted Cα atoms. The average F1 scores of Cryo2Struct and Phenix are 0.66 and 0.52, respectively. On 105 out of 128 maps, Cryo2Struct has a higher F1 score.
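The recall, precision, and F1 definitions above can be sketched as follows. This is a simplified illustration, not the evaluation code used for Fig. 2: it counts an atom as matched if any atom in the other set lies within the cutoff, without enforcing one-to-one matching, and the function name `ca_metrics` is hypothetical. The 3 Å cutoff follows the recall definition given in the caption of Fig. 2.

```python
import math


def ca_metrics(predicted, true, cutoff=3.0):
    """Return (recall, precision, f1) for predicted vs. true C-alpha coordinates."""

    def near(point, others):
        # True if any atom in `others` lies within `cutoff` angstroms of `point`.
        return any(math.dist(point, o) <= cutoff for o in others)

    recovered = sum(1 for t in true if near(t, predicted))  # true atoms found
    correct = sum(1 for p in predicted if near(p, true))    # predictions that hit
    recall = recovered / len(true) if true else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1
```

With four true atoms and three predictions of which two fall within 3 Å of a true atom, the recall is 2/4, the precision is 2/3, and the F1 score is 4/7.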
Fig. 2c plots the global normalized TM-scores of the models built by the two methods. A standard TM-score measures the similarity between a model and the corresponding known structure. It was calculated with US-align [20], a protein complex structure comparison tool, with its options for aligning two multi-chain oligomeric structures and all the chains enabled, as recommended for aligning biological assemblies. In this analysis, to fairly compare the models built by Cryo2Struct and Phenix, which usually have different lengths (numbers of residues), the global TM-score is normalized by the length of the experimental structure. The TM-score ranges from 0 to 1, with 1 being the best possible score. The average global normalized TM-score of Cryo2Struct is 0.2, more than double the 0.084 of Phenix. On 114 out of 128 density maps, Cryo2Struct has a higher normalized TM-score than Phenix.
Fig. 2 The comparative analysis of atomic models built for 128 test cryo-EM maps by Cryo2Struct and Phenix in terms of six metrics. In each panel of an evaluation metric, the score of the model built by Cryo2Struct for each map is plotted against that of Phenix for the same map. A dot above the 45-degree line indicates that Cryo2Struct has a higher score than Phenix for the map. The number in the top-left corner represents the total number of maps on which Cryo2Struct has higher scores, while the number in the bottom-right corner denotes the total number of maps on which Phenix has higher scores. (a) The Cα recall of the atomic models of Cryo2Struct against Phenix; the recall is defined as the number of Cα atoms in the predicted model that are placed within 3 Å of the correct position in the corresponding known structure, divided by the total number of Cα atoms in the known structure. (b) The F1 score of Cα, which is the harmonic mean of the precision and recall of Cα; it is a balanced measure quantifying a method's ability to make accurate Cα predictions while also capturing as many Cα atoms as possible.
However, the average global normalized TM-score of both methods is still low. One reason is that the TM-score is a sequence-dependent global measure: obtaining a high normalized TM-score requires that a large portion of the Cα atoms of a large protein complex be not only correctly identified (high recall) but also all correctly linked at the same time. This is still very challenging for de novo atomic model building from density maps alone, which may have missing density values in some regions that disconnect the Cα atoms. Another reason is that the TM-score computed by US-align is normalized by the total length of the known structure, which is usually very large (average length of the true structures = 3794.95 residues), rather than by the length of a structurally aligned region between a model and the true complex structure. So, if the aligned region has a high TM-score but is only a fraction of the entire known structure, the normalized TM-score will still be low. We expect that complementing density maps with features extracted from protein sequences or AlphaFold-predicted structures as input for deep learning to predict Cα atoms and amino acid types can further improve the normalized TM-score [7,9].
Fig. 2d compares the aligned Cα length of the structural models built by Cryo2Struct and Phenix, which was computed by US-align. The aligned length is the number of Cα atoms (residues) that have been successfully matched or aligned between a predicted model and its true structure. The average length of the true structures for all 128 test maps is 3794.95. The average aligned length of Cryo2Struct's models is 945.55 (about 24.9% of the length of the known structure on average), 2.6 times the average aligned length of 358.51 for Phenix (about 9.4% of the length of the known structure on average). On 120 out of 128 density maps, Cryo2Struct has a larger aligned length than Phenix. Another interesting phenomenon is that the models constructed by Cryo2Struct always have the same or a very similar number of residues as the corresponding true structures (supplementary Fig. A1a) and therefore capture the overall shape of the true protein structure well despite some errors in local regions and atom connections, while the models constructed by Phenix are usually much smaller than the true structures (supplementary Fig. A1b) and therefore render only a portion of the true structures.
In addition to using US-align to compare the models with the known structures, we also used the phenix.chain_comparison tool to compare a model with the true structure and compute the percentage of matching Cα atoms, as shown in Fig. 2e. It calculates the Cα match score, the percentage of Cα atoms (residues) in the model that have corresponding residues within a 3 Å distance in the true structure. It also reports the sequence match score, i.e., the percentage of the matched residues that have the same amino acid type (identity) as their counterparts in the true structure. The models built by Cryo2Struct have an average Cα match score of 43%, higher than Phenix's 41.2%. The average sequence match score is 13.4% for both Cryo2Struct and Phenix. It is worth noting that the Cα match score measures the match precision of the Cα atoms in a model without considering the Cα atom coverage of the model. For instance, a partial model may have a high Cα match score but cover only a small portion of its corresponding true structure. Because Cryo2Struct tends to build much more complete models than Phenix, their difference in terms of the Cα match score is less pronounced than in terms of the other metrics.
To remedy the shortcoming of the Cα match score calculated by the phenix.chain_comparison tool, we introduce a new Cα quality score that considers both the Cα match precision and the Cα coverage. It is the product of the Cα match score and the coverage, i.e., the total number of predicted residues of a model (obtained from the phenix.chain_comparison tool) divided by the number of residues in the true structure. It is in the range [0, 1]. A higher score signifies a more accurate and complete structural model. Fig. 2f compares the Cα quality scores of the structures modeled by Cryo2Struct and Phenix. The average Cα quality score for Cryo2Struct is 0.43, substantially higher than the 0.23 of Phenix. On 125 out of 128 maps, Cryo2Struct has a higher Cα quality score than Phenix. The result shows that Cryo2Struct is capable of building structural models with higher average coverage and Cα match score than Phenix.
Fig. 2g-i illustrates such an example (EMD ID: 8767). The true structure for the map (Fig. 2g) has 2,934 residues. The model built by Cryo2Struct (Fig. 2h) has 2,932 residues, about 7.5 times the 392 residues of the very fragmented model built by Phenix (Fig. 2i), while the Cα match score and sequence match score of the former (53.4% and 15%) are only 26-29% higher than the 42.3% and 11.6% of the latter. In contrast, the Cα quality score of the model constructed by Cryo2Struct is 0.54, 9 times the 0.06 of the Phenix model, more accurately reflecting the difference in the quality of the two models.
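The Cα quality score defined above is simple enough to express directly. The sketch below assumes the Cα match score is given as a percentage; the function name `ca_quality_score` is hypothetical. It is checked against the Phenix model of EMD-8767, using the match score (42.3%) and residue counts (392 of 2,934) quoted in the text.

```python
def ca_quality_score(ca_match_percent, n_predicted, n_true):
    """C-alpha quality score: match precision multiplied by residue coverage."""
    if n_true <= 0:
        raise ValueError("n_true must be positive")
    precision = ca_match_percent / 100.0  # fraction of model C-alphas that match
    coverage = n_predicted / n_true       # fraction of the true structure modeled
    return precision * coverage


# Phenix model for EMD-8767: 42.3% match score, 392 of 2,934 residues.
phenix_example = ca_quality_score(42.3, 392, 2934)
print(round(phenix_example, 2))
```

The result rounds to 0.06, in agreement with the value reported for the Phenix model.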
Finally, we analyzed how the performance of the two methods changes with respect to the resolution of the cryo-EM density maps. Fig. 3a-c plot the F1 scores, global normalized TM-scores, and Cα quality scores, respectively, of the models built by the two methods against the resolution of the cryo-EM density maps measured in angstroms (Å). In terms of each of the three scoring metrics, as expected, the accuracy of the models built by the two methods decreases as the value of the resolution of the cryo-EM density maps increases (i.e., the resolution gets worse). The linear regression line for Cryo2Struct models is above that for Phenix, indicating that for maps of the same resolution, the average score of the models built by Cryo2Struct is higher than that of Phenix. Moreover, the gap between the two increases as the value of the resolution gets larger. This indicates that the quality of the models built by Phenix decreases faster than that of Cryo2Struct as the resolution of the cryo-EM density maps gets worse, i.e., Cryo2Struct is more robust against (or less sensitive to) changes in the resolution of density maps than Phenix. This is reflected by the less steep negative slope of the regression line for Cryo2Struct than that for Phenix and by the weaker negative correlation between the score of Cryo2Struct and the resolution of the cryo-EM density maps. For instance, the Pearson correlation coefficient between the F1 score of Cryo2Struct and the resolution is −0.28, weaker than the −0.40 of Phenix.
This observation is consistent across all three metrics, indicating that Cryo2Struct generally builds better models from cryo-EM density maps than Phenix and can therefore be used to improve the quality of the models built from both the existing cryo-EM density maps in the Electron Microscopy Data Bank (EMDB) and the new ones to be generated.

Evaluating Cryo2Struct on a large new dataset
We further evaluated the performance of Cryo2Struct on a large independent test dataset of 500 new maps with resolutions ranging from 1.9 Å to 4.0 Å. The average resolution of the density maps is 2.88 Å. These maps, released after April 2023, do not exist in the training and validation data in Cryo2StructData [21], which contains the cryo-EM density maps released before April 2023. The number of residues in the 500 maps ranges from 234 to 8,828.
On the new dataset, the average recall, F1 score, global normalized TM-score, Cα quality score, Cα sequence match score, and Cα match score of Cryo2Struct are 70%, 70%, 0.22, 0.50, 20.1%, and 49.5%, respectively, higher than the 65%, 66%, 0.2, 0.43, 13.4%, and 43% on the standard test dataset. This suggests that the average quality of the cryo-EM density maps in the new dataset is higher than that in the standard dataset, which is consistent with the fact that the new cryo-EM density maps have an average resolution of 2.88 Å, better than the average resolution of 3.68 Å of the older density maps in the standard test dataset. The relatively high recall, F1 score, Cα quality score, and Cα match score show that Cryo2Struct performs very well in identifying individual Cα atoms, while the relatively lower global normalized TM-score and Cα sequence match score indicate that it is still very challenging to build correctly connected models that cover and match most regions of a large protein structure and its sequence.
Fig. 4a-f illustrate the relationship between each of the six scores (the recall, F1 score, global normalized TM-score, Cα quality score, Cα sequence match score, and Cα match score) of the models and the resolution of the density maps. In terms of each metric, there is a negative relationship between the metric and the resolution, i.e., the quality of the model decreases as the resolution gets worse.
Moreover, because only some regions of the models built by Cryo2Struct can be aligned with the true structures, we specifically analyzed the quality of the local regions of the models that can be aligned with the true structures by US-align in terms of the aligned Cα length and the RMSD (root mean squared distance) of the aligned regions. Supplementary Fig. A2 plots the RMSD of the aligned regions against their lengths for all the models built for the density maps in the new test dataset. The average length of the aligned regions of the models is 532.51 residues, accounting for 29% of the average length of the true structures (i.e., 1837.43 residues). The average RMSD of the aligned regions is 1.6 Å. The results show that Cryo2Struct can build a significant portion of the protein structures with very high accuracy (low RMSD). The RMSD also decreases slightly (i.e., the accuracy increases) with the length of the aligned regions, according to the weak Pearson correlation of −0.134 between the RMSD and the length of the aligned regions. It is interesting to observe that Cryo2Struct can build high-accuracy models of large aligned regions of up to thousands of residues.
Finally, we investigated how the global quality of the models changes with respect to the length (number of residues) of the known structures (i.e., the size of the proteins) (Fig. A3). Unlike their similar relationships with the resolution of the cryo-EM density maps, the six metrics (recall, F1 score, global normalized TM-score, Cα quality score, Cα sequence match score, and Cα match score) exhibit different relationships with the size of the proteins. The Cα recall and F1 score have a weak positive correlation (0.259 and 0.258, respectively) with the size of the proteins, indicating that it is slightly easier to recognize individual Cα atoms for larger protein structures, while there is a weak negative correlation (−0.214) between the global TM-score and the size of the proteins, indicating that it is slightly more difficult to build accurate full-length models for larger proteins. The correlation of the Cα quality score, Cα sequence match score, and Cα match score with the size of the proteins is almost 0, indicating that these scores are largely independent of the size of the proteins.

Discussion
De novo modeling of protein structure solely from density maps, without using structural templates, is an interesting and important problem because it establishes a lower bound on the amount of structural information that can be extracted from density maps. We developed Cryo2Struct, a de novo AI modeling method based on the transformer and HMM for building atomic protein structural models from medium- and high-resolution cryo-EM maps alone. The modeling process is fully automated, requiring no human intervention and no input from external tools. Cryo2Struct can identify individual Cα atoms in density maps rather accurately and is robust against decreases in the resolution of density maps. Moreover, Cryo2Struct achieved substantially better performance than the most widely used de novo modeling method, Phenix, in terms of multiple evaluation metrics including Cα recall, F1 score, global normalized TM-score, aligned Cα length, Cα match score, Cα sequence match score, and Cα quality score. In general, it can build much more accurate and more complete protein structures from cryo-EM density maps than Phenix, thereby advancing the state of the art of ab initio modeling of protein structures from cryo-EM density maps and providing a useful means for the community to build better protein structural models from both existing cryo-EM density maps and new ones to be generated to support biomedical research.
However, even though Cryo2Struct can identify most Cα atoms correctly with a high F1 score and build highly accurate atomic models with very low RMSD for some regions of large protein structures, building highly accurate models covering most regions of large protein structures from density maps alone remains very challenging, as reflected in the low global TM-score and Cα sequence match score of the models. Obtaining a high global TM-score and Cα sequence match score requires most if not all individual Cα atoms to be not only correctly identified but also correctly linked into peptide chains and assigned the correct amino acid types, which is combinatorially more challenging than predicting individual Cα atoms. A prediction error for only a few Cα atoms, caused by the missing or noisy values that are very common in cryo-EM density maps, may drastically lower the TM-score and Cα sequence match score of the models, because a high TM-score and Cα sequence match score can be obtained only when long continuous stretches of chains are correctly predicted. However, experimentally generating cryo-EM density maps that contain high-resolution density values covering every residue of a protein structure is still very challenging.
We envision that the global TM-score and Cα sequence match score of the structural models built from cryo-EM density maps can be further improved in at least three different ways. The first is to develop more sophisticated and robust AI methods to predict protein atoms and their amino acid types with higher sensitivity and specificity from cryo-EM density maps to help build more accurate and complete protein chains. The second is to use additional inputs such as protein sequence information and AlphaFold-predicted protein structures to complement missing information in cryo-EM density maps to obtain more accurate and complete predictions. The third is to generate more accurate and complete cryo-EM density maps in the first place for the AI methods to use, which is being done by the community and would automatically improve the performance of Cryo2Struct, as seen on the new test dataset in this work. In the future, we plan to further expand Cryo2Struct to integrate cryo-EM density maps, protein sequences, and AlphaFold-predicted structures with deep learning to build more accurate and complete protein structures. As more and more high-quality cryo-EM maps are deposited in the EMDB [6], such tools for automatically modeling atomic structures from them can enable scientists to better leverage this valuable resource to advance biomedical research.

Structure Modeling Process
As illustrated in Fig. 1, Cryo2Struct tackles the problem of building 3D atomic structural models from a 3D cryo-EM density map in the following three main steps:
1. Predict Cα voxels and their amino acid types in the cryo-EM density map of a protein using a deep learning method based on a transformer.
2. Construct an HMM (λ) with the predicted Cα voxels as hidden states and with emission and transition probability parameters set according to their predicted probabilities and their pairwise distances.
3. Align the amino acid sequence of the protein with the HMM using a customized Viterbi algorithm to generate the most probable path of hidden states (Cα atoms), which forms the backbone structure.

Predicting Cα voxels and amino acid types
We designed a transformer-based model (Fig. 6), inspired by U-Net Transformers (UNETR) [22], for voxel classification in cryo-EM density maps.
The model follows the contracting-expanding pattern of U-Net [23], utilizing a series of transformer-based encoders to extract features at multiple layers. The features extracted from different layers are utilized by a CNN-based decoder with skip connections to classify the voxels into different classes. One model is trained to classify voxels into four classes (Cα, C, N, and the absence of an atom) (atom type classification). Another model is trained to classify voxels into 21 classes, representing 20 amino acid types and an absent or unknown amino acid type.

Deep learning architecture
The deep learning model (Fig. 6) takes as input a sub-grid of the cryo-EM density map represented as a 4D tensor $x \in \mathbb{R}^{H \times W \times D \times C}$, where $H$ is the height, $W$ is the width, $D$ is the depth, and $C$ is the number of channels ($C = 1$ for the input). $x$ is then divided into a series of flattened, uniform, non-overlapping patches $x_v \in \mathbb{R}^{N \times (P^3 \cdot C)}$, where $P$ denotes the patch dimension and $N = (H \times W \times D)/P^3$ is the number of patches. The series of patches is projected by a 3D convolution layer into a $K$-dimensional embedding space. A 1D learnable positional encoding $E_{pos} \in \mathbb{R}^{N \times K}$ is then added to the projected patches, which subsequently serve as the input to the transformer encoder. Here, $P$ is set to 16 and the embedding dimension $K$ to 768.
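The tensor shapes along this patch-embedding path can be checked with a few lines of arithmetic. The sketch below only computes shapes and builds no network; the function name `patch_shapes` and the dictionary keys are hypothetical. The 32 × 32 × 32 sub-grid size is the one stated in the training section.

```python
def patch_shapes(h, w, d, c=1, p=16, k=768):
    """Return the tensor shapes along the patch-embedding path."""
    if h % p or w % p or d % p:
        raise ValueError("each grid dimension must be divisible by the patch size")
    n = (h * w * d) // p ** 3  # number of non-overlapping patches
    return {
        "input": (h, w, d, c),
        "flattened_patches": (n, p ** 3 * c),  # x_v
        "embedded_patches": (n, k),            # after projection to K dimensions
        "positional_encoding": (n, k),         # E_pos, added element-wise
    }


shapes = patch_shapes(32, 32, 32)
print(shapes["flattened_patches"])  # (8, 4096)
```

With P = 16, a 32 × 32 × 32 sub-grid therefore yields only N = 8 patches, each embedded as a 768-dimensional token.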
Cryo2Struct uses an encoder of 12 blocks [14], each consisting of a normalization layer [24], a multi-head attention layer, a normalization layer, and a multi-layer perceptron, to generate features for the input series of patches. The features from four different blocks, $z_i$ with $i \in \{3, 6, 9, 12\}$, each of size $\frac{H \times W \times D}{P^3} \times K$, are reshaped into tensors of size $\frac{H}{P} \times \frac{W}{P} \times \frac{D}{P} \times K$. The features of the four blocks and the original input are processed by deconvolution and/or convolution layers and concatenated step by step in a U-Net fashion to generate the final feature tensor with the same dimensions as the original input (see Fig. 6 for details), which is used by a 1×1×1 convolution layer to classify each voxel.

Training and validation
We used the Cryo2StructData [21] dataset, which includes maps with resolutions in the range [1.0 Å, 4.0 Å], to train and validate the two transformer models. The cryo-EM density maps in the dataset were released up to 27 March 2023. The dataset has 7,392 cryo-EM density maps in total and is split in a 90% to 10% ratio into training and validation datasets, which have 6,652 and 740 cryo-EM density maps, respectively. The atom types and amino acid types of the voxels in the density maps are labeled.
The training was performed on sub-grids (dimension: 32 × 32 × 32) of the density maps, using a batch size of 720, the NADAM optimizer [25] with a learning rate of 1e-4, and a dropout rate of 0.1. We used a distributed data parallel (DDP) technique to train the models on 24 compute nodes, each equipped with 6 NVIDIA V100 32GB-memory GPUs, in the Summit supercomputer [26]. The deep learning models were trained with the weighted cross-entropy loss function described in Equation 1 to handle the class imbalance problem.
In Equation 1, $L(x, y)$ represents the weighted cross-entropy loss, N is the number of samples in the minibatch, C is the number of classes, $w_c$ is the weight for class c computed using Equation 2, $x_{n,c}$ is the logit for class c in sample n, and $y_{n,c}$ is a binary indicator (0 or 1) of whether class c is the correct classification for sample n. In Equation 2, $\omega_c$ represents the weight assigned to class c, $n_c$ is the number of samples in class c, and $\sum_{k=0}^{\text{classes}} n_k$ is the total number of samples across all classes.
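The quantities defined above can be combined as in the following minimal sketch of a weighted cross-entropy loss. It rests on two assumptions that the text does not confirm: the class weights are taken to be inverse class frequencies (total count divided by class count), and the loss is averaged over the N samples of the minibatch. The function names are hypothetical, and the paper's Equations 1 and 2 may differ in normalization.

```python
import math

def class_weights(counts):
    """Inverse-frequency class weights (an assumed form; Equation 2 may differ)."""
    total = sum(counts)
    return [total / n_c for n_c in counts]

def weighted_cross_entropy(logits, targets, weights):
    """Average over the minibatch of -w_c * log softmax(x_n)[c] for the true class c."""
    loss = 0.0
    for x_n, c in zip(logits, targets):
        m = max(x_n)  # subtract the max logit for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x_n))
        loss += -weights[c] * (x_n[c] - log_sum_exp)
    return loss / len(logits)

# Toy usage: three classes, where class 0 (e.g., "no atom") dominates.
w = class_weights([900, 50, 50])
batch_logits = [[2.0, 0.1, 0.1], [0.2, 1.5, 0.3]]
batch_targets = [0, 1]
print(weighted_cross_entropy(batch_logits, batch_targets, w))
```

Under this weighting, an error on a rare class contributes far more to the loss than an error on the dominant class, which is the intended effect for imbalanced voxel labels.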
Throughout the training process, we monitored both the training and validation loss along with the F1 score, which is well suited to class-imbalanced data because it is the harmonic mean of precision and recall. We implemented and trained the deep learning models using PyTorch Lightning [27], version 1.7.3. The evaluation metrics (F1, recall, and precision) were computed using TorchMetrics [28], version 0.9.3. We tracked the model's performance on both training and validation data using the Weights and Biases tool. If the validation loss did not improve for five consecutive epochs, we reduced the learning rate by a factor of 0.1. We saved the top 5 trained models with the lowest validation loss during training and selected the model with the highest F1 score on the validation dataset as the final trained model.
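The learning-rate rule and the checkpoint selection described above can be sketched in plain Python as follows. This is an illustration of the stated policy only, not the PyTorch Lightning configuration used for training; the function names and the dictionary keys `val_loss` and `val_f1` are hypothetical.

```python
def reduce_on_plateau(val_losses, lr=1e-4, factor=0.1, patience=5):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    stale = 0          # consecutive epochs without improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor   # shrink the learning rate and restart the count
                stale = 0
        history.append(lr)
    return history

def select_final(checkpoints, top_k=5):
    """Keep the top_k checkpoints by validation loss, then pick the best F1."""
    kept = sorted(checkpoints, key=lambda c: c["val_loss"])[:top_k]
    return max(kept, key=lambda c: c["val_f1"])
```

Note that a checkpoint with a high F1 score but a validation loss outside the top five is never considered under this two-stage rule.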

Cα voxel clustering
When the trained transformer is applied to a density map to predict Cα voxels, it is common that multiple spatially close voxels corresponding to the same Cα atom are predicted as Cα atoms. To remove this redundancy, Cryo2Struct employs a clustering strategy that groups predicted Cα voxels within a 2 Å radius into clusters. The average Cα probability and the average amino acid type probability of the Cα voxels in each cluster are computed. The centrally located Cα voxel in each cluster and the average probabilities of the cluster are used to represent the Cα atom of the cluster, while the other Cα voxels in the same cluster are removed.
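One simple way to realize such a clustering step is sketched below. The greedy seeding order (highest Cα probability first) and the choice of the member closest to the cluster centroid as the "centrally located" voxel are assumptions made for illustration; the text does not specify these details. Averaging of the amino acid type probabilities is omitted for brevity but would follow the same pattern as the Cα probability.

```python
import math

def cluster_ca_voxels(voxels, radius=2.0):
    """voxels: list of ((x, y, z), ca_prob). Returns one (coord, mean_prob) per cluster."""
    # Assumption: seed clusters from the most confident remaining voxel.
    remaining = sorted(voxels, key=lambda v: v[1], reverse=True)
    clusters = []
    while remaining:
        seed = remaining[0]
        members = [v for v in remaining if math.dist(v[0], seed[0]) <= radius]
        remaining = [v for v in remaining if math.dist(v[0], seed[0]) > radius]
        centroid = tuple(sum(m[0][i] for m in members) / len(members) for i in range(3))
        # Represent the cluster by the member voxel closest to the centroid.
        center = min(members, key=lambda m: math.dist(m[0], centroid))
        mean_prob = sum(m[1] for m in members) / len(members)
        clusters.append((center[0], mean_prob))
    return clusters

# Toy usage: three nearby voxels collapse into one Cα; a distant voxel stays separate.
toy = [((0.0, 0.0, 0.0), 0.9), ((1.0, 0.0, 0.0), 0.7),
       ((0.5, 0.0, 0.0), 0.8), ((10.0, 0.0, 0.0), 0.6)]
print(cluster_ca_voxels(toy))
```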

Connecting Cα atoms into Protein Chains and Assigning Amino Acids to Cα atoms
Connecting predicted Cα voxels into chains and accurately assigning their amino acid types is a challenging task. We designed a novel Hidden Markov Model (HMM), whose hidden states represent predicted Cα voxels, to accomplish both in a single step: a customized Viterbi algorithm aligns the sequence of a target protein with the HMM. The hidden states (Cα voxels) aligned with the sequence are joined together to form the backbone of the protein, in which the amino acid type of each Cα voxel is set to the type of the amino acid aligned with it. Cα voxels with a probability higher than 0.4 are selected as the hidden states of the HMM. The HMM uses K hidden states to represent the K predicted Cα voxels. We denote the individual Cα hidden states in the HMM as $S = \{S_1, S_2, S_3, \ldots, S_K\}$ and the individual symbols (amino acid types) generated from the hidden states as $V = \{V_1, V_2, V_3, \ldots, V_N\}$, where N is equal to the number of standard amino acids (i.e., 20). The hidden states in the HMM are fully connected, i.e., there is a direct transition from any state to any other state, as depicted in Fig. A4a. The transition probabilities between Cα hidden states are stored in the transition matrix, denoted as γ, with a size of K × K. The emission probabilities of generating observation symbols from the hidden states are stored in the emission matrix, denoted as δ, with a size of K × N. The initial state distribution is denoted as $\Pi = \langle \pi_1, \pi_2, \pi_3, \ldots, \pi_K \rangle$, where $\pi_i$ is the probability that the HMM starts from state i. A hidden path may start from and end at any state. We use a compact notation, λ = (γ, δ, Π), to represent the HMM.

Hidden Markov Model Construction
The transition probability matrix (γ) is constructed based on the distance between two predicted Cα states (voxels) in 3D space, calculated from their coordinates using Equation 3. The distance x is converted into a probability using the modified Gaussian probability density function (PDF) $f(x)$ in Equation 4, with a mean (µ) of 3.8047 Å and a standard deviation (σ) of 0.036 Å. Both µ and σ were estimated from the distances between adjacent Cα atoms in the true protein structures in the training dataset. Additionally, we introduce a tunable scaling factor (Λ) that multiplies σ to make the model adjustable. We set Λ to 10. The transition probabilities from one state to all other states are normalized by dividing each of them by their sum.
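The construction of γ can be sketched as follows, assuming that Equation 3 is the Euclidean distance and that the "modified" Gaussian in Equation 4 is a standard Gaussian PDF whose standard deviation is Λσ. Both are assumptions, since the equations themselves are not reproduced here; the exclusion of self-transitions is inferred from the phrase "to all other states". Function names are hypothetical.

```python
import math

MU = 3.8047     # mean adjacent Cα-Cα distance in Å (from the text)
SIGMA = 0.036   # standard deviation in Å (from the text)
LAMBDA = 10.0   # scaling factor multiplying sigma (from the text)

def scaled_gaussian_pdf(x, mu=MU, sigma=SIGMA, scale=LAMBDA):
    """Gaussian PDF with standard deviation scale * sigma (assumed form of Equation 4)."""
    s = scale * sigma
    return math.exp(-((x - mu) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def transition_matrix(coords):
    """Row-normalized K x K matrix built from pairwise Cα distances."""
    k = len(coords)
    gamma = []
    for i in range(k):
        # Assumption: a state never transitions to itself.
        row = [scaled_gaussian_pdf(math.dist(coords[i], coords[j])) if i != j else 0.0
               for j in range(k)]
        total = sum(row)
        gamma.append([p / total for p in row] if total > 0 else row)
    return gamma
```

With Λσ = 0.36 Å, a candidate neighbor at 3.8 Å receives almost all of a row's probability mass, while one at 7.6 Å (a skipped residue) receives a negligible share, so the most probable paths follow chemically plausible Cα-Cα spacing.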
The emission probability matrix (δ) for each Cα state (voxel) is calculated from both its predicted amino acid type probability and the background (prior) probability of the 20 amino acids in nature. Specifically, the geometric mean of the two is calculated as $\sqrt{a \times b}$, where a is the predicted probability for each amino acid type and b is the background frequency of the amino acid type, precomputed from the true protein structures in the training dataset. The geometric means for the 20 amino acid types are normalized by their sum to give the final emission probabilities. An example of an emission matrix is shown in Fig. A4b. The initial probability for a Cα state ($\pi_i$) is the probability that it generates the first amino acid of the protein sequence, normalized by the sum of these probabilities over all the Cα states.
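The emission and initial probabilities described above reduce to a few lines of arithmetic, sketched here with hypothetical function names. The toy usage has two amino acid types instead of 20 purely to keep the numbers readable.

```python
import math

def emission_row(predicted, background):
    """Normalized geometric mean of predicted and background probabilities for one state."""
    raw = [math.sqrt(a * b) for a, b in zip(predicted, background)]
    total = sum(raw)
    return [r / total for r in raw]

def initial_distribution(delta, first_symbol):
    """pi_i: emission probability of the first residue at state i, normalized over states."""
    column = [row[first_symbol] for row in delta]
    total = sum(column)
    return [p / total for p in column]

# Toy usage with two amino acid types and two Cα states.
background = [0.5, 0.5]
delta = [emission_row([0.9, 0.1], background),
         emission_row([0.1, 0.9], background)]
pi = initial_distribution(delta, first_symbol=0)
print(delta, pi)
```

Taking the geometric mean with the background frequency softens over-confident predictions: a predicted probability of 0.9 against a uniform background becomes an emission probability of about 0.75 in this toy case.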

Aligning protein sequence with HMM using a customized Viterbi Algorithm
The customized Viterbi algorithm is used to find the path in the HMM that generates a protein sequence with the maximum probability. The only difference between the customized Viterbi algorithm and the standard Viterbi algorithm is that the former allows a hidden state to occur at most once in the aligned hidden state path, while the latter has no such restriction. The restriction is needed because one hidden state denoting a Cα voxel can only be aligned to (occupied by) one amino acid in a protein sequence. The details of the algorithm are given in Algorithm 1, which generates a path $X = x_1, x_2, x_3, \ldots, x_T$, a sequence of states $x_t \in S$ aligned with a protein sequence (the observation O). For a multi-chain protein complex, the sequence of each chain is aligned with the HMM one by one. Once a chain is aligned, the states in the hidden path aligned with it are removed from the HMM before another chain is aligned. The alignment process ensures that any Cα state occurs at most once in a hidden state path. One unique strength of this HMM-based alignment approach is that every amino acid of the protein is assigned to a Cα position as long as the number of predicted Cα voxels is greater than or equal to the number of amino acids of the protein, which is usually the case when the 0.4 probability threshold is used to select predicted Cα atoms to construct the HMM. This is why Cryo2Struct builds very complete structural models from density maps.
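To make the at-most-once restriction concrete, the sketch below shows one possible way to add it to a log-space Viterbi recursion: each state keeps its single best partial path, and a transition into state j is only considered from partial paths that do not already contain j. This is an illustrative heuristic, not a reproduction of Algorithm 1. Because only one partial path is kept per state, it is not guaranteed to find the globally optimal non-repeating path, and the authors' implementation may enforce the constraint differently. All names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """log(p), with log(0) mapped to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi_no_repeat(obs, pi, gamma, delta):
    """Align obs (residue-type indices) to states, using each state at most once.

    Returns a list of state indices, or None if no valid path exists.
    """
    K = len(pi)
    score = [safe_log(pi[j]) + safe_log(delta[j][obs[0]]) for j in range(K)]
    paths = [[j] for j in range(K)]
    for o in obs[1:]:
        new_score = [NEG_INF] * K
        new_paths = [None] * K
        for j in range(K):
            best, best_i = NEG_INF, None
            for i in range(K):
                # Skip dead predecessors and partial paths that already used state j.
                if score[i] == NEG_INF or j in paths[i]:
                    continue
                s = score[i] + safe_log(gamma[i][j])
                if s > best:
                    best, best_i = s, i
            if best_i is not None:
                new_score[j] = best + safe_log(delta[j][o])
                new_paths[j] = paths[best_i] + [j]
        score, paths = new_score, new_paths
    end = max(range(K), key=lambda j: score[j])
    return paths[end] if score[end] > NEG_INF else None

# Toy usage: three Cα states in a row, two residue types, sequence 0-1-0.
pi = [1 / 3, 1 / 3, 1 / 3]
gamma = [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5], [0.1, 0.9, 0.0]]
delta = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]
print(viterbi_no_repeat([0, 1, 0], pi, gamma, delta))
```

For a multi-chain complex, the same function could be called once per chain after deleting the rows and columns of the states used by earlier chains, mirroring the chain-by-chain procedure described in the text. When the sequence is longer than the number of available states, no valid path exists and the sketch returns None.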

[Algorithm 1: customized Viterbi algorithm. Only its closing lines survive: the final loop ends and the function returns the path X.]

[...] and is provided as a shared library, which is then linked with the Python program that constructs the HMM.

Inference and testing
After Cryo2Struct was trained and validated, it was blindly tested on a standard test dataset of 128 density maps and a large new dataset of 500 density maps. For each test map, the Cryo2Struct inference process, consisting of the deep learning prediction and the HMM alignment, was executed on compute nodes each with a 40 GB GPU, 150 GB RAM, and 64 CPU cores. The deep learning prediction was carried out on the GPU, whereas the HMM alignment was executed on the CPU cores. Model building for the largest map (EMD ID: 40492, resolution 2.9 Å), involving 8,828 modeled residues, was completed in 9 hours on a compute node, while it took only 2.90 minutes to build a model for the smallest map (EMD ID: 36426, resolution 3.3 Å) with 234 residues.

Fig. 1 An overview of the automated prediction workflow of Cryo2Struct. Given a 3D cryo-EM density map of a protein as input (a), the Deep Learning block based on a transformer (b) generates a voxel-wise prediction of Cα atoms and their amino acid types. A clustering step (c) is used to merge nearby predicted Cα atoms into one atom to remove redundancy. The predicted Cα atoms and their amino acid type probabilities are used by the Alignment block (d) to build a Hidden Markov Model (HMM), with which a customized Viterbi algorithm aligns the sequence of the protein to generate a 3D backbone atomic structure for the protein (e). (f) shows the skeleton of the Cryo2Struct modeled structure for a test cryo-EM density map released on September 13, 2023 (EMD ID: 41624; resolution 2.8 Å), where each chain is colored differently. (g) depicts the connected Cα atoms, and (h) shows the amino acid types assigned to the Cα atoms; the modeled structure has 1,585 amino acid residues, and the F1 score of Cα atom prediction is 89.1%.
Fig. 2 The comparative analysis of atomic models built for 128 test cryo-EM maps by Cryo2Struct and Phenix in terms of six metrics. In each panel of an evaluation metric, the score of the model built by Cryo2Struct for each map is plotted against that of Phenix for the same map. A dot above the 45-degree line indicates that Cryo2Struct has a higher score than Phenix for the map. The number in the top-left corner represents the total number of maps on which Cryo2Struct has higher scores, while the number in the bottom-right corner denotes the total number of maps on which Phenix has higher scores. (a) The Cα recall of the atomic models of Cryo2Struct against Phenix; the recall is defined as the number of Cα atoms in the predicted model that are placed within 3 Å of the correct position in the corresponding known structure, divided by the total number of Cα atoms in the known structure. (b) The F1 score of Cα, which is the harmonic mean of precision and recall of Cα; it is a balanced measure quantifying a method's ability to make accurate Cα predictions while also capturing as many Cα atoms as possible. (c) The TM-score of the atomic models normalized by the length of the known structure; the normalized TM-score is calculated by using US-align to align the atomic models with their corresponding known structures. (d) The length of aligned Cα atoms; it is calculated by using US-align to align the predicted model and the known structure. (e) The Cα match score of the atomic models; it is calculated by using the phenix.chain_comparison tool to compare them with the known structures. (f) The Cα quality score; it is the product of the Cα match score and the total number of predicted residues divided by the total number of residues in the experimental structure; the total number of predicted residues is calculated by the phenix.chain_comparison tool. (g) The true structure of EMD ID: 8767 (PDB ID: 5W5F); the map was released on 2017-08-16 with a resolution of 3.4 Å. (h) The Cryo2Struct model and its scores. (i) The Phenix model and its scores.

Fig. 3 The plots of the scores (F1 score, global normalized TM-score, and Cα quality score) of the models built by Cryo2Struct and Phenix against the resolution of the 128 cryo-EM density maps. Blue dots denote models constructed by Cryo2Struct and red dots the Phenix models. The solid lines depict linear regression lines, and the colored area represents a 95% confidence interval. The confidence interval is narrower (i.e., the linear estimation is more certain) in the resolution range of 3 to 4.5 Å, where there are more data points. (a) F1 score against resolution. The equation of the regression line for Cryo2Struct (blue) is y = −0.1209x + 1.0966, while for Phenix (red), it is y = −0.1998x + 1.2618. The correlation between the F1 score of Cryo2Struct and the resolution is −0.28, while for Phenix, it is −0.40. (b) The normalized global TM-score against resolution. The equation of the regression line for Cryo2Struct is y = −0.0339x + 0.3057, while for Phenix, it is y = −0.0706x + 0.3447. The correlation for Cryo2Struct is −0.24, while for Phenix, it is −0.43. (c) Cα quality score against resolution. The equation of the regression line for Cryo2Struct is y = −14.1318x + 94.8512, while for Phenix, it is y = −17.9190x + 88.6207. The correlation for Cryo2Struct is −0.43, while for Phenix, it is −0.49.

Fig. 4 The quality of atomic models built for 500 test cryo-EM maps. The solid lines depict linear regression lines, and the colored area represents a 95% confidence interval. (a) The Cα recall versus resolution; the regression equation: −0.0466x + 0.8350; Pearson's correlation: −0.201. (b) The F1 score versus resolution; the regression equation: −0.0468x + 0.8357; the correlation: −0.202. (c) The normalized TM-score versus resolution; the regression equation: −0.0222x + 0.2762; the correlation: −0.11. (d) The Cα quality score versus resolution; the regression equation: −0.0741x + 0.7080; the correlation: −0.298. (e) The Cα sequence match score versus resolution; the regression equation: −7.9226x + 42.8422; the correlation: −0.234. (f) The Cα match score versus resolution; the regression equation: −7.4408x + 70.8924; the correlation: −0.299. (g) A modeling example. On the left is the density map (EMD ID: 16963), in the middle is the true structure (PDB ID: 8OLU), and on the right is the model built by Cryo2Struct. The structure is a hetero 28-mer with a stoichiometry of A2B2C2D2E2F2G2H2I2J2K2L2M2N2 and a weight of 848.37 kDa. The total number of modeled Cα atoms is 6,316.

Fig. 5 The high-quality models built for four test cryo-EM maps. In each panel, from left to right, are the cryo-EM density map, the true structure, and the model built by Cryo2Struct. The chains in both the true structure and the model are colored with distinct colors. The total Cα number shown in each panel is the total number of residues in a model. (a) The result for EMD ID: 17961 (PDB ID: 8PVC, released on 2023-11-29, resolution of 2.6 Å). (b) The result for EMD ID: 17287 (PDB ID: 8OYI, released on 2023-11-08, resolution of 2.2 Å). (c) The result for EMD ID: 37070 (PDB ID: 8KB5, released on 2023-10-18, resolution of 2.26 Å). (d) The result for EMD ID: 35299 (PDB ID: 8IAB, released on 2023-08-02, resolution of 2.96 Å).

Fig. 6 The deep learning architecture for backbone atom and amino acid type classification. The network takes a 32 × 32 × 32 sub-grid of a cryo-EM density map as input, with one channel representing the density value of voxels. The input is divided into a series of patches. The patches are projected into an embedding space by a 3D convolution layer, and a positional encoding is then added to them. The patches are then processed by an encoder comprising 12 identical blocks, each with a normalization layer, a multi-head self-attention layer, a normalization layer, and a multi-layer perceptron (MLP). The encoded features of blocks 3, 6, 9 and 12, denoted as ($z_3$, $z_6$, $z_9$, $z_{12}$), and the original input are integrated into the decoders via skip connections in a U-Net fashion; each decoder includes convolution and deconvolution layers with instance normalization (IN), Leaky ReLU activation, and feature concatenation. The last hidden features are used by a 1 × 1 × 1 convolution layer to generate the final 3D sub-grid output of the same size as the input, i.e., 32 × 32 × 32, with C output channels: 4 for the backbone atom type classification (Cα, N, C, and the absence of an atom) and 21 for the amino acid type classification (20 standard amino acids and no/unknown amino acid). The amino acid type classification model has 92.281893 million parameters, whereas the atom type classification model has 92.281604 million parameters.

Fig. A1 Length of structural models constructed by Cryo2Struct or Phenix versus (VS) length of the true structures in the standard test dataset.(a) Cryo2Struct models VS true structures.(b) Phenix models VS true structures.

Fig. A2 RMSD versus the length of the aligned regions of the atomic models built for 500 test cryo-EM maps. The models were aligned with the true structures by US-align. The solid line depicts the linear regression line, and the colored area represents a 95% confidence interval. The regression equation: y = −0.0001x + 1.6401; the correlation: −0.134. The average RMSD of the models is 1.60 Å. The average aligned length is 532.51, whereas the average length of the true structures is 1837.43. On average, the aligned regions of Cryo2Struct models cover about 29% of the true structure length.

Fig. A3 The quality scores of atomic models built for the 500 cryo-EM maps in the new test dataset versus (VS) the length of the true structures. The solid lines depict linear regression lines, and the colored area represents a 95% confidence interval. (a) The Cα recall VS length of true structure; the regression equation: 0.0000x + 0.6712; Pearson's correlation: 0.259. (b) The F1 score VS length of true structure; the regression equation: 0.0000x + 0.6714; the correlation: 0.258. (c) The normalized TM-score VS length of true structure; the regression equation: −0.0000x + 0.2328; the correlation: −0.214. (d) The Cα quality score VS length of true structure; the regression equation: 0.0000x + 0.4863; the correlation: 0.066. (e) The Cα sequence match score VS length of true structure; the regression equation: −0.0002x + 20.4579; the correlation: −0.025. (f) The Cα match score VS length of true structure; the regression equation: 0.0004x + 48.6615; the correlation: 0.065.

Fig. A4 A Hidden Markov Model (HMM) used for aligning protein sequences with predicted Cα atoms (voxels) to generate protein backbone traces. (a) The states of the fully connected HMM. A hidden path can start from or end at any Cα state. It is worth noting that there is no gap state in the HMM, and therefore every amino acid in a protein sequence can be aligned to one Cα atom. (b) The emission probabilities of the hidden Cα states are the normalized geometric means of the predicted amino acid type probability and the background (prior) probability for the 20 amino acids in nature, referred to by their abbreviations.
3. Align the amino acid sequence (i.e., $O = O_1, O_2, O_3, \ldots, O_T$, where $O_t \in V$; V: the set of 20 standard amino acids) with the HMM λ to find the most likely Cα state sequence (path) ($X = x_1, x_2, x_3, \ldots, x_T$, where $x_t \in S$; S: the set of Cα hidden states) that generates the sequence, which forms the backbone structure of the protein.