Robust residue-level error detection in cryo-electron microscopy models

Building accurate protein models into moderate-resolution (3-5Å) cryo-electron microscopy (cryo-EM) maps is challenging and error-prone. While the majority of solved cryo-EM structures are at these resolutions, there are few model validation metrics that can precisely evaluate the local quality of atomic models built into these maps. We have developed MEDIC (Model Error Detection in Cryo-EM), a robust statistical model to identify residue-level errors in protein structures built into cryo-EM maps. Trained on a set of errors from obsoleted protein structures, our model draws on two major sources of information to predict errors: the local agreement of model and map compared to expected agreement, and how “native-like” the neighborhood around a residue looks, as predicted by a deep learning model. MEDIC is validated on a set of 28 structures that were subsequently solved to higher resolutions, where our model identifies the differences between low- and high-resolution structures with 68% precision and 60% recall. We additionally use this model to rebuild 12 deposited structures, fixing 2 sequence registration errors, 51 areas with improper secondary structure, 51 incorrect loops, and 16 incorrect carbonyls, showing the value of this approach to guide model building.


INTRODUCTION
While technological advances in cryo-electron microscopy (cryo-EM) have made it possible to resolve protein complexes to resolutions rivaling X-ray crystallography [1], protein heterogeneity has limited the resolution for the majority of complexes, with 78% of cryo-EM maps deposited in the past year reporting a resolution worse than 3Å [2]. As the resolution drops from 3 to 5Å, modeling becomes increasingly difficult; the carbonyls become indistinguishable from the backbone density, side chain details are lost, and eventually, the backbone trace is no longer visible. Hand-built models at these resolutions can contain sequence registration errors, poor secondary structure, improper tracing of the backbone through the density, and incorrectly placed backbone carbonyls. There are several instances of models that have been deposited and published with errors that are found later by the community [3,4]. Methods like AlphaFold and RoseTTAFold [5,6] may help in alleviating these errors, but these methods' inability to model structures with multiple conformations and their limited accuracy in modeling protein complexes will still lead to model errors.

Previous efforts to identify model errors rely on metrics that primarily fall into one of two categories: model quality metrics that focus on atomic geometry [7,8], and density metrics that focus on local fit-to-density [9-13]. Model quality metrics, such as the fraction of Ramachandran outliers, are not precise enough to catch local mistakes at these resolutions. Refinement protocols can easily push a wrong model to have good quality under these metrics. CaBLAM addresses this by defining a dihedral for the carbonyls in relation to the backbone trace and identifying when this angle deviates from expected values; however, due to its high cutoff value, CaBLAM is unsuitable for residue-level accuracy [11].
Density-based metrics have two major weaknesses: many are noisy at the level of individual residues and are better suited to evaluate a model's global quality [12,13], while density-based metrics that robustly evaluate local fit rely heavily on side chain density, making them less reliable at resolutions below 3.5Å [14]. Furthermore, microscopists have a tendency to overfit their models to low-resolution density, so density fit by itself is not always enough to evaluate whether an error has been made [15].

Here, we present MEDIC (Model Error Detection in Cryo-EM), a statistical model that weighs the contributions of structural information and local model-map agreement to identify residue-level errors in a cryo-EM structure. The structural features of our model include both energy-guided metrics and a predicted error from a machine learning model trained to discriminate native and decoy structures. The use of a machine learning model to assess model geometry allows evaluation of non-bonded interactions such as hydrophobic burial, making it robust when used with lower-resolution data. We combine these structural features with a measure assessing agreement to density, conditioned on data collected at a wide range of resolutions. We show reliable detection of errors on a set of 28 structures which were later solved to higher resolutions. On a smaller set of 12 deposited structures, we correct over 100 mistakes marked by our protocol with existing tools. Finally, we demonstrate that MEDIC can guide rebuilding in areas where AlphaFold models cannot.

RESULTS
An overview of training and usage of MEDIC is shown schematically in Figure 1.

MEDIC is trained to predict a probability of error for every residue, based on three features (Figure 1A): energy-guided metrics for Ramachandran angles and bond deviations from Rosetta's energy function [16], expected fit-to-density for a residue given the local resolution and the amino acid identity, and predicted model error from DeepAccuracyNet [17]. DeepAccuracyNet is a deep convolutional neural network trained to distinguish native protein structures from Rosetta-determined decoys. It predicts the per-residue local Distance Difference Test (lDDT), a measure of the number of atom pair distances that are maintained between a native structure and a decoy [18]. For our fit-to-density metric, we use masked real-space cross-correlation to measure density fit, and then normalize that value based on statistics for each residue identity at its local resolution, gathered from a set of deposited map-model pairs between the resolutions of 1.5 and 5Å.

Given these three features, a combined model was trained using a set of seven obsoleted protein structures which had been edited months after the initial deposition, presumably to correct structure errors. Our combined logistic regression model was trained to predict the residues that changed between the original and most recent deposition. We validated this initial model on an additional 3 obsoleted structures which had been withheld from training. We compared MEDIC's error probabilities to the residues that changed between these depositions and found that MEDIC had a precision of 76% at a recall of 60% (Supplemental Fig 1). Given the high performance on this initial set, we then used this model to evaluate deposited structures more generally (Figure 1B). Throughout our analysis, we divide these probabilities into three categories: definite error, possible error, and non-error (see Methods).
Data analysis is performed only with the high-probability errors, while each image is colored according to the three categories.

Validation on low-resolution structures later solved to higher resolutions

To validate our approach, we considered EMDB-deposited structures between 3.5 and 5Å resolution which were subsequently solved to better than 3.5Å (and at least 1Å better than the original deposition). There were 68 cases, of which we manually removed 40 with domain orientation changes between the high-resolution and low-resolution structures. The results on this dataset are summarized in Figure 2A. On this set of 28 structures, our method has a precision of 68% at a recall of 60%. This compares favorably with the widely used density-only metric, Q-score [9], which has a precision of only 35% at the same 60% recall.

We next examined which features were predictive of the true positives identified by MEDIC. Approximately 81% are predicted by the lDDT alone, while the remaining 19% require at least 2 features to be considered an error. The reliance on lDDT to predict most of the errors could be because of bias in the training set, which primarily contains long segments that were corrected. It might also simply reflect the types of errors microscopists tend to make; hand-built models are much more likely to fit the density well but have poor geometry and structural features.

Some of the errors identified by MEDIC in the low-resolution structures are highlighted in Figure 2B, with the corresponding model in its high-resolution density map in Figure 2C. In a structure of a voltage-gated calcium channel (PDB 5GJW), it is difficult to trace the backbone while properly accounting for the large aromatic side chain density (Figure 2B, top panel). The mistake is identified by MEDIC with relatively equal contributions from the lDDT and bond geometry scores.
Likewise, the error found in an insulin degradation enzyme (PDB 6B70) is captured by multiple features, this time the density and bond geometry scores (Figure 2B, middle panel). The backbone is hardly visible in the density map, which may explain why the microscopists had difficulty properly fitting the serines into the density. In contrast, the mistake found in a transmembrane channel (PDB 6M66) is dominated by the lDDT score (Figure 2B, bottom panel). It would be difficult to catch this error by visual inspection, as the model seems reasonable given the density.

To better understand any shortcomings of MEDIC, we looked at two structures for which our performance was worse than the aggregate results. In a partial complex of an ATP synthase (PDB 6F36), MEDIC falsely marks an entire stretch of residues as a mistake (Figure 2D) because it does not see the proper structural context for this particular sequence, as it is unmodelled in the low-resolution structure (Figure 2E). The other case on which MEDIC performed poorly, a dehydrogenase (PDB 7E5Z), contained many errors fewer than 3 residues in length which MEDIC failed to identify, two of which are shown in Figures 2F-I. We fail to mark an incorrect carbonyl as an error in the low-resolution model (Figure 2F) that is supported by the higher-resolution data (Figure 2G). However, we find zero high-probability errors in a region of the low-resolution model (Figure 2H) which appears to be an error in the high-resolution model (Figure 2I).

Given our worse performance on the errors in the dehydrogenase (PDB 7E5Z), we manually examined 30 differences across 4 low-resolution structures that MEDIC failed to identify.
Among these, 16 were mistakes in the model built against low-resolution data, while 14 were either ambiguous in the high-resolution density or seemingly incorrect in the high-resolution model, including cases where the high-resolution structure has an error (Supplemental Fig. 2A-B) and two more where the high-resolution structure is not supported by the density (Supplemental Fig. 2C-F).

Using MEDIC to guide model rebuilding

With the understanding that MEDIC is relatively precise when identifying errors, we next wanted to assess the usefulness of the model to aid in a manual structure rebuilding process. To that end, we evaluated MEDIC on a selection of 12 models with diverse topologies and resolutions and attempted to fix, using Rosetta refinement tools and AlphaFold, all the segments marked as errors (see Methods). There were 237 segments predicted to be definite errors (with high error probability), 33 of which were disordered regions with little or no visible density (Supplemental Fig. 3). Of the remaining 204 segments, 133 (65%) were 1-3 residues in length, 38 (19%) between 4-9 residues, and 33 (16%) were greater than 10 residues. We were able to rebuild and fix 120 (59%) of these segments; for an additional 26 segments, we were able to significantly reduce the number of definite errors in that region. The fixable mistakes included 2 sequence registration errors, where the sequence is shifted on the backbone relative to the correct placement, 51 incorrect loops, 51 cases of poor secondary structure, and 16 flipped carbonyls (Table 1).

A representative subset of errors that our method was able to address is highlighted in Figures 3 and 4. In these cases, we were able to correct 2 significant sequence registration errors (Figure 3). Figure 3A compares the deposited structure of a lipid scramblase (PDB 6E1O) with our new model.
Notably, our model has better hydrophobic packing, and we explain the large side chain density with a phenylalanine as opposed to a lysine residue (Figure 3B). This sequence registration error was propagated from a previously solved crystal structure (PDB 4WIS), in which the density for this region was poorly resolved. In both structures, this helix is preceded and followed by unresolved regions, making proper sequence placement more difficult. Conversely, the sequence registration error found in a hedgehog receptor (PDB 6DMB) occurs because the pitch of the helix is not visible in the density (Figure 3C). The addition of a bulge in the repaired model (Figure 3D), justified by the preceding proline, pushes a phenylalanine into large side chain density which was poorly explained by an alanine in the original model.

MEDIC is also capable of finding gross backbone errors, including areas with poor secondary structure and incorrect loops. In Figure 4A, it is clear by eye that the beta strands of this kinesin motor domain (PDB 5MM4) have poor hydrogen bonding. Upon fixing the secondary structure (Figure 4B), our method marks these regions as correct, as MEDIC balances proper structural features with density fit. In addition to identifying poor structural features, MEDIC can recognize if a stretch of residues is assigned the incorrect secondary structure, such as the region from a hedgehog receptor (PDB 6DMB) depicted in Figure 4C. Here, our fixed model is supported by more than the lDDT score; it has less unexplained density, which is reflected by large improvements in the density scores (Figure 4D).

Furthermore, MEDIC can identify some shorter, subtler backbone errors, such as incorrectly placed carbonyls, by combining multiple features (Figure 4E-H).
The deposited model of the bluetongue virus (PDB 3J9E) has a Ramachandran angle that falls just inside the "Allowed" region (Figure 4E). MEDIC uses the lDDT and the bond geometry scores to predict this error, and after rebuilding, both the Ramachandran angles and the density fit improve (Figure 4F).

Similarly, the structure of a neurotoxin (PDB 7QFQ) contains Ramachandran angles which MolProbity also classifies as "Allowed" (Figure 4G). We find this error with relatively equal contributions from the lDDT, density, and geometry energies. The rebuilt model improves the density fit for the tryptophan and alanine residues while removing the problematic Ramachandran angles (Figure 4H). Of the over 1300 residues identified as errors across these 12 models, approximately 66.5% were predicted by the lDDT score alone, 1.4% by the density, and 0.4% by the Ramachandran energy, while 32% required at least 2 features.

To quantify MEDIC's performance on this set of structures, we used the differences between the deposited structures and our rebuilt models (see Methods) to determine that MEDIC has a precision of 67% at a recall of 60% (Supplemental Fig. 4). The increased performance of MEDIC at high recall values compared to the low- vs. high-resolution validation set could be attributed to a few factors. In the set of validation structures, it is possible that the high-resolution models contain errors. Moreover, there could still be conformational differences between the high- and low-resolution structures, such as flexible loops or shifts that occur at interfaces contained in only one of the depositions. Both would hurt MEDIC's perceived performance.

Identifying errors in all deposited structures in the EMDB
After confirming MEDIC's high accuracy and utility in model building, we ran MEDIC on all structures in the EMDB between the resolutions of 3 and 5Å to gauge the reliability of the method on over 1500 depositions. The aggregate statistics from this run are shown in Figure 5.

Upon inspection, several models were composed of docked crystal structures with no visible density for one or more domains, so we removed residues with a model-map correlation of less than 0.4. In Figure 5A, we show the fraction of residues marked as errors in every EMDB deposition. There is only a slight trend with resolution, which is unsurprising given that, as we move to lower resolutions, microscopists are more likely to dock crystal structures or use homology modeling than hand-build structures. Because cryo-EM maps are rarely homogeneous in resolution, we also report the fraction of residues marked as errors after grouping by atomic B-factors (Figure 5B). At very low atomic B-factors (indicating well-resolved density), very few errors are made. As the atomic B-factors increase, more mistakes are made.

We manually inspected the outliers in the data: maps with very high error fractions, and errors with low atomic B-factors. Although the fraction of errors is greater than 40% on the 20 model-map pairs we examined, the errors do seem real. In some cases, entire domains have little to no secondary structure (Supplemental Fig. 5A-B). All of these structures were built pre-AlphaFold, using outdated (then state-of-the-art) structure prediction software or by hand-tracing into low-resolution data. Unsurprisingly, we find that 88% of the errors in this set are predicted by the lDDT alone. In the structures that contain errors with low atomic B-factors, we find that while some errors appear to be real, there also appear to be false positives. There are several causes for the perceived false positives, including residues marked as errors because they are involved with ligand or metal binding, or because they correspond to very short disordered segments (Supplemental Fig 5C-E).

Comparison to AlphaFold
Although it is clear that MEDIC can identify errors in hand-built structures, many microscopists will now start model building from an AlphaFold prediction [19]. We compare MEDIC's performance to AlphaFold models, highlighting loops which we identified as an error in the original deposition (Figure 6A & 6D) and where AlphaFold predictions do not fit the density. The loop predicted by AlphaFold for the motor protein prestin (PDB 7S9D) would require significant rebuilding (Figure 6B). MEDIC identifies our new model, built with tools in Rosetta, as correct (Figure 6C). The shorter loop predicted by AlphaFold for the bluetongue virus (PDB 3J9E) is not only a poor fit to density (Figure 6E); the carbonyls are placed incorrectly when compared to our final model (Figure 6F). Of the 12 models we rebuilt, 23 regions (from 7 different AlphaFold models) would have required rebuilding. AlphaFold was confident (predicted lDDT > 70) in 10 of these regions, which means that modelers would need to manually identify these mistakes, not just remove low-confidence regions, and then rebuild, presumably by hand. MEDIC will be useful for this editing process: our method was able to identify that the deposited structure or our rebuilt model was correct in 18 of those 23 regions. In the remaining 5 cases, we were unable to build a structure that satisfied MEDIC.

DISCUSSION
In this report, we develop a method for the identification of local errors in cryo-EM models in the resolution range of 3-5Å. We validate our method on cryo-EM structures that have later been solved to higher resolutions and show that MEDIC has a precision 30% better than competing methods. We also highlight the use of MEDIC in the model building process by identifying and correcting over 100 errors in a set of 12 deposited models and demonstrating MEDIC's use in conjunction with AlphaFold. While many of the errors are predicted by lDDT alone, we also find errors that make use of structure and density in tandem. Of the errors we examined, we noticed that MEDIC erroneously marks the following: prolines, termini, residues involved in binding, and regions where there is little to no supporting density (Supplemental Fig. 3). We believe prolines have a higher false positive rate because their geometry scores tend to be higher, and we caution users to be critical of isolated prolines which MEDIC calls errors.

As it becomes more commonplace to model large protein complexes into lower-resolution density maps [20], validation metrics that can evaluate these structures and help guide rebuilding are necessary. MEDIC's performance on structures with resolutions worse than 5Å has not been tested, and given that our statistics for density did not include these resolutions, it is unclear how reliable our method will be in those cases. MEDIC could be extended to lower resolutions by gathering more statistics and by measuring density fit across longer stretches of sequence, making it suitable for use with cryo-electron tomography. A training set could be curated from low-resolution structures which are later solved to higher resolutions by removing regions with different domain orientations and regions of ambiguity.
Incorporating AlphaFold models into the training set may also be useful, so that MEDIC more explicitly learns to find regions which have good structural features but do not fit the density well.

AlphaFold has not only made it possible to model lower-resolution structures, it has drastically changed the model building process for higher-resolution structures as well. Now microscopists will edit loops or interaction sites rather than build entire structures. For large complexes, identifying and fixing errors in AlphaFold models can still be error-prone and time-consuming, especially if these are flexible regions solved to lower local resolutions. Creating a program to automatically dock these models and fix any remaining errors would reduce the amount of time and expertise necessary to solve structures. MEDIC could be used to guide this rebuilding process; our method's high precision would substantially reduce the sampling space, which makes the problem of automatically fixing local errors much more tractable. Based on the observations described here, we believe that MEDIC will be a powerful validation tool for cryo-EM microscopists.

METHODS

Preparation of input pdbs
Preparation of pdbs for training or for error detection is a three-step process. First, we remove all ligands, nucleotides, and noncanonical amino acids. Then we refine the structure into the density map, first with Cartesian minimization and then with Rosetta's LocalRelax protocol [21]. Finally, we perform B-factor fitting on the refined model. After this, all the scores for the model features can be calculated.

Structural features
The energy-guided metrics in our model are pulled from Rosetta's energy function [16]. Every pdb is refined in Rosetta as described above, so that the energy scores are meaningful. Then, the energies for Ramachandran angles and bond deviations are evaluated for each residue in the structure and fed directly into the model.

The final structural feature, predicted lDDT, comes from DeepAccuracyNet [17]. Because DeepAccuracyNet was trained on smaller structures, <300 residues in length, we run the model on portions of the structure at a time: a sequence of 20 residues and the context within 20Å of that query sequence. The predicted lDDT values are saved for only the query sequence and then passed to the model.
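The windowed evaluation described above can be sketched roughly as follows. This is a minimal illustration assuming CA coordinates in an (N, 3) numpy array; the window and context sizes come from the text, while the function and variable names are ours, not MEDIC's:

```python
import numpy as np

def windows_with_context(ca_coords, window=20, context_radius=20.0):
    """Yield (query, subset) index pairs: a 20-residue query window plus
    every residue whose CA lies within 20 A of any query CA, so that a
    per-residue predictor can be run on large structures piecewise."""
    n = len(ca_coords)
    for start in range(0, n, window):
        query = np.arange(start, min(start + window, n))
        # distance from every residue to the nearest residue in the window
        dist = np.linalg.norm(
            ca_coords[:, None, :] - ca_coords[None, query, :], axis=-1
        ).min(axis=1)
        subset = np.flatnonzero(dist <= context_radius)
        yield query, subset
```

Predictions would then be retained only for the `query` indices of each pair, mirroring how the predicted lDDT values are saved for only the query sequence.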

Density feature
To calculate expected fit-to-density for amino acids, we collected statistics on a set of 24 deposited map-model pairs, using atomic B-factors as a substitute for local resolution. Each model was prepared as described above. The masked real-space density cross-correlation was calculated for every residue, and each residue was placed into a bin according to its amino acid identity and the average B-factor of the residues within an 8Å neighborhood. A mean of the cross-correlation scores was computed for each amino acid/B-factor bin, and a standard deviation was calculated across each B-factor bin.

Once these statistics have been collected, we can apply them during error prediction. The means and standard deviations are used to transform the cross-correlation of each residue in a protein model into a z-score. A very negative density z-score is indicative of a residue which fits the density worse than expected, given its amino acid identity and the average B-factor. The density z-score is then passed to the model. This process of collecting statistics and transforming raw scores is carried out for the cross-correlation of the residue by itself and for the cross-correlation of a three-residue window centered on the residue of interest.
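As a rough sketch, the normalization step might look like the following; the B-factor bin width and the dictionary layout of the collected statistics are assumptions for illustration, not the exact implementation:

```python
def density_zscore(cc, aa, bfactor, stats, bin_width=20.0):
    """Normalize a raw masked cross-correlation into a z-score.

    cc      -- raw real-space cross-correlation for one residue
    aa      -- residue identity, e.g. 'PHE'
    bfactor -- mean atomic B-factor within the 8 A neighborhood
    stats   -- precomputed reference statistics:
               stats['mean'][(aa, b_bin)] and stats['std'][b_bin]
    """
    b_bin = int(bfactor // bin_width)
    mean = stats["mean"][(aa, b_bin)]   # per amino-acid/B-factor bin
    std = stats["std"][b_bin]           # per B-factor bin
    return (cc - mean) / std
```

A strongly negative return value flags a residue fitting its density worse than comparable residues in the reference set, which is the signal passed to the model.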

Training on obsoleted pdbs
We probed the RCSB for pdbs which had been edited after deposition, pulling all cryo-EM structures between 2.5 and 4Å resolution that had coordinates replaced [22]. Upon manual inspection, 10 of the 46 models were chosen, eliminating cases where changes were made to ligands, nucleotides, or only rotamers, or where the obsoleted model did not resemble a globular protein. 3 of the 10 models were withheld from training and used for validation.

This gives a set of pdbs that contain mistakes made by microscopists, from which we need to generate labels for training, marking each residue in a model as an "error" or "non-error." We compare the obsoleted pdb with the newer version, removing any domains or regions that exist in only one of the models. Each residue in which the backbone atoms have an RMSD greater than or equal to 1Å between the two models is marked as an error. To capture sequence registration errors, any residue that appears in the obsoleted model but not the new version is marked as an error. This process resulted in approximately 800 errors out of a total of 21000 residues. We then trained a logistic regression classifier, with balanced class weights, to predict the errors using the structural and density features.
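A minimal sketch of the labeling and training step follows. The plain-numpy gradient-descent fit stands in for a standard class-weight-balanced logistic regression (e.g. scikit-learn's `LogisticRegression(class_weight="balanced")`); array shapes and names are illustrative:

```python
import numpy as np

def label_errors(bb_old, bb_new, cutoff=1.0):
    """Per-residue backbone RMSD between matched models -> 0/1 labels.
    bb_old, bb_new: (n_res, n_bb_atoms, 3) coordinate arrays."""
    rmsd = np.sqrt(((bb_old - bb_new) ** 2).sum(-1).mean(-1))
    return (rmsd >= cutoff).astype(int)

def train_logreg(X, y, lr=0.1, steps=2000):
    """Logistic regression with balanced class weights w_c = n / (2 * n_c),
    fit by weighted gradient descent on the log loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    cls_w = {c: n / (2.0 * max((y == c).sum(), 1)) for c in (0, 1)}
    sample_w = np.where(y == 1, cls_w[1], cls_w[0])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(error)
        g = sample_w * (p - y)                   # weighted residual
        w -= lr * (X.T @ g) / n
        b -= lr * g.mean()
    return w, b
```

Here `X` would hold one row per residue with the five feature columns (lDDT, the two density z-scores, and the Ramachandran and bond energies), and `y` the 0/1 error labels.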

Evaluation of error vs non-error
To determine the threshold at which a residue is an error, we chose a threshold value from the precision-recall curve which balances the two statistical measures. We use the precision-recall curves from both the 12 rebuilt models and the high- vs. low-resolution validation set to choose thresholds. We consider every residue with a probability above 0.78 to be a definite error. At a threshold of 0.78, MEDIC has a precision of 70% and recall of 80% on the set of 12 rebuilt structures, and a precision of 78% and recall of 49% on the validation set. All statistics and data analysis are done only with this more stringent threshold value. We consider every residue with a probability between 0.6 and 0.78 to be a possible error. At a threshold of 0.60, MEDIC has a precision of 52% and recall of 95% on the 12 rebuilt structures, and a precision of 68% and recall of 61% on the validation set. Every residue with a probability less than 0.6 is a non-error.
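The two cutoffs can be wrapped in a small helper; the thresholds are the ones quoted above, while the handling of probabilities falling exactly on a boundary is our assumption:

```python
def categorize(prob, definite=0.78, possible=0.60):
    """Map a MEDIC per-residue error probability onto the three
    categories used throughout the paper."""
    if prob >= definite:
        return "definite error"
    if prob >= possible:
        return "possible error"
    return "non-error"
```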

Calculation of error contributions
To determine whether a single feature is predictive of an error, we take the probability equation that we have learned from the final training dataset (Eq. 1), where x1 is the lDDT score, x2 the single-residue density score, x3 the 3-residue density score, x4 the Ramachandran energy, and x5 the bond energy:

z = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5    (Eq. 1)

P(error) = 1 / (1 + exp(-z))    (Eq. 2)

We replace all features, except the ones of interest, with the mean score, derived from the scores of the EMDB depositions (over 1500 cases). For example, we replace the lDDT, Ramachandran, and bond energies with the corresponding mean values to calculate how predictive the density scores are. We then take the result from Eq. 1 and plug it into Eq. 2 to get the final probability. If the final probability is above our threshold for definite errors, then that residue is predicted by a single feature.

Error identification on deposited structures and retraining

We identified cryo-EM structures with fewer than 2000 residues and a resolution between 3 and 5Å. We then chose 12 structures with diverse topologies and resolutions to run through our error detection, using the statistical model obtained from training on the obsoleted pdbs. We used a probability threshold of 0.62, derived from the precision-recall curve for the small set of 3 withheld obsoleted structures. We chose a slightly lower threshold, sacrificing precision (60%) for recall (85%), to ensure that we would find most of the errors.

After error identification, we attempted to rebuild every region that was predicted to be an error, following the protocol described below. We then added these 12 models to our training data. We generated error labels by looking first for residues with RMSDs greater than 1.5Å after rebuilding, for which the probability was greater than 0.5 and had dropped by 0.2 after fixing. We also labeled residues with RMSDs between 0.5 and 1.5Å with probabilities greater than 0.6 that dropped by 0.2.
Any 1-residue errors from this set were removed if they were not within 2 residues of other errors. These labels and scores were passed into the logistic regression along with the obsoleted pdbs, adding an additional 1200 errors to the dataset.
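The single-feature attribution described under "Calculation of error contributions" can be sketched directly from Eq. 1 and Eq. 2; the weight vector and feature ordering here are illustrative placeholders, not the fitted MEDIC coefficients:

```python
import numpy as np

def single_feature_probability(x, i, means, w, b):
    """Probability of error with all features except feature i replaced
    by their EMDB-wide means.

    x     -- feature vector for one residue
    i     -- index of the feature of interest
    means -- EMDB-wide mean of each feature
    w, b  -- learned logistic regression weights and intercept
    """
    x_masked = means.copy()
    x_masked[i] = x[i]
    z = float(np.dot(w, x_masked) + b)   # Eq. 1: linear predictor
    return 1.0 / (1.0 + np.exp(-z))      # Eq. 2: logistic link
```

A residue counts as "predicted by a single feature" when this masked probability still clears the definite-error threshold.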

Model rebuilding
For each rebuild, we ran AlphaFold on the sequence [5], docking the model or separately docking its domains into the density using UCSF Chimera [23]. Then, we removed all regions in the deposited model that were identified as errors, plus 2-3 residues on either side of the segment. We passed the AlphaFold models and the trimmed deposited model as templates to RosettaCM [24]. We ran at least 2 rounds of iterative RosettaCM, passing the top 5 models out of the total 50 to the next round. Additional rounds were run if model convergence for the top 5 was poor or if additional errors were detected by MEDIC and MolProbity. Any remaining regions which AlphaFold or RosettaCM were not able to fix were built with RosettaES [25]. Success in rebuilding was determined by how well regions matched the density by eye, MolProbity scores, and MEDIC predictions. All images of these structures were made in ChimeraX [26].

High- and low-resolution structure validation
We pulled all cryo-EM structures between 3.5Å and 10Å for which there was another deposition with the same UniProt ID and at least 1Å higher resolution, with a maximum of 3.5Å. If the query structure had a model-map FSC worse than 10Å, the pair was thrown out. From this initial pool of 68 structures, 40 pairs were tossed because there were significant conformational changes caused by image processing, ligand binding, or physiological conditions.

For the remaining 28 pairs of structures, the high-resolution structure was docked and refined into the low-resolution map, and the low-resolution structure was refined into its own density [21]. The backbone RMSD between the two structures was calculated for every residue, and all residues with at least 1Å RMSD were labeled as errors. Residues that only existed in one model of the pair were tossed and not used in validation. Error detection was then run on the low-resolution structure using the statistical model from the larger dataset, and precision-recall curves were calculated with the described labels.
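The precision-recall sweep used for these curves can be computed directly; this is a pure-numpy stand-in for a library routine such as scikit-learn's `precision_recall_curve`, evaluated at every predicted probability:

```python
import numpy as np

def precision_recall(probs, labels):
    """Precision and recall at each threshold, sweeping the per-residue
    error probabilities from highest to lowest.

    probs  -- predicted error probabilities, one per residue
    labels -- 0/1 ground-truth error labels (e.g. from the RMSD rule)
    """
    order = np.argsort(-probs)            # descending by probability
    tp = np.cumsum(labels[order] == 1)    # true positives so far
    fp = np.cumsum(labels[order] == 0)    # false positives so far
    precision = tp / (tp + fp)
    recall = tp / max(labels.sum(), 1)
    return precision, recall
```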

Comparison to Q-scores
To obtain a precision-recall curve for Q-scores, we first generated Q-scores for each residue in the structure. We then subtracted the Q-score for the residue from the expected Q-score based on the global resolution for that map. This procedure mimics the recommended usage of Q-scores, where modelers are advised to examine residues which drop below the expected value. The difference between the expected and actual Q-score is then used to calculate the precision-recall curve.

Identifying errors in all deposited structures in the EMDB

We pulled every deposited cryo-EM structure with a resolution between 3 and 5Å, removing approximately 300 structures for which the model-map FSC at 0.5 was worse than 10Å. Then we prepared each pdb as described above and ran the statistical model from the combined dataset to detect errors. Of the 2037 structures that met our criteria, MEDIC successfully ran on 1713 (87.4%). To remove regions of disorder, we toss out all residues for which the density cross-correlation is less than 0.4 in all subsequent analyses.

ACKNOWLEDGEMENTS
Funding for this research was provided by NIH R01-GM123089 (FD, GR). We are grateful to Nao Hiranuma for DeepAccuracyNet support.

FIGURE LEGENDS (partially recovered)

Figure 1. (A) [...] after deposition, we marked every residue for which the backbone moved between the two versions as an error (red) and collected scores from each of our features on all residues. These labels and scores were fed to logistic regression, which gives us the statistical model, MEDIC. (B) To use MEDIC, provide a map/model pair to the program. We calculate the scores for each of our features, which are then passed to MEDIC. MEDIC predicts a probability that each residue is an error, where higher probability is indicative of an error.

[Fragment:] ... marked as an error with atomic B-factors between X-10 and X.