Accurate prediction of transition metal ion location via deep learning

Metal ions are essential cofactors for many proteins. In fact, currently, about half of the structurally characterized proteins contain a metal ion. Metal ions play a crucial role for many applications such as enzyme design or design of protein-protein interactions because they are biologically abundant, tether to the protein using strong interactions, and have favorable catalytic properties e.g. as Lewis acid. Computational design of metalloproteins is however hampered by the complex electronic structure of many biologically relevant metals such as zinc that can often not be accurately described using a classical force field. In this work, we develop two tools - Metal3D (based on 3D convolutional neural networks) and Metal1D (solely based on geometric criteria) to improve the identification and localization of zinc and other metal ions in experimental and computationally predicted protein structures. Comparison with other currently available tools shows that Metal3D is the most accurate metal ion location predictor to date outperforming geometric predictors including Metal1D by a wide margin using a single structure as input. Metal3D outputs a confidence metric for each predicted site and works on proteins with few homologes in the protein data bank. The predicted metal ion locations for Metal3D are within 0.70 ± 0.64 Å of the experimental locations with half of the sites below 0.5 Å. Metal3D predicts a global metal density that can be used for annotation of structures predicted using e.g. AlphaFold2 and a per residue metal density that can be used in protein design workflows for the location of suitable metal binding sites and rotamer sampling to create novel metalloproteins. Metal3D is available as easy to use webapp, notebook or commandline interface.


27
Metalloproteins are ubiquitous in nature and are present in all major enzyme families. 1,2 The metals 28 predominantly found in biological systems are the first and second row alkali and earth alkali metals 29 and the first row transition metals such as zinc and copper. Zinc is the most common transition 30 metal (present in~10% of deposited structures) and can fulfill both a structural (e.g. in zinc finger 31 proteins) or a catalytic role in up to trinuclear active sites. Zn 2+ is an excellent Lewis acid and is most 32 often found in tetrahedral, pentavalent, or octahedral coordination. About 10 % of all reactions 33 catalyzed by enzymes use zinc as cofactor 3 . 34 Metalloproteins are well studied because metal cofactors are essential for the function of many 35 proteins and loss of this function is an important cause of diseases. 4 Industrial applications for 36 metalloproteins capitalize on the favorable catalytic properties of the metal ion where the protein 37 environment dictates (stereo)-selectivity. 5-7 To crystallize proteins, metal salts are also often added 38 to the crystallization buffer as they can help in the formation of protein crystals overcoming the 39 enthalpic cost of association of protein surfaces. Metal ion binding sites can be used to engineer 40 protein-protein interactions (PPI) [10][11][12] and the hypothesis has been put forward that one origin of 41 macromolecular complexity is the superficial binding of metal ions in early single domain proteins. 12 42 While simple metal ion binding sites can be rapidly engineered because initial coordination on 43 a protein surface can for example be achieved by creating an i, i+4 di-histidine site on an alpha-44 helix 13 or by placing cysteines in spatial proximity, 14 the engineering of complex metal ion binding  Metal3D and Metal1D is based on experimental Zn 2+ sites. Metal1D extracts coordination environments from LINK records, Metal3D is a fully convolutional 3DCNN trained to predict the metal density from voxelized protein environments. B In inference mode Metal3D predicts the location of a metal ion by computing per residue metal densities and then averaging them to obtain a global metal density for the input proteins. The ions can then be placed using the weighted average of voxels above a cutoff. For Metal1D all residues in the protein are scanned for compatibility with the probability map. Metals are placed at the geometric center of residues with high scores according to the probability map. A final ranking of sites is obtained using the probability map.
(within a 16 x 16 x 16 Å 3 volume) can then be averaged to yield a zinc density for the whole protein.
the coordinates of the metal ion itself. Both Metal1D and Metal3D can predict the coordinates of 118 putative binding sites. We therefore assessed their performance by comparing to recent binding 119 location predictors with available code/webserver: BioMetAll 26 and MIB. 23 The main tuning pa-120 rameter of MIB is the template similarity t, with higher values requiring higher similarity of the 121 templates available for the search in structurally homologous metalloproteins. BioMetAll on the 122 other hand was calibrated on available protein structures and places probes on a regular grid at all 123 sites where the criteria for metal binding are fulfilled. The main adjustable parameter for BioMetAll 124 is the cluster cutoff c, which indicates how many probes in reference to the largest cluster a specific 125 cluster has. We used the recommended cutoff of 0.5 requiring all chosen clusters to have at least 126 50% of the probes of the most populous cluster and used the cluster center to compute distances. 127 We first investigated the potential of all tools to detect the location of a zinc ion binding site in a 128 binary fashion (zinc site or no zinc site). We defined a correctly identified binding site (true positive, 129 TP) as a prediction within 5 Å of an experimental zinc site. In case a tool predicted no metal within 130 the 5 Å radius, we counted this site as false negative (FN). False positive (FP) predictions, i.e sites 131 where a metal was placed spuriously, were clustered in a 5 Å radius and counted once per cluster.

132
All tools were assessed against the held out test biological assemblies for Metal3D and Metal1D. 133 When the performance of MIB (t=1.25) and BioMetAll is compared against Metal3D with probability 134 cutoff p=0.75 we find that Metal3D identifies more sites (85) than MIB (78) or BioMetAll (75)  compared to MIB and BioMetAll. We removed 56 sites from the list of zinc sites in the test set (189 139 total) that had less than 2 unique protein ligands within 2.8 Å of the experimental zinc location.

140
The amount of correct predictions in this reduced set is almost unchanged for all tools (Figure 3) 141 indicating that most tools correctly predict sites if they have 2 or more protein ligands. For Metal3D 142 at p=0.75 and p=0.9 as well as MIB with t=1.9 all sites that are correctly predicted to contain a metal 143 are sites with more than 2 protein ligands. The number of false negatives is reduced for all tools by 144 about 50 sites indicating that most tools do not predict these crystallographic artifacts that might 145 depend on additional coordinating residues from an adjacent molecule in the crystal. Of all tools,

146
Metal3D has the least false positives (1 FP at p=0.9) and the highest number of detected sites (110 147 at p=0.25). The single false positive at p=0.9 does not contain a zinc ion but is a calcium binding site 148 with three aspartates and one backbone carbonyl ligand ( Figure S8). After assessment of how many sites the tools predict, another crucial metric is the spatial precision of the predictions. For the correctly identified sites (TP) we measured the mean absolute 151 distance (MAD) between experimental and predicted position ( Figure 4) indicating that for half of the predictions the model predicts at or better than the grid resolution of 155 0.5 Å.

156
BioMetAll is not very precise with a MAD for correctly identified sites of 2.80 ± 1.30 Å. BioMetAll 157 predicts many possible locations per cluster with some of them much closer to the experimental 158 metal binding site than the cluster center. However, it does not provide any ranking of the probes 159 within a cluster and therefore the cluster center was used for the distance calculation. Metal1D 160 (MAD 2.06 ± 1.33 Å) which identifies more sites than BioMetAll is also more precise than BioMetAll. 161 MIB t=1.9 detects sites with high precision (MAD 0.77 ± 1.09 Å) but it relies on the existence of 162 homologous sites to align the found sites.   AlphaFill is able to place the metal because of higher sequence identity to 6BMS (64%) (Figure 5 B). 206 For ZDHHC23 Metal3D is able to place the metal with high confidence (MAD 0.75 Å for site 1 and 207 0.48 Å for site 2, p>0.99 ) based on the single input structure alone.  To investigate the capabilities for protein engineering we used mutational data for first and 216 second shell mutants of the active site residues in HCA2 with corresponding K values from a 217 colorimetric assay. 50 For most mutants no crystal structures are available so we used the structure 218 builder in the EVOLVE package to choose the most favorable rotamer for each single point mutation 219 based on the EVOLVE-ddg energy function with explicit zinc present (modeled using a dummy atom 220 approach 51 ). The analysis was run for every single mutant and the resulting probability maps from 221 Metal3D were analyzed. For the analysis we used the maximum predicted probability as a surrogate 222 to estimate relative changes in K . For mutants that decrease zinc binding drastically we observe a 223 drop in the maximum probability predicted by Metal3D ( Figure 6).The lowest probability mutants

Discussion
Metal3D predicts the probability distribution of zinc ions in protein crystal structures based on a 230 neuronal network model trained on natural protein environments. The model performs a segmen-231 tation task to determine if a specific point in the input space contains a zinc ion or not. Metal3D 232 predicts zinc ion sites with high accuracy making use of high resolution crystal structures (<2.5 Å).

233
The use of high resolution structures is necessary because at resolutions greater than the average 234 zinc ligand coordination distance (2.2 Å) the uncertainty of the zinc location noticeably increases 43 235 which would likely hamper the accuracy of the site prediction.

236
In contrast to currently available tools, for Metal3D, it is not necessary to filter the training 237 examples for certain coordination requirements (i.e only sites with at least 2 protein ligands).

238
The model thus sees the whole diversity of zinc ion sites present in the PDB. Such a model is 239 advantageous since metalloprotein design workflows require models to score the full continuum of    The HCA2 application demonstrates the utility of Metal3D for protein engineering (Figure 2). 330 The thermodynamics of metal ion binding to proteins are complicated 63 and there are currently 331 no high-throughput based experimental approaches that could generate a dataset large enough 332 to train a model directly on predicting K . The data we use were obtained from a colorimetric 333 assay with very high affinity of zinc in the picomolar range. 52-56 More recent studies using ITC 63 334 instead of the colorimetric assay indicate lower K values in the nanomolar range for wild type HCA2. 335 We can therefore only use the colorimetric data to estimate how well the model can recapitulate  We present two metal ion location predictors: Metal3D based on 3D convolutional neural networks 350 and Metal1D based on distances and amino acid propensity maps. Metal3D is the first tool with 351 sub-angstrom level precision to predict the location of metal ions in proteins that does not rely on 352 searching for structurally homologous proteins in a database. We therefore anticipate different 353 applications such as protein-function annotation of structures predicted using AlphaFold2, 67 inte-354 gration in protein design software and detection of cryptic metal binding sites that can be used 355 to engineer PPIs. Such cryptic metal ion binding sites in common drug targets could also be used 356 to engineer novel metallodrugs. Many of these applications will allow us to explore the still vastly 357 untapped potential of proteins as large multi-dentate metal ligands with programmable surfaces.

360
The input PDB files for training were obtained from the RCSB 68 protein databank (downloaded 5th 361 March 2021). We use a clustering of the structures at 30% sequence identity using mmseqs2 69 to 362 largely remove sequence and structural redundancy in the input dataset. For each cluster, we check 363 whether a zinc is contained in one of the structures, whether the resolution of these structures is 364 better than 2.5 Å, if the experimental method is x-ray crystallography and whether the structure 365 does not contain nucleic acids. If there are multiple structures fulfilling these criteria, the highest 366 resolution structure is used. All structures larger than 3000 residues are discarded. We always use 367 the first biological assembly to sample the training environments. The structures were stripped 368 of all exogenous ligands except for zinc . If there are multiple models with e.g. alternative residue 369 conformations for a given structure, the first one is used. For each biological assembly we used 370 the symmetry of the asymmetric unit to generate a protein structure that contains all neighboring 371 copies of the protein in the crystal such that metal sites at crystal contacts are fully coordinated.

372
The train/val/test split was performed based on sequence identity using easy-search in mm-

376
For the analysis, we always used the biological assembly and not the symmetry augmented 377 structure. For the specificity analysis with respect to other transition metals, clusters from the PDB 378 were randomly sampled to extract 25 biological assemblies per metal.

379
By default all zinc sites in the test and validation set were used for the analysis. Since some of 380 the sites might be affected by the crystallization conditions, we also created a subset of all sites 381 that contained at least 2 amino acid ligands to largely exclude crystallization artifacts. To analyze 382 metal ion selectivity, we selected sites with at least 3 unique protein ligands to only use biologically 383 significant sites with a high degree of metal preorganization as such sites should exhibit more 384 selectivity for specific metals compared to sites with only 2 unique protein-ligands.  The protein structure is analyzed using the BioPandas python library. 71 To identify coordinating 403 residues, a per residue score is assigned by performing a geometrical search from a reference 404 point, defined as the coordinate of the most probable metal binding atom, within a search radius 405 considered as roughly twice the typical distance between the metal ion and the binding atom of 406 amino acids in proteins (2.2 ± 0.2 Å as determined from LINK records). The search radius used was 407 5.5 Å in order to be able to take into account also deviations from the ideal coordination. In the case 408 of amino acids which present more than one putative coordinating atom, such as e.g. histidine, the acids in clusters (defined as the ones within the chosen threshold with respect to the highest-scored 418 one) based on distance. This is done using scipy.spatial.distance_matrix and grouping 419 together highest-scored amino acids closer than twice the search radius. For each cluster, a site 420 prediction is made as a weighted average between the coordinates of the reference point of each 421 amino acid, using as weighting factor the amino acid score. For isolated amino acids with a high 422 score (e.g. a single histidine) the same score is assigned to the closest reference point from another 423 amino acid, to be able to compute the position of the metal as before. Possible artifacts resulting 424 from this fictitious score are resolved in the final step of the prediction.
After the metal has been placed the likelihood of the putative sites can be assessed by performing 426 a geometrical search centered on the predicted metal coordinates (within 60% of the search radius, 427 i.e 3.3 Å) and a final score is now assigned to the site. The final score is assigned in the same way 428 as the amino acid scores based on the probability map, and has the advantage of being able to sort 429 the predicted metal sites based on their frequency in the training set. A cutoff parameter is used to 430 exclude sites with a probability lower than a certain threshold with respect to the highest-scored one.  Table S2). The  The voxelization was implemented using ray. 73 457 Model training 458 We used PyTorch 1.10 74 to train the model. All layers of the network are convolutional layers with 459 filter size 1.5 Å except for the fifth layer where a 8 Å filter is used to capture long range interactions. 460 We use zero padding to keep the size of the boxes constant. Models were trained on a workstation 461 with NVIDIA GTX3090 GPU and 32 CPU cores. Binary Cross Entropy 75 loss is used to train the model.

462
The rectified linear unit (ReLU) non-linearity is used except for the last layer which uses a sigmoid 463 function that yields the probability for zinc per voxel. A dropout layer (p = 0.1) was used between 464 the 5th and 6th layers. The network was trained using AdaDelta employing a stepped learning rate 465 (lr=0.5, =0.9), a batch size of 150, and 12 epochs to train. from the number of clusters. Using the binary metric we assessed how good the models are at 508 discovering sites and how much these predictions can be trusted.

509
In order to assess the quality of the predictions, we additionally compute for all the true positive 510 predictions the mean of the Euclidean distance between the true and predicted site (mean absolute 511 deviation MAD). For Metal1D, MIB, and BioMetAll, MAD was computed for all predictions above the 512 threshold within 5 Å of a true zinc site where ∑ predicted sites ≥ ∑ TP. This was done as some 513 tools predict the same site for different residue combinations and we wanted to assess the general 514 performance for all predicted sites above a certain cutoff and not just for the best predicted site 515 above the cutoff. For Metal3D the weighted average of all voxels above the cutoff was used.

Model assessment Metal3D
To evaluate the trained models we monitored loss and how 519 accurately the model predicts the metal density of the test set. We used a discretized version of 520 the Jaccard index setting each voxel either as 0 (no metal) or 1 (zinc present). We tested multiple 521 different decision boundaries (0.5, 0.6, 0.75, 0.9) and also compared a slightly smaller centered box 522 to remove any spurious density at the box edges, where the model has only incomplete information 523 to make predictions.

524
The Jaccard index is computed as where is the array of voxels with predicted probability above the decision boundary and is 526 the array of voxels with the true metal locations also discretized at the same probability threshold.

HCA2 mutants
The data for human carbonic anhydrase 2 (HCA2) mutants was extracted 528 from refs 52-56 and the crystal structure 2CBA 49,80 was used. The zinc was modeled using the zinc 529 cationic dummy model forcefield 51 and we verified that energy minimization produced the correct 530 coordination environment. The Richardson rotamer library 81 was used with the EVOLVE-ddG 531 energy function to compute the most stable rotamer for a given mutation with the zinc present.

532
The lowest-energy mutant was used for the prediction of the location of metals using Metal3D.