Abstract
Elucidating protein-ligand interaction is crucial for studying the function of proteins and compounds in an organism and critical for drug discovery and design. The problem of protein-ligand interaction is traditionally tackled by molecular docking and simulation based on physical forces and statistical potentials, which cannot effectively leverage cryo-EM data and existing protein structural information in the protein-ligand modeling process. In this work, we developed a deep learning bioinformatics pipeline (DeepProLigand) to predict protein-ligand interactions from cryo-EM density maps of proteins and ligands. DeepProLigand first uses a deep learning method to predict the structure of proteins from cryo-EM maps, which is averaged with a reference (template) structure of the proteins to produce a combined structure to add ligands. The ligands are then identified and added into the structure to generate a protein-ligand complex structure, which is further refined. The method based on the deep learning prediction and template-based modeling was blindly tested in the 2021 EMDataResource Ligand Challenge and was ranked first in fitting ligands to cryo-EM density maps. The results demonstrate the deep learning bioinformatics approach is a promising direction to model protein-ligand interaction on cryo-EM data using prior structural information. The source code, data, and instruction to reproduce the results are open sourced and available on GitHub repository : https://github.com/jianlin-cheng/DeepProLigand
1. Introduction
Proteins are a building block of life. They carry out many vital biological functions. Whether it is acting as an enzyme to accelerate the chemical reactions or as regulatory molecules binding to other molecules to activate their functions, the detailed characterization of proteins and their interaction with their binding partners (e.g., the natural substrates or drugs as ligands) is of great importance in understanding the overall biological systems, designing drugs and treating diseases. A fundamental objective of computational structural biology is to understand and model such molecular interactions of living systems in sufficient details so that the behavior of the system can be predicted or modified as desired. With an ultimate goal to characterize the thermodynamic and kinetic behavior of components and their interactions of living organisms, a near atomic resolution images of interacting molecules such as protein-ligand complexes is required to analyze and understand the physical and geometrical constrains of the molecules.
Cryo-EM, an acronym for cryogenic electron microscopy technique [1] is a revolutionary technology that enables the determination of 3D structure of macro-molecular complexes at atomic resolution. With the development of various techniques in the cryo-EM realm to generate high resolution maps, as seen in Figure 1, EMDataResource [2] has seen a surge in deposition of cryo-EM derived protein density maps which elucidate the protein and ligand interactions in the molecules. The EMDataResource 2021 Ligand Model Challenge [3] was hosted to rigorously benchmark the current methods for generating models using cryo-EM density maps to improve the prediction and validation of protein and ligand interaction, and to identify the metrics which are most suitable for comparing the fit of atomic coordinate models into the cryo-EM maps.
The growth of cryo-EM density maps and cryo-EM-derived protein structures. The statistics was obtained from EMDataResource [15], an unified data resource for 3-Dimension electron microscopy (3DEM) on November 22, 2022.
One of the most popular approaches on modeling the protein-ligand complexes is the molecular docking [4–8], which uses physics- or statistical potential-based molecular simulations to generate protein-ligand complex models and a scoring function for estimation of their binding affinities to rank them. However, even with significant research efforts, despite some success, the protein-ligand interaction prediction problem still remains unsolved because existing methods cannot leverage vast structural data effectively. Inspired by the recent success demonstrated by AlphaFold [9] which uses novel neural network architectures trained on thousands of experimental protein structures to improve the prediction accuracy of protein structures, we in this work have utilized a deep learning-based architecture as a core component of our modeling pipeline for the 2021 Ligand Model Challenge held from February 1 to April 1, 2021. We combine the deep learning-based protein structure prediction with the template-based protein-ligand interaction prediction to determine the structures of protein-ligand complexes. Based on the official results provided by the assessors of the challenge, our method performed best in fitting ligands to cryo-EM maps, measured across all targets, demonstrating the unique value of the novel deep learning bioinformatics approach for modeling protein-ligand interaction.
2. Materials and Methods
We attempt to solve the problem of protein-ligand interaction by using a set of bioinformatics methods that incorporates cryo-EM data and known structural information such as reference protein structures. In particular, we leverage the recent advance of applying deep learning to directly predict the structure of proteins from high-resolution cryo-EM density maps; a succinct review of the methods can be found on [10]. We utilize an existing deep learning-based tool as a key component of our model building pipeline (DeepProLigand), which predicts the 3D coordinates of protein complex using only cryo-EM density map as input. The protein complex model serves as start point for the downstream ligand positioning and model refinement tasks. The workflow as illustrated in Figure 2 demonstrates our approach to generate the structure of protein complex by incorporating a fully automatic deep learning-based method as its primary building block. The modeling pipeline of Figure 2 has three key steps described as follows:
The workflow of DeepProLigand generating protein complex structure from cryo-EM map and reference structure. The cryo-EM map (EMD-22898) illustrated in the workflow is of SARS-CoV-2 ORF3a ion channel in lipid nanodiscs [11].
2.1. Protein complex reconstruction from cryo-EM density maps and reference structures
Using DeepTracer [12] we first predict the 3D backbone coordinates of the protein complex directly from cryo-EM density map. DeepTracer uses a 3D U-Net architecture which is modified from the original 2D U-Net [13] architecture developed especially for biomedical image segmentation. The output from DeepTracer block is a predicted 3D backbone coordinate structure that has the carbon, carbon alpha, nitrogen and oxygen atoms in the Protein Data Bank Format (PDB), which is standardized by wwPDB [14]. The predicted structure reflects the conformation of the protein in the ligand binding mode. Because the reference structure of the protein (prior structure without cryo-EM information) is also provided, we use the structure alignment to combine them to generate a posterior structure, conceptually similar to combining the prior probability and likelihood to generate a posterior probability in the Bayesian reasoning. Specifically, in order to combine the reference structure and the predicted structure together in terms of geometrical alignment, we utilize the UCSF Chimera’s [16] matchmaker function to superimpose both structure together. Once the structures are superimposed, we save the superimposed structure relative to reference structure into a new PDB file. The new PDB file now contains the atoms of both reference and predicted structure into same geometrical space allowing us to average the coordinates of the corresponding backbone atoms and utilize the reference structure’s residue, chain labeling for all the shared components between the two structures. The side chains are added on top of the combined backbone structures by SCWRL4 [17] tool. Finally, a full-atom combined structure consisting of multiple chains is produced for the downstream processing. It is worth noting that our approach of generating protein complex structures is different from the traditional approach of fitting a reference structure into a cryo-EM density map.
Algorithm 1 depicts the pseudo code of averaging the backbone atoms’ coordinates of reference and predicted structures. We start by initializing an empty PDB file named average structure that follows the guideline of wwPDB [14]. For each residue of the reference structure, if the distance between the reference structure’s carbon alpha atom and any carbon alpha atom of predicted backbone structure in the 3D geometrical space is less than 1 Angstrom (Å), then all the backbone atoms’ coordinates of the residue in the particular reference structure are averaged with the predicted structure’s corresponding residue and saved in the average structure PDB file; otherwise, the coordinates of the residue in reference structure are simply saved in the average structure PDB file. We refer arithmetic mean, sum of collection of numbers divided by the count of numbers, as average throughout the paper. The distance threshold value we used is 1 Angstrom (Å), however, the threshold value for each chain can be modified as desired.
Here, let xr, yr, and zr be the coordinates for each carbon alpha residue of the reference structure and xp, yp, and zp be the coordinates for each carbon alpha residue of the predicted structure. With this notation, we follow the Algorithm 1 and generate new coordinates xavg, yavg, and xavg which are the average coordinates of both reference and predicted structure. The residues are averaged only if the distance between them, computed using Equation 1, is below a threshold value, we used threshold value as shown in Equation 2 for Target 202 and Target 203 of the 2021 EMDataResource Ligand Challenge. For Target 201, of the challenge, since most of the chains were turned into coils, we used threshold value of 0.3 Angstrom (Å) for chain C and 0.5 Angstrom (Å) for all other chains. The averaged coordinate structure is saved into standard PDB file format.
After the backbone atoms are computed using Algorithm 1, we utilize SCWRL4 [17] to add the side-chain conformation into the protein structure. The deep learning-based method utilized to predict the backbone atoms has high impact on determining the side-chains conformation as well, because high side chain accuracy is often achieved when the backbone prediction is accurate, as also demonstrated by AlphaFold [9].
2.2. Template-based Prediction of Protein-Ligand Interaction
After the protein structure that can accommodate ligands is generated using Algorithm 1, we utilize PyRosetta [18] to identify ligands and add them into the predicted structure by using the reference structure as template, as depicted in Algorithm 2. Since PyRosetta is a residue based tool, when a pose is created, all the atoms in a structure including ligand atoms are indexed by residue indices. Following Algorithm 2, we let res be each residue in reference structure that we want to check if it’s ligand. PyRosetta’s is_ligand function works by comparing the ligand to chemical component dictionary and returns a bool value (i.e, either True for ligand residue or False for non ligand residue) for each residue.
2.3. Refinement of Protein-Ligand Complex Model
After the prediction of protein-ligand complex structure using the approach outlined above, we further refined the predicted complex structure using Rosetta FastRelax. Relax does not perform extensive refinement and only searches the local low-energy backbone and side-chain conformations near the starting conformations by implementing rounds of packing and minimizing, with repulsive weight in the scoring function gradually increased from low value to normal value. The scoring function we used is ref2015_cst.wts, a default score function, which is repeated 5 times. Finally, after the refinement of the protein complex, we used UCSF Chimera’s Fit in Map function to do a rigid body optimization of the refined model. The 3D structure is rotated and aligned so that it fits to the density map. This refinement step is optional. During the blind experiment of the 2021 EMDataResource Ligand Challenge, we submitted both, an unrefined model and a refined model for each target.
2.4. Target cryo-EM Density Maps of 2021 Ligand Challenge
We blindly tested the protein-ligand modeling pipeline DeepProLigand on three targets that were released as 2021 EMDataResource Ligand Challenge targets from February to April 2021. The next section elaborates the three targets and the experimental setting used for each target.
2.4.1. Target 201: Escherichia coli beta-galactosidase
β-Galactosidase [19] target with atomic resolution of 1.9 Angstrom (Å) contains protein Beta-galactosidase, magnesium ion, sodium ion, water and 2-phenylethyl 1-thio-beta-D-galactopyranoside (PTQ) as ligand. The EMDB id of the target in EMDataResource is EMD-7770. We predicted the 3D structure of the complex using the workflow of DeepProLigand as highlighted in Figure 2 and during averaging of the structure, we initialize 0.3 Angstrom (Å) distance threshold for chain C and 0.5 Angstrom (Å) distance threshold for all other chains of the complex by re-initializing the threshold value of Equation 2. The reason for threshold of 0.3 Å in chain C was because most of the chains were turned into coils/turns with threshold of 0.5 Å. The ligand PTQ was appended using the Algorithm 2. Figure 3 shows the map-model overlay of cryo-EM density map EMD-7770 and our reconstructed protein structure model.
Target 201 (EMD-7770) map-model overlay at the recommended contour 0.52 (3.3 σ) with T0201EM0004_1 (ours)
2.4.2. Target 202: SARS-CoV-2 RNA-dependent RNA polymerase
The nsp12-nsp7-nsp8 complex bound to the template-primer RNA and triphosphate form of Remdesivir(RTP) [20] target with atomic resolution of 2.5 Angstrom (Å) contains RNA-directed RNA polymerase, Non-structural protein 8, Non-structural protein 7, Primer, Templete, ZINC ION, PYROPHOSPHATE 2-, MAGNESIUM ION, water, and [(2 R,3 S,4 R,5 R)-5-(4-azanylpyrrolo[2,1-f][1,2,4]triazin-7-y)-5-cyano-3,4-bis(oxidanyl)oxolan-2-yl]methyl dihydrogen phosphate as ligand. The EMDB id of the target in EMDataRe-source is EMD-30210. We predicted the 3D structure of the complex using the workflow of DeepProLigand as highlighted on Figure 2, and during averaging of the structure, we used 1 Angstrom (Å) distance threshold for all chains of the complex. The ligand F86 (remdesivir, covalent inhibitor) was appended using Algorithm 2. Figure 4 shows the map-model overlay of cryo-EM density map EM-30210 and our reconstructed protein structure model.
Target 202 (EMD-30210) map-model overlay at the recommended contour 0.058 (4.3 σ) with T0202EM004_1 (ours)
2.4.3. Target 203: SARS-CoV-2 protein 3a in lipid nanodiscs
SARS-CoV-2 3a ion channel in lipid nanodiscs [21] target with atomic resolution of 2.08 Angstrom (Å) contains ORF3a protein, Apolipoprotein A-I, water and 1,2-Dioleoyl-sn-glycero-3-phosphoethanolamine as ligand. The EMDB id of the target in EMDataResource is EMD-22898. We predicted the 3D structure of the complex using the workflow of DeepProLigand as highlighted in Figure 2, and during averaging of the structure, we used 1 Angstrom (Å) distance threshold for all chains of the complex. The ligand PEE was appended using the Algorithm 2. Figure 5 shows the map-model overlay of cryo-EM density map EMD-22898 and our reconstructed protein structure model.
Target 203 (EMD-22898) map-model overlay at the recommended contour 0.7 (10.3 σ) with T0203EM004_1 (ours)
3. Results
The analysis of the models in this section is based on the official results provided by the organizers of the 2021 Ligand Model Challenge. The fit to map for ligand is assessed by the Q-score [22] and the Z-scores. The Q-score measures how similar map values around an atom are to a Gaussian-like function we would see if the atom is well resolved. Q-score is calculated as a correlation between two vectors: u, which contains map values at points around the atom, and v, which contains values obtained from the reference Gaussian.
We used Q-score to compare the map to model fit for all the models that were submitted to the challenge. Table 1, shows the Q-score of ligand for all the models submitted for Target 201, our model is highlighted in bold for scrutiny. Figure 6 shows the ligand (PTQ)’s binding pose and orientation in our best predicted model, T0201EM004_1. Ligand PTQ binds to all four chains of Target 201, resulting in four binding sites for the ligand. We have visualized three binding locations for the ligand with its binding pose and orientations in Figure 6. Table 2 shows the Q-score of ligand for Target 202, similar to Target 201, our model is highlighted in bold for scrutiny. Figure 7 shows ligand (F86)’s binding pose and orientation in our best predicted model, T0201EM004_1. Ligand F86 binds to only one location in Target 202. We have visualized the binding location for the ligand with its binding pose and orientations in Figure 7.
Evaluation of Target 201: Escherichia coli beta-galactosidase on Q-score for all the models submitted in the 2021 Ligand Model Challenge
(A) T0201EM004_1 (ours) docked by Target 201 (EMD-7770) and visualized with electrostatic potential surface generated in UCSF Chimera. (B) Ligand PTQ, image extracted from Protein Data Bank (PDB). (C) Protein-ligand interactions in T0201EM004_1 (ours) model. Chains are colored differently (chain A: blue, chain B: pink, chain C: green and chain D: golden). The ligand is labeled with its atom names as well as the ligand name (PTQ). For chain D: golden and chain C: green, we have labeled the chain residue names for understanding protein-ligand interaction better.
Evaluation of Target 202: SARS-CoV-2 RNA-dependent RNA polymerase on Q-score for all the models submitted in the 2021 Ligand Model Challenge
(A) T0202EM0004_1 (ours) docked by Target 202 (EMD-30210) and visualized with electrostatic potential surface generated in UCSF Chimera. (B) Ligand F86, image extracted from Protein Data Bank (PDB). (C) Protein-ligand interactions in T0202EM004_1 (ours) model. Chains are colored differently (chain A: blue, chain B: orange, chain C: green, chain P: yellow, and chain T: teal). The ligand is labeled with its atom names as well as the ligand name (F86)
Table 3 shows the Q-score of ligand for Target 203. Similar to Target 201 and 202, our model is highlighted in bold for scrutiny in Table 3. Figure 8 shows the ligand (PEE)’s binding pose and orientation in our best predicted model, T0201EM004_1. Ligand PEE binds to two location in Target 203. We have visualized the binding location for the ligand with its binding pose and orientation in Figure 8.
Evaluation of Target 203: SARS-CoV-2 ORF3a putative ion channel in nanodisc on Q-score score for all the models submitted in the 2021 Ligand Model Challenge
(A) T0203EM0004_1 (ours) docked by Target 203 (EMD-22898) and visualized with electrostatic potential surface generated in UCSF Chimera. (B) Ligand PEE, image extracted from Protein Data Bank (PDB). (C) Protein-ligand interactions in T0203EM0004_1 (our) model. Chains are colored differently (chain A: blue, chain B: pink, chain C: green, and chain D: golden). The ligand are labeled with their atom names as well as the ligands name (PEE)
Figure 9 shows cumulative Z-scores on Q-scores of 17 groups participating in the 2021 Ligand Model Challenge. Our DeepProLigand predictor (EM004) performs best overall on all three targets. Specifically, our protein-ligand model is ranked first for Target 202, second for Target 203, and in the middle for Target 201 as shown in the Z-scores on Q-scores Figure 9. The Q-scores of our best model for the three targets are shown in Table 4. Even though there are too few targets to draw a definite conclusion, the good results indicate that the deep learning structure prediction in conjunction with the template reference structure is able to build a good protein structure framework to accommodate ligands and the template-based protein-ligand prediction can assemble the ligands with the protein structure well for some targets. Incorporating deep learning approach in modeling enables us to predict the protein structure directly from cryo-EM map within minutes, making the approach highly useful on both prediction accuracy and time.
Z-scores on Q-scores for ligand of all the models submitted to 2021 Ligand Model Challenge. The pointed arrow represents our model.
Q-score of our best model (EM004_1) for three targets
However, we also notice the limitation of our approach in terms of the geometric quality of the atoms in the predicted protein structure. Particularly, there are some atom-atom clashes in the models, which may be caused by the violations of some geometric constraints of atom-atom distances in the protein structure predicted by the deep learning as well as in the averaging of the coordinates of the predicted structure and the reference structure. The violations of geometric and stereochemical restraints are not fixed by the current refinement protocol in the prediction pipeline. The refinement protocol even introduced some new clashes into the model. As AlphaFold demonstrates that the well-trained sophisticated deep learning architecture can accurately capture the geometric restraints of atoms and bonds in protein structure by predicting high-quality protein structures of atomic resolution that are highly similar to natural protein structures, more advanced deep learning architectures can be developed to predict high-quality protein structures compatible with geometric and stereochemical restraints of proteins from cryo-EM density maps and reference structures.
4. Conclusion and Future Work
In this work, we demonstrate that the deep learning prediction of protein structure from cryo-EM maps can generate good protein structures for constructing protein-ligand complexes and the template-based protein-ligand interaction prediction can fit ligands well into the predicted protein structures according to the outstanding performance of our protein-ligand modeling pipeline. It is also worth noting that our method is fully automatic and does not involve any manual tweaking of the models to improve the scores. As discussed before, the current protein-ligand prediction pipeline cannot resolve some violations of some geometric and stereochemical restraints of atoms in protein structures. In the near future, we plan to develop advanced end-to-end deep learning architectures similar to some components in AlphaFold to predict better protein structures from cryo-EM maps and reference structures. Moreover, we plan to design 3D-equivariant deep learning architectures like SE(3)-equivariant Transformer network [23–26] to tackle the problem of geometric constraints which are not addressed by current methods. Finally, an end-to-end direct deep learning prediction of the structure of protein-ligand complexes from cryo-EM density maps, reference structures and ligand information to fully automate all the steps of the entire pipeline in this work is also an interesting direction to pursue. We believe the application of deep learning approach to prediction of 3D structures of protein-ligand complexes leveraging cryo-EM and other related data is a promising avenue to accelerate the advancement of the protein-ligand interaction study [27]. With the proliferation of cryo-EM maps being deposited in EMDataResource database, the use of deep learning-based methods can help to determine the structure of the protein-ligand complexes rapidly and ultimately helps to expedite the drug discovery process.
Author Contributions
Conceptualization, JC; methodology, JC and NG; experiments and data collection, NG; data analysis, NG and JC; writing—original draft preparation, NG and JC; writing— review and editing, JC and NG; visualization, NG; supervision, JC; project administration, JC and NG; funding acquisition, JC. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by two NIH grants (R01GM146340 and R01GM093123), Department of Energy grants (DE-AR0001213, DE-SC0020400 and DE-SC0021303), and two NSF grants (DBI1759934 and IIS1763246).
Data Availability Statement
The cryo-EM density maps are available in the EMDB website [2] with id: EMD-7770, EMD-30210, and EMD-22898. The submitted models metadata and further detailed information on 2021 Ligand Model Challenge can be accessed from the official challenge web-page [3]. The protein complex generated from this method, DeepProLigand, and the software program are released, and can be accessed through the GitHub repository: https://github.com/jianlin-cheng/DeepProLigand
Conflicts of Interest
The authors declare no conflict of interest.
Acknowledgments
We thank the organizers and assessors of the 2021 EMDataResource Ligand Challenge for providing the data for this research and anonymous reviewers for their valuable suggestions.
Footnotes
Tables 1, 2 and 3 are revised. Source codes are released.
Abbreviations
The following abbreviations are used in this manuscript:
- cryo-EM
- Cryogenic electron microscopy