Abstract
Protein-protein interactions play a crucial role in almost all cellular processes. Identifying interacting proteins reveals insight into living organisms and yields novel drug targets for disease treatment. Here, we present a publicly available, automated pipeline to predict genome-wide protein-protein interactions and produce high-quality multimeric structural models.
Application of our method to the Human and Yeast genomes yield protein-protein interaction networks similar in quality to common experimental methods. We identified and modeled Human proteins likely to interact with the papain-like protease of SARS-CoV2’s non-structural protein 3 (Nsp3). We also produced models of SARS-CoV2’s spike protein (S) interacting with myelin-oligodendrocyte glycoprotein receptor (MOG) and dipeptidyl peptidase-4 (DPP4). The presented method is capable of confidently identifying interactions while providing high-quality multimeric structural models for experimental validation.
The interactome modeling pipeline is available at usegalaxy.org.
Introduction
Obtaining a complete map of interacting proteins is crucial to decipher the inner workings of living organisms. Among many other roles, proteins act in dynamic collaboration to fulfill biological functions by catalyzing chemical processes. Commonly, interactions are elucidated through a variety of experimental methods (Shoemaker, 2007, Zhou, 2016) which are capable of evaluating an ever-larger number of putative protein pairs. Unfortunately, the overlap between these methods is often limited which indicates either a high false positive rate or a low coverage. Often 40 to 90% of the detected interactions do not overlap between different methods (Mering, 2004, Rao, 2014). Also, high throughput methods do not provide structural insights into the formed protein-protein complex. More reliable methods such as crystallography and NMR spectroscopy do yield structural information but are labor intensive and as such only applicable to a limited number of proteins. In a recent study we demonstrated that the gap between low and high throughput methods can be bridged by identifying distantly related protein-protein homologues with similar protein-protein interfaces (Gong, 2021). Application of the SPRING method (Guerler, 2013) to Escherichia coli competitively identified protein-protein interactions while producing accurate multimeric protein structure models of which 39 by now have been confirmed in high-resolution experiments. Other studies applied our method to the minimal synthetic genome syn3.0 (Zhang, 2021) and the mouse genome (Li, 2016). In the present study we describe how we implemented our pipeline on Galaxy (Afgan, 2018), a web-based computational workbench used by many scientists across the world to analyze large data sets.
This allows scientists to reproduce, share and embed the resulting interactome networks within their own analysis pipelines. Given a set of query sequences and a list of known protein structures, the pipeline employs SPRING with HHsearch (Steinegger, 2019), and TMalign (Zhang, 2005) to detect and structurally model protein-protein interactions. We validate the pipeline’s performance by comparing the resulting Human and Yeast protein networks with experimental findings. Similar to the results for Escherichia coli, the method competitively resolves Human and Yeast protein-protein interaction networks. As novel targets, we identified Human proteins likely to bind the papain-like protease of SARS-CoV2’s non-structural protein 3 (Nsp3). We also obtained models for SARS-CoV2’s spike protein (S) in complex formation with myelin-oligodendrocyte glycoprotein (MOG) and dipeptidyl peptidase-4 (DPP4). Some of the detected interactions have already been experimentally confirmed in recent literature, others provide novel insights into the pathology of SARS-CoV2. Notably, the interaction with DPP4 has been suggested to cause a higher mortality rate of diabetics contracting SARS-CoV2 (Valencia, 2020) while the MOG receptor is associated with the MOG antibody disease which relapses in SARS-CoV2 patients (Woodhall, 2020).
Methods
Protein-Protein Interaction Analysis Pipeline
We present a Galaxy pipeline to predict and structurally model protein-protein interactions on genomic scale. The pipeline takes the following inputs:
An individual file or a pair of files containing multiple FASTA entries of protein coding sequences. The pipeline will attempt to identify protein-protein interactions within the set of query sequences.
Text file containing the list of all Protein Data Bank (PDB) (Berman, 2000) entry identifiers to be employed as a multimeric template library. This step can be skipped if the library has already been constructed.
PDB70 threading library files as provided by the developers of HHsearch. These files are used to perform the single-chain threading and can be obtained from http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/.
The following outputs are generated:
Tabular file containing all identified interactions with their corresponding templates and Zcom scores (Wong, 2021).
Tabular file containing the details of the produced multimeric structural models and the corresponding model properties, i.e. SM-score, TM-score, a knowledge-based contact energy term Econtact and the fraction of inter chain clashes.
Collection containing dimeric structural models for each interaction, including the structures of the identified templates from the PDB.
Bar chart displaying the prediction accuracy in comparison to experimental results derived from the BioGRID database.
Available Workflows
We build two analysis workflows, one for intra-genome and another for inter-genome protein-protein interaction prediction (see Table 1). The source code of the entire pipeline, including data preparation and interaction prediction logic is written in Python 3 and publicly available at https://github.com/guerler/springsuite under the GNU License. All methods can be executed locally or using the Galaxy web interface on usegalaxy.org.
Data Preparation
The presented pipeline utilizes all protein-protein interfaces available in the PDB as a template library for interface homology detection (Figure 1). The data preparation starts by using the DBkit tool in Galaxy to download all PDB entries and store them as a ffindex/ffdata database pair. As of November 29th, 2020 this amounted to 170,860 files. Then the SPRING Cross tool is applied which scans each PDB entry for protein-protein interfaces and stores the corresponding interacting PDB chain identifiers in a 2-column lookup table as a pairwise index of all interactions. In more detail, the SPRING Cross method proceeds by using the PDB REMARK 350 entries to build all bio units available in a given PDB entry. Then all C-alpha atom distances between two separate chains within the same bio unit are determined. If more than five distances below 10Å are detected for a pair of PDB chains, the corresponding PDB chain identifiers are deemed as interacting and added as a new row to the resulting 2-column lookup table. This yields a complete set of 988,784 interacting PDB chain identifier pairs contained in the PDB which we will use as a multimeric template library.
In a consecutive step the SPRING Map tool is applied which uses PSI-BLAST (Altschul, 1997) to detect close homologues of the PDB70 database for each PDB chain identifier listed in the columns of the lookup table. The identifiers of matching PDB70 entries are added in two additional columns to the lookup table. This allows us to apply HHsearch on a non-redundant subset of the PDB, containing entries with less than 70% sequence identity to each other. Although possible, expanding the monomeric threading database by including every PDB chain would significantly impact the database preparation time without improving the overall prediction performance. We used the PDB70 database issued on November 18, 2020 containing 58,900 entries. If a PSI-BLAST E-value equal to zero is used, 257,698 interaction frameworks which exactly match the sequences in the PDB70 are detected. With an E-value threshold of 0.001, the resulting 4-column lookup table contained 900,772 interaction frameworks suitable for the monomers available in the PDB70 database.
Interaction Prediction
The pipeline’s interaction prediction logic uses SPRING with HHsearch and TMalign, and was designed to exploit the redundancy of available protein-protein interfaces in order to predict and model novel protein tertiary structures.
Initially each query sequence Q is threaded by HHsearch against the PDB70 monomeric template library to identify a set of putative templates (Ti, i=1,2,…) each associated with a Z-score (Zi). The Z-score is defined as the number of standard deviations by which the raw alignment score differs from its mean. A higher Z-score indicates a higher significance and usually corresponds to a better alignment.
Considering all possible target sequence pairs, the SPRING Min-Z tool uses the previously described 4-column lookup table to select interaction frameworks which are shared by the monomeric templates of a query pair. The Z-score of the framework is defined as Zcom which is the smaller of the two monomeric Z-scores. A more detailed description of this algorithm is provided in Gong et al. 2021 and Guerler et al. 2013.
Interaction Validation
The accuracy of predicted protein-protein interactions is evaluated using the SPRING MCC tool. This tool compares the set of interactions from SPRING with interactions obtained from experimental methods contained in the Biological General Repository for Interaction Data sets (BioGRID) (Oughtred, 2020). BioGRID is an open access database that contains protein interactions curated from primary biomedical literature for all major model organism species and Humans. The SPRING MCC tool accesses the BioGrid Tab 3.0 format columns 24 and 27, containing the UniProt (UniProt Consortium, 2021) accession identifiers of interacting protein pairs. The method only operates on interactions identified for sequences which are available in the UniProt database.
Initially, the SPRING MCC tool produces a `negative` data set of non-interacting protein pairs by randomly sampling protein-protein interaction pairs from the set of query protein sequences. If a UniProt localization file is provided, the non-interacting pairs can be determined by sampling protein sequences from different subcellular regions. This approach can reduce the false-negative rate of the resulting negative data set.
Subsequently, the protein-protein interaction sets identified by each experimental method are considered to be truly interacting, constituting the `positive` data sets for the cross-validation process. Each method is compared to all other methods using the positive data sets and a negative data set of equal size. The resulting Matthew’s correlation coefficients (MCC) are plotted using the Matplotlib library (Hunter, 2007). An example of such a plot is shown in Figure 2, displaying the results for the Human and Yeast interactome validation. The legend lists the experimental methods used to determine the corresponding positive data set.
Structural Modeling
If a pair of proteins, Chain A and Chain B, is deemed to potentially interact e.g. Zcom > 25, the complex structure is constructed by structurally aligning the top-ranked monomer templates of Chain A and Chain B to all putative interacting frameworks using the SPRING Model tool which utilizes TM-align. The structural alignment is built on the subset of interface residues.
The resulting models are evaluated by the recently established SPRING model score (Vangaveti, 2020): where TM-score is the smaller TM-score returned by TM-align when aligning the top-ranked monomer models of Chain A and Chain B to the interaction framework; Econtact is a residue-specific, atomic contact potential derived from 3,897 non-redundant structure interfaces from the PDB using the formula of RW (Zhang, 2010). The weight parameter w0 is set to 0.01 through a training set of protein complexes to maximize the modeling accuracy of the interface structures. The final model is evaluated for clashes and removed if more than 10% of the resulting C-alpha atom contacts share a distance of less than 5Å between the interacting pair of protein structures.
Results
Performance Validation with Human and Yeast Interactomes
The pipeline’s performance is validated on 20,610 raw protein coding gene sequences from the Human Reference Genome (UP000005640) of the UniProt database. This process evaluates ~212 million possible pairs to identify the set of interacting protein-protein pairs. Each interaction is ranked by the Zcom score and Matthew’s correlation coefficient (MCC) is determined with regard to a negative data set of non-interacting protein pairs produced by the SPRING MCC tool and positive data sets derived from each experimental method. The negative data set has been sampled to contain proteins from different subcellular regions. Figure 2 displays the cross-validation performance results in comparison to ten experimental methods available in the BioGRID database. Note that we applied SPRING on the raw protein coding sequences without separating the individual proteins using the CDS record provided by GenBank (Benson, 2014). In total the 20,610 protein coding genes encode for about 75,776 individual proteins.
We repeated the same experiment using the Yeast genome (UP000002311) to identify protein-protein interaction networks. In total 6,045 protein coding genes were parsed through the pipeline evaluating ~18 million possible protein-protein interactions using the public Galaxy instance at https://usegalaxy.org. The results are shown in the right panel of Figure 2.
A more detailed analysis regarding the prediction performance of the presented pipeline versus experimental methods has been recently published for the Escherichia coli genome (Gong, 2021). The pipeline predicted several protein complex structures which were later experimentally verified by crystallography.
For all three genomes, our pipeline was able to implicitly identify individual protein sequences and achieve an overall performance which is comparable if not better than existing experimental methods.
SARS-CoV2 protease (Nsp3) and Ubiquitin
We next applied the Galaxy pipeline to the genome of SARS-CoV2 which causes a novel severe acute respiratory syndrome and has been declared a pandemic (Naqvi, 2020). The SARS-CoV2 genome contains 13 to 15 open reading frames with ~30 thousand nucleotides, including 11 protein-coding genes. Our pipeline identified Human substrates for the papain-like protease of SARS-CoV2 which is part of the non-structural protein 3 (Nsp3) (see Fig. 3).
Table 2 shows a list of the highest ranking fifteen substrates with matching multimeric templates and model quality attributes i.e. SM-score, TM-score, Econtact and Zcom. The two highest ranking interactions were identified for ISG15 (SM-score=1.11) and ANKUB (SM-score=1.10). ISG15 has recently been experimentally confirmed as a substrate (Shin, 2020) and ANKUB was suggested in a computational cleavage enrichment study (Prescott, 2020).
SARS-CoV2 Spike protein (S) and Myelin-oligodendrocyte Glycoprotein
We also identified Human proteins interacting with SARS-CoV2’s spike protein (S). The top-ranking interaction was found for angiotensin (ACE2/ACE), which is widely known to be the primary receptor for SARS-CoV2 (Peng, 2020, Zhou, 2020).
The second highest ranking model was detected for the interaction with the myelin-oligodendrocyte glycoprotein (MOG, see Fig. 4). MOG is a protein located on the surface of myelin sheaths in the central nervous system (Kezuka, 2018). Our pipeline modeled the monomeric structure of MOG with the highest ranking homologue in the PDB70 database which is PDB entry 4PFE (Eshaghi, 2015) at a Z-score of 102.2. We compared the resulting monomeric model with the model provided by Mesleh et al. in 2002. Both models resolve MOG as a beta-barrel and share significant similarity at a TM-score of 0.70. Additionally several suitable multimeric template frameworks were identified. The corresponding PDB entries are 7C8V (Li, 2020), 6XC2 (Yuan, 2020), 6XC4 (Yuan, 2020), 7BZ5 (Wu, 2020) and 7C01 (Shi, 2020). All of these structures, except 7C8V, were crystalized with a potent neutralizing antibody of SARS-CoV2. Table 3 shows the identified template frameworks and the resulting model scores.
The results indicate two distinct putative binding modes which may occur in tandem (see Fig. 4C).
The MOG receptor is associated with MOG antibody disease (MOGAD) a neuro-inflammatory condition that may cause inflammation of the optic nerve, the spinal cord and brain. Recent research has shown that SARS-CoV2 does trigger a relapse of MOGAD (Woodhall, 2020).
SARS-CoV2 Spike protein (S) and Dipeptidyl Peptidase-4
High-scoring models were also generated for dipeptidyl peptidase-4 (DPP4, see Fig. 5), confirming the computational modeling results presented by Li et al. 2020. DPP4 is a cell surface glycoprotein receptor involved in T-cell activation and assumed to play a role in cell adhesion, migration and tube formation (Durinx, 2000). Additionally, inhibiting DPP4 prevents glucagon release while increasing insulin secretion to decrease blood glucose levels (McIntosh, 2005).
DPP4 is known to interact with MERS-CoV (Wang, 2013). The highest scoring template frameworks for DPP4 were PDB entry 4KR0 (Lu, 2013) with a Zcom score of 216.30 and PDB entry 4L72 (Wang, 2013) with 213.4. The resulting dimeric models are very similar to each other.
The multimeric template matched the individual models with a TM-score of 0.62, a mean contact energy of −6.7 and SM-score of 0.69.
Conclusion
Accurate identification of protein-protein interactions is essential to decipher cellular processes and detect novel drug targets. In the present work we implemented a Galaxy pipeline using the SPRING method which detects and structurally models protein-protein interactions by identifying distantly related protein complex structures with similar protein-protein interfaces.
The presented pipeline yields insights into the biochemical activity of SARS-CoV2 by identifying distant homologues with similar binding interfaces to Human proteins. For the papain-like protease of the non-structural protein 3 (Nsp3), we detected several ubiquitin-like substrates of which some have been experimentally confirmed. The method produced a top-ranking model for SARS-CoV’s spike protein (S) and dipeptidyl peptidase-4 (DPP4) in alignment with existing literature. Our method produced novel complex models between the S protein and myelin-oligodendrocyte glycoprotein (MOG). Here two top-ranking binding modes were produced. The prediction confidence relies on the accuracy of the homology match between templates, the structural fit and a knowledge-based contact potential, providing likely binding modes and interaction partners for further investigation. Only additional experimental validation can determine which or if any of the predicted binding modes occur in nature.
A limitation of our method is that it may produce high-confidence models between proteins which are localized in different subcellular regions. Existing literature has shown that such cross-interactions occur in a significant number of cases. In the present work we avoid filtering predicted protein interaction pairs by their corresponding subcellular locations since this would bias the obtained Matthew’s correlation coefficients. Identifying an accurate set of truly non-interacting protein pairs is critical and particularly challenging for the evaluation of protein-protein interactions. Randomly sampling protein pairs across a genome may lead to the inclusion of interacting protein pairs. A more accurate method is to sample non-interacting sets by pairing proteins from different subcellular regions as presented here. Yet another common suggestion is to exclude homologue protein pairs from the non-interacting set all together in order to avoid the inclusion of interacting pairs. This however is not an option due to the nature of the presented method which relies on homology detection to predict protein-protein interactions.
Another limitation is that homology modeling does rely on experimental templates. All of the fifteen most confident models derived for Human proteins interacting with Nsp3’s protease rely on four crystallographic complex structures.
This pipeline demonstrates the ability to detect interactome networks for a range of organisms. The increasing number of resolved co-crystal structures in the PDB, will continually improve the model quality and coverage over time (Chandonia and Brenner, 2006). Since the pipeline includes all data preparation steps no manual adjustment is required once new data has been published to the PDB. Galaxy enables users to employ the pipeline within their own methodologies and add or modify steps as required using Galaxy’s web-based workflow editor. Users are now able to reproduce and share the resulting interactome networks. The present contribution expands the repertoire of Galaxy tools to structural modeling methodologies, making them available for a large number of users. Recent advances in protein structure prediction and modeling (Kryshtafovych, 2019) complements existing sequence analysis tools and provides novel targets for drug discovery and elucidating biochemical processes through structural insights.
Financial disclosure
This work is funded by NIH Grants U41 HG006620 and R01 AI134384 as well as by NSF ABI Grant 1661497 to AN and MS. Usegalaxy.eu is supported by the German Federal Ministry of Education and Research grants 031L0101C and de.NBI-epi to BG. Galaxy integration is supported by NIH grant R01 AI134384 to AN. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests statement
AG, DB, NC and AN are founders of and hold equity in GalaxyWorks, LLC. The results of the study discussed in this publication could affect the value of GalaxyWorks, LLC.
Acknowledgments
The authors are grateful to the broader Galaxy community for their support and software development efforts. Thanks to Jessica Hamby for proofreading the manuscript.