Abstract
The crucial role of protein interactions and the difficulty of characterising them experimentally strongly motivate the development of computational approaches for structural prediction. Even when protein-protein docking samples correct models, current scoring functions struggle to discriminate them from incorrect decoys. The previous incorporation of conservation and coevolution information has shown promise for improving protein-protein scoring. Here, we present a novel strategy to integrate atomic-level evolutionary information into different types of scoring functions to improve their docking discrimination.
We applied this general strategy to our residue-level statistical potential from InterEvScore and to two atomic-level scores, SOAP-PP and Rosetta interface score (ISC). Including evolutionary information from as few as ten homologous sequences improves the top 10 success rates of these individual scores by 6.5, 6 and 13.5 percentage points, respectively, on a large benchmark of 752 docking cases. The best individual homology-enriched score reaches a top 10 success rate of 34.4%. A consensus approach based on the complementarity between different homology-enriched scores further increases the top 10 success rate to 40%.
All data used for benchmarking and scoring results, as well as pipelining scripts, are available at http://biodev.cea.fr/interevol/interevdata/
1 INTRODUCTION
Proteins are key actors in a great number of cellular functions and often work in collaboration with others, thereby forming interaction networks. Knowledge of the detailed 3D structure of protein-protein interfaces can help to better understand the mechanisms they are involved in. Difficulties in the experimental determination of protein assembly structures have prompted the development of in silico prediction strategies such as molecular docking. When no homologous interface structure can be identified and used as a template, free docking is used instead, involving a systematic search where many interface conformations called decoys are sampled (Huang, 2014; Porter, et al., 2019). These decoys are then scored according to properties such as interface physics, chemistry, and statistics (Huang, 2015; Moal, et al., 2013). Guided docking approaches integrating complementary sources of information are also becoming increasingly popular (Koukos and Bonvin, 2019).
Over time, protein interfaces are subjected to evolutionary pressure to maintain functional interactions. Thus, protein interfaces tend to be more conserved than other regions on their surface (Mintseris and Weng, 2005; Teichmann, 2002) and signs of coevolution can be detected at protein interfaces, where potentially disrupting mutations are compensated for by mutations in contacting positions on the protein partner. These phenomena of conservation and coevolution can provide useful information for the analysis and prediction of 3D interface structures (Andreani, et al., 2020). For example, evolutionary information is at the heart of increasingly popular covariation-based approaches, such as statistical coupling analysis (SCA) (Socolich, et al., 2005) or direct coupling analysis (DCA) (Morcos, et al., 2011), for structural proximity prediction of residues based on multiple sequence alignments (MSAs). These approaches can be used to guide protein folding or to supplement predictions of macromolecular interactions (Cocco, et al., 2018; Cong, et al., 2019; Simkovic, et al., 2017). The vast majority of protein interaction site predictors successfully use evolutionary information, be it through sequence conservation, sequence co-evolution, or homologous structures (Andreani, et al., 2020).
Evolutionary information can also be especially useful to guide molecular docking (Geng, et al., 2019). The InterEvDock2 server implements a docking pipeline that uses evolutionary information (Quignot, et al., 2018; Yu, et al., 2016). It takes advantage of the spherical Fourier-based rigid-body docking programme FRODOCK2.1 (Ramírez-Aportela, et al., 2016) for the sampling step and returns a set of the ten most likely interfaces based on a consensus between three different scores: FRODOCK2.1’s mostly physics-based score, SOAP-PP’s atomic statistical potential (Dong, et al., 2013) and InterEvScore (Andreani, et al., 2013). InterEvScore extracts co-evolutionary information from joint multiple sequence alignments of the binding partners (called coMSAs), but unlike covariation-based approaches such as the DCA cited above, InterEvScore needs only a small number of homologous sequences to improve discrimination between correct and incorrect decoys, by combining coMSAs with a multi-body residue-level statistical potential. As seen in the benchmarking of InterEvDock2, InterEvScore presents a high complementarity with SOAP-PP (Quignot, et al., 2018). As both scores are based on statistical potentials but SOAP-PP has an atomic level of detail, we hypothesised that a score integrating evolutionary information at an atomic scale might pick up on finer properties to better distinguish near-natives from the rest of the decoys.
In InterEvScore, evolutionary information is given implicitly at residue level through coMSAs and combined with a coarse-grained statistical potential. A major challenge in extending evolutionary information to an atomic level of detail is finding a suitable way of representing residue-scale information from coMSAs at the atomic level. Here, we present a novel strategy to couple evolutionary information with atomic scores to improve decoy discrimination. We reconstruct an equivalent, hypothetical interfacial atomic contact network for each interface decoy and each pair of homologs present in the coMSAs, by using a threading-like strategy to generate explicit backbone and side-chain coordinates. These models can, in turn, be scored with non-evolutionary atomic-resolution scoring functions such as SOAP-PP (Dong, et al., 2013) or Rosetta interface score (ISC) (Chaudhury, et al., 2011; Gray, et al., 2003).
Here, we show that including explicit evolutionary information improves the top 10 success rate of SOAP-PP and ISC by 6 and 13.5 percentage points respectively, on a large benchmark of 752 docking cases for which evolutionary information can be used (Yu and Guerois, 2016). It also improves the top 10 success rate of the residue-level statistical potential from InterEvScore by 6.5 percentage points. We then use a consensus approach to take advantage of the complementarity between different scores. The top 10 success rate of a consensus integrating FRODOCK2.1 with InterEvScore and SOAP-PP increases from 32% to 36% when including the homology-enriched score variants. A more time-consuming consensus combining all scores with an explicit homolog representation reaches a 40% top 10 success rate.
2 METHODS
2.1 Docking benchmark
We evaluated docking methods using the large docking benchmark PPI4DOCK (Yu and Guerois, 2016), where unbound structures unavailable from experiments were modelled by homology from unbound homologous templates. Each case in PPI4DOCK is associated with a coMSA, i.e. a pair of joint MSAs for the two docking partners. To focus on cases with enough co-evolutionary information, we excluded antigen-antibody interactions and cases with fewer than 10 sequences in their coMSAs. Sampling was performed using FRODOCK2.1 (see supplementary methods section 5.1.1) and only the top 10,000 decoys were kept. Near-native decoys were defined as being of Acceptable or better quality following the criteria from CAPRI (Critical Assessment of PRediction of Interactions) (Mendez, et al., 2003). To focus the study on scoring performance, only cases with a near-native within the top 10,000 FRODOCK2.1 decoys were used for benchmarking. This resulted in a final benchmark of 752 cases (supplementary Table 5-1).
Performance was measured by top N success rate, i.e. the percentage of cases with at least one near-native in the top N ranked decoys. We especially focus on the top 10 success rate traditionally used as a docking metric, and the top 50 success rate since consensus computation typically involves the top 50 decoys of each score (see section 2.2.1). Additional metrics are available in the supplementary information (section 5.1.2).
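The top N success rate defined above can be sketched in a few lines. This is a minimal illustration with a hypothetical data layout (the function name and the per-case boolean lists are not part of the original pipeline): each case maps to its decoys ordered by score, with True marking a near-native.

```python
def top_n_success_rate(ranked_decoys, n):
    """Percentage of cases with at least one near-native in the top n decoys.

    ranked_decoys: dict mapping a case name to a list of booleans ordered by
    score (True = near-native, i.e. CAPRI Acceptable quality or better).
    """
    hits = sum(1 for labels in ranked_decoys.values() if any(labels[:n]))
    return 100.0 * hits / len(ranked_decoys)

cases = {
    "case_A": [False, True, False],   # near-native at rank 2 -> top 10 hit
    "case_B": [False, False, False],  # no near-native among scored decoys
}
print(top_n_success_rate(cases, 10))  # 50.0
```

The same function evaluates top 1, top 10 or top 50 success rates by changing `n`.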
2.2 Scoring functions
In addition to FRODOCK2.1’s integrated score (Ramírez-Aportela, et al., 2016), we rescored decoys and their threaded homologs with InterEvScore, SOAP-PP, and Rosetta interface score (ISC).
InterEvScore combines co-evolutionary information taken from coMSAs with a residue-level statistical potential (Andreani, et al., 2013). It was re-implemented to accelerate the scoring step (see supplementary methods section 5.1.3).
SOAP-PP is an atomic-level statistical potential score integrating distance-dependent potentials learnt from a set of experimental complex structures and normalised on a set of incorrect PatchDock decoys (Dong, et al., 2013). Here, we use a faster in-house implementation of this score (see supplementary methods section 5.1.3).
Rosetta interface score (ISC) includes a linear combination of non-bonded atom-pair interaction energies and empirical and statistical potentials, among other terms (Chaudhury, et al., 2011; Gray, et al., 2003). This score is calculated by subtracting the total energy of both monomeric structures from the total energy of the complex structure. Since Rosetta ISC is sensitive to small variations and clashes at the interface, we included high-resolution interface side-chain optimisation as a scoring option (see supplementary methods section 5.1.3). Decoys for which Rosetta scoring did not converge after 10 iterations were assigned the worst score for that case. As Rosetta ISC scoring can take up to a couple of minutes per structure, we scored only the top 1,000 FRODOCK2.1 decoys (denoted 1k below) per case rather than 10,000 (denoted 10k).
2.2.1 Consensus scores
The aim of the consensus is to preferentially select decoys supported by several scores. Consensus calculations were performed similarly to InterEvDock2 (Quignot, et al., 2018) to obtain a set of 10 most likely decoys depending on the agreement between several scoring functions. Here, we apply consensus scoring to combinations of 3 to 5 different scoring functions. For a given set of scoring functions, ordered according to their individual performances from best to worst performing, the top 10 decoys of each scoring function receive a convergence count based on the number of similar decoys (defined as L-RMSD ≤ 10 Å) that are found in the top 50 decoys of each other scoring function. The final 10 consensus decoys are selected iteratively by decreasing convergence count (if > 1). In the case of a tie, decoys are selected according to the ranking order of their respective scoring functions. Note that decoys are added to the top 10 consensus only if they are not structurally redundant with the already selected ones (L-RMSD > 10 Å). If necessary, the consensus list is completed up to 10 decoys by selecting the top 4, 3, 3 decoys for a consensus between three scoring functions (or the top 3, 3, 2, 2 or top 2, 2, 2, 2, 2 decoys for a consensus between four or five scoring functions, respectively).
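The consensus selection described above can be sketched as follows. This is a simplified illustration under stated assumptions: `tops[f]` lists decoy identifiers ranked by scoring function f (best first, with functions ordered from best- to worst-performing), and `similar(a, b)` is a placeholder for the L-RMSD ≤ 10 Å structural similarity test; one plausible reading of the convergence count (number of other functions whose top 50 contains a similar decoy) is used.

```python
def consensus_top10(tops, similar, n_top=10, n_window=50):
    """Select up to n_top decoys preferentially supported by several scores."""
    funcs = list(tops)  # ordered from best- to worst-performing function
    # Convergence count: for each decoy in a function's top 10, count the
    # other functions whose top 50 contains a similar decoy.
    counts = []
    for i, f in enumerate(funcs):
        for rank, d in enumerate(tops[f][:n_top]):
            c = sum(any(similar(d, e) for e in tops[g][:n_window])
                    for g in funcs if g != f)
            counts.append((c, i, rank, d))
    # Pick by decreasing convergence count (> 1 required); ties follow the
    # function order then the rank; skip decoys redundant with selected ones.
    selected = []
    for c, i, rank, d in sorted(counts, key=lambda t: (-t[0], t[1], t[2])):
        if c > 1 and all(not similar(d, s) for s in selected):
            selected.append(d)
            if len(selected) == n_top:
                return selected
    # Complete the list from each function's ranking (top 4/3/3 decoys for
    # 3 scores, 3/3/2/2 for 4 scores, 2/2/2/2/2 for 5 scores).
    quota = {3: [4, 3, 3], 4: [3, 3, 2, 2], 5: [2, 2, 2, 2, 2]}[len(funcs)]
    for q, f in zip(quota, funcs):
        taken = 0
        for d in tops[f]:
            if len(selected) >= n_top or taken >= q:
                break
            if all(not similar(d, s) for s in selected):
                selected.append(d)
                taken += 1
    return selected
```

With strict identity as the similarity test, a decoy ranked highly by all three functions is selected first, and the remaining slots are filled non-redundantly from each ranking.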
2.3 Docking strategy to integrate evolutionary information
The proposed homology-enriched docking pipeline consists of four steps outlined in Figure 2-1. First, we dock the query proteins A and B, whose complex 3D structure we aim to predict, using FRODOCK2.1 (Ramírez-Aportela, et al., 2016). This results in a set of rotational and translational transforms that define a maximum of 10,000 complex decoys (Figure 2-1A). In parallel, we construct coMSAs and subsample them to a subset of M pairs of homologs (proteins A1 and B1, A2 and B2, …, AM and BM, homologs of query proteins A and B respectively) (see section 2.3.1). We model the unbound structures of this subset of M pairs of homologs, using the threading function from RosettaCM’s pipeline (Song, et al., 2013) and the unbound query protein structures as templates (see Figure 2-1B and section 2.3.2). We then generate complex equivalents to each query decoy by applying the translational and rotational transforms obtained in the docking step to each pair of homologs. Figure 2-1C illustrates this reconstruction for the first pair of homologs (proteins A1 and B1). Finally, we average scores over the query decoy itself and its equivalent homolog decoys to obtain a final per-decoy score that integrates all the information (Figure 2-1D). Note that for one case, we have to compute (M+1) × N scores to obtain the final ranking of N decoys. The scoring functions we used are described in section 2.2. All steps of the pipeline are easily parallelisable to reduce end-user runtime, whether through MPI (sampling step) or by splitting over decoys (scoring steps).
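The reconstruction and averaging steps above can be sketched schematically. This is a minimal illustration under simplifying assumptions, not the actual pipeline: partner A is held fixed, each decoy is defined by a rigid-body transform (rotation, translation) applied to partner B, `score_fn` stands in for any scoring function (SOAP-PP, ISC, ...), and all function names are hypothetical.

```python
import numpy as np

def homolog_decoy(coords_b, rot, trans):
    """Apply one decoy's rigid-body transform to a partner-B structure."""
    return coords_b @ rot.T + trans

def homology_enriched_score(score_fn, query_a, query_b, transforms, homolog_pairs):
    """Average score_fn over each query decoy and its M homolog equivalents.

    Each (rot, trans) transform defines one decoy; this implies (M + 1) x N
    scoring calls for N decoys and M threaded homolog pairs.
    """
    final = []
    for rot, trans in transforms:
        scores = [score_fn(query_a, homolog_decoy(query_b, rot, trans))]
        for hom_a, hom_b in homolog_pairs:  # same transform, homolog coordinates
            scores.append(score_fn(hom_a, homolog_decoy(hom_b, rot, trans)))
        final.append(float(np.mean(scores)))
    return final

# Toy check with a centroid-distance "score" and an identity transform:
dist = lambda a, b: float(np.linalg.norm(a.mean(0) - b.mean(0)))
qa, qb = np.zeros((1, 3)), np.array([[1.0, 0.0, 0.0]])
ha, hb = np.zeros((1, 3)), np.array([[3.0, 0.0, 0.0]])
identity = (np.eye(3), np.zeros(3))
print(homology_enriched_score(dist, qa, qb, [identity], [(ha, hb)]))  # [2.0]
```

Because each decoy and each homolog is scored independently, the scoring step parallelises trivially over decoys, as noted above.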
2.3.1 Subsampling homologs in the coMSAs
Homologous sequences used in scoring were taken from the coMSAs provided with the PPI4DOCK benchmark, reduced first to a maximum of M=40 and then to M=10 sequences (plus the query sequence) to limit computational time. Indeed, it was already observed with InterEvScore that co-evolutionary information can be extracted from alignments with as few as 10 sequences (Andreani, et al., 2013). The sequences in the coMSAs are ordered by decreasing average sequence identity with the query sequences. This ordering is taken into account when sub-selecting sequences so as to keep a representative subset in both reduced coMSAs. Sequence selection was performed in three steps. First, the number of sequences was cut at 100, as in the InterEvDock2 pipeline. Then the alignment was filtered with hhfilter 3.0.3 (Remmert, et al., 2011) from the hh-suite package. hhfilter was applied with the “-diff X” option on the concatenated coMSAs and the value of X was adjusted for each case to return a reduced alignment with no more than 41 sequences (i.e. the query + 40 homologs). At this stage, we obtain a first set of reduced coMSAs with a maximum of 40 sequences, which we call coMSA40, and which is representative of the full diversity of the initial coMSAs. Finally, 11 sequences (i.e. the query + 10 homologs) were selected at evenly spaced positions within coMSA40 in order to preserve sequence diversity relative to the initial coMSAs (see supplementary methods section 5.1.4). The final set of reduced coMSAs is called coMSA10.
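The final subsampling step (coMSA40 to coMSA10) can be sketched as follows. This is a minimal illustration, assuming sequences are already ordered by decreasing identity to the query (query first); the function name is ours, and the exact selection scheme in the pipeline may differ in detail (see supplementary methods section 5.1.4).

```python
def uniform_subsample(sequences, n_keep=11):
    """Keep the query (index 0) plus evenly spaced homologs, n_keep in total.

    Even spacing over an identity-ordered alignment preserves the sequence
    diversity of the input while shrinking it to n_keep sequences.
    """
    if len(sequences) <= n_keep:
        return list(sequences)
    step = (len(sequences) - 1) / (n_keep - 1)
    idx = sorted({round(i * step) for i in range(n_keep)})
    return [sequences[i] for i in idx]

coMSA40 = [f"seq{i}" for i in range(41)]   # query + 40 homologs
coMSA10 = uniform_subsample(coMSA40)       # query + 10 homologs
print(coMSA10[0], len(coMSA10))  # seq0 11
```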
2.3.2 Threading models
Pairwise alignments between the template structure and the homolog sequence to be modelled were directly extracted from the reduced coMSAs. The templates used for threading were the unbound template structures provided in the PPI4DOCK benchmark (Yu and Guerois, 2016) (see supplementary methods section 5.1.5).
Rosetta’s threading programme, the first step in the RosettaCM pipeline (Song, et al., 2013), was used to thread the homologous sequences onto the template structure. We used Rosetta 3.8 (version 2017.08.59291). No insertions, N- or C-termini were modelled. This resulted in gapped and mainly structurally conserved threaded models of the homologs, where backbone coordinates remained unchanged and side-chain rotamers differed from the template’s side-chains only if the residue type changed between the template and the homologous sequence (Figure 2-1B).
3 RESULTS
3.1 Consensus approach with implicit homology scoring
In previous work, we integrated evolutionary information implicitly at the coarse-grained level by scoring decoys with residue-based InterEvScore (noted IES) (Andreani, et al., 2013). In IES, for each decoy, we enumerate all residue-level interface contacts. We then use a residue-level statistical potential to score decoys, considering all sequences in a coMSA and assuming that the same contacts are conserved in all homologous interfaces.
We also combined InterEvScore with complementary scores FRODOCK2.1 and SOAP-PP (supplementary Figure 5-1A) in a three-way consensus score, denoted Cons3, which preferentially selects decoys supported by several scores (section 2.2.1) (Quignot, et al., 2018; Yu, et al., 2016). Compared to individual scores, we observed a notable boost of about 8 points in the top 10 success rate using Cons3, which captures a near-native in the top 10 decoys in 32% of the cases (Table 3-1 and Figure 3-1A).
This complementarity between the examined scores, in particular SOAP-PP and InterEvScore, (supplementary Figure 5-1A) prompted us to attempt a more explicit integration of evolutionary information into the various scores. Following the pipeline described in methods section 2.1 (Figure 2-1), in the next sections, we include evolutionary information into individual scores InterEvScore and SOAP-PP through explicit atomic-level models of homologous decoys.
3.2 InterEvScore with explicitly modelled homologs
For efficiency, we represent homologs at atomic resolution by threading their sequences onto the query structure (section 2.3.2). As a first step to validate this new representation of evolutionary information, we test the performance of InterEvScore on these threaded models and compare it with the original InterEvScore. With the threaded models, atomic contacts are re-defined in each homolog at an explicit level, rather than implicitly deduced from the coMSAs as in the original InterEvScore. In practice, we calculate the threaded homolog version of InterEvScore (denoted IES-h) by scoring query decoys and their threaded homolog equivalents with the InterEvScore statistical potentials (section 2.3). The final score of each query decoy corresponds to the average score over the query decoy itself and its homologs.
Table 3-2 and Figure 3-1A show the performance of IES-h40, i.e. IES-h computed using threaded homologs from the set of reduced coMSAs with a maximum of 40 sequences (coMSA40, see section 2.3.1). Results for the original InterEvScore with complete coMSAs (IES) and coMSAs40 (IES40) are also shown for comparison. Reducing the number of sequences to 40 does not strongly affect performance in terms of the top 10 and top 50 success rates. However, the top 10 success rate increases from 23.8% to 27.0% when using explicit threaded models (IES-h40) instead of only implicit coMSA information (IES40). Of note, a variant of InterEvScore without evolutionary information, where only the query decoy gets scored by the statistical potential, has a much lower top 10 success rate of 20.5% (supplementary Table 5-2).
The difference in performance between IES40/10k and IES-h40/10k can be explained by the fact that, in IES-h40, contacts are not extrapolated from the query interface network anymore but are redefined in each homolog based on their modelled interface structure.
3.3 Homology-enriched SOAP-PP
Having explicit structures at atomic resolution corresponding to each homolog enables us to score them directly using an atomic potential such as SOAP-PP (Dong, et al., 2013), which might better exploit the atomic detail of homologs for the final ranking of query decoys. As for the threaded version of InterEvScore, homology-enriched SOAP-PP (SPP-h40) consists of the SOAP-PP score averaged over the query decoy and all its threaded homologs.
SPP-h40 performs better than SOAP-PP on the query decoys alone (Table 3-3 and Figure 3-1A). Using threaded homology models in this way gives a large performance boost to SOAP-PP (+6 percentage points on the top 10 success rate). SPP-h40 also outperforms InterEvScore and IES-h40 (section 3.2) as well as the FRODOCK2.1 score (section 3.1).
3.4 Homology-enriched Rosetta interface score (ISC)
Since we build atomic-level homolog models of decoys, we can score them explicitly using a physics-based score such as Rosetta ISC. As Rosetta scoring is much more computationally expensive (about 750 times slower) than SOAP-PP and InterEvScore, to compute homology-enriched ISC, the number of decoys was reduced to 1,000 (as ranked by FRODOCK2.1) and the number of homologs to 10 (coMSAs10, section 2.3.1).
As above, homology-enriched ISC consists of the average score of the query and its homologous decoys (ISC-h10). For easier comparison, homology-enriched InterEvScore and SOAP-PP were evaluated under the same conditions (i.e. 1,000 decoys and coMSAs10) (Table 3-4 and Figure 3-1B). Their success rates are very similar to those with 10,000 decoys and coMSAs40 (supplementary Table 5-3). Even though ISC on query decoys performs worse than SPP-h and IES-h, ISC-h10 largely outperforms the previously best individual score, SPP-h10, with a 34.4% top 10 success rate (259 cases) compared to 30.2% (227). With only 165 successful cases in common, SPP-h10 and ISC-h10 remain highly complementary (supplementary Figure 5-1B).
Note that for scores calculated on the top 1,000 FRODOCK2.1 decoys, success rates are capped at 77.1%, as only 580 of the 752 cases in our benchmark have a near-native within this subset of decoys. In light of this, the ISC-h10/1k performance is all the more remarkable.
3.4.1 Using ISC to re-score homology-enriched decoys
ISC-h10 showed the highest top 10 success rate of all scores tested above, but scoring 1,000 × 11 decoys with Rosetta ISC is excessively time-consuming in a generalised docking context, taking approximately 137 CPU hours per case (supplementary Table 5-4). One way to reduce the total scoring time is to score only a pre-selected subset of decoys, using Rosetta ISC as a second step in the scoring pipeline.
In Cons3, we pre-selected the top 50 decoys of FRODOCK2.1, InterEvScore, and SOAP-PP. Similarly, here we use the top 50 decoys of the top-performing homology-enriched score variants tested above, namely SPP-h40/10k and IES-h40/10k, as well as FRODOCK2.1. These scores are highly complementary in terms of top 10 success rate, with only 67 successful cases in common between all three (supplementary Figure 5-1C). Using this subset of 150 pre-selected decoys for ISC scoring (referred to with /150h) reduced scoring times by approximately a factor of 7. This set of 150 decoys is enriched in near-natives since they were pre-selected by three already well-performing scores, but only 476 out of 752 cases in our benchmark possess a near-native in this subset.
In terms of the top 10 success rate, both ISC-h10 and ISC perform better on 150 than 1,000 decoys with 36.0% and 29.0% top 10 success rate instead of 34.4% and 20.9%, respectively (Table 3-5 and Figure 3-1B). Here again, the addition of evolutionary information to ISC through the threaded homolog models remarkably increases its performance. ISC-h10/150h has the best performance of all tested scores so far, for a much lower computational cost than ISC-h10/1k.
3.5 Homology-enriched consensus scoring
As a first step, we calculate Cons3-h, the homology-enriched variant of the Cons3 base consensus score presented in section 3.1. Calculating a three-way consensus using the higher-performing homology-enriched variants (Cons3-h) instead of their original counterparts (Cons3) increases the top 10 success rate from 32% to 36% (Table 3-6 and Figure 3-1A). Cons3-h performs as well as ISC-h10/150h while being computed on the same top 150 decoys, and its computation is about 20 times faster.
Out of the 271 successful cases for Cons3-h and ISC-h10/150h, only 199 are common to both. Moreover, ISC and ISC-h10 remain complementary to SPP-h40/10k, IES-h40/10k, and FRODOCK2.1 (supplementary Figure 5-1D and Figure 5-1E). This led us to test four- and five-way consensus approaches to combine ISC optimally with other homology-enriched scores. We tested two four-way consensuses that integrate ISC without homology on 1,000 or 150 decoys (Cons4-h/1k and Cons4-h/150h, respectively) and two five-way consensuses that integrate ISC both with and without homology on 1,000 or 150 decoys (Cons5-h/1k and Cons5-h/150h, respectively). Performances are reported in Figure 3-1B and Table 3-6, together with time estimates when parallelising the whole pipeline on 4 CPUs.
With five-way consensus Cons5-h/1k, the top 10 success rate rises to 304 cases (40.4%). Unfortunately, computation time strongly increases, since we have to compute ISC-h10 on 1,000 decoys. The most time-effective consensus, Cons3-h, has 36.0% top 10 success rate and the same top 1 success rate as Cons5-h/1k (Figure 3-1B and supplementary Table 5-5).
4 DISCUSSION
In InterEvScore (Andreani, et al., 2013), evolutionary information improved protein-protein scoring performance when given implicitly through coMSAs and coupled with a coarse-grained, residue-level statistical potential. Combining InterEvScore with complementary scoring functions FRODOCK2.1 and SOAP-PP by computing a consensus (Quignot, et al., 2018; Yu, et al., 2016) improved over the individual scores, reaching a 32% top 10 success rate (see Table 3-1). However, this strategy did not take full advantage of the three scores’ complementarity, and we thus decided to directly combine evolutionary information from coMSAs with atomic scores such as SOAP-PP. To this aim, we threaded coMSA homologs of docked query proteins and scored homologous decoys together with each query decoy.
With this new explicit implementation of evolutionary information, we tested a variant of InterEvScore where we scored decoys and their modelled homologs with a residue-level statistical potential. This modified version (named IES-h) had a slightly improved success rate compared to the implicit homology version (see Table 3-2). The explicit representation of homologous decoys enabled us to build homology-enriched versions of the atomic scores SOAP-PP (SPP-h) and Rosetta ISC (ISC-h). For both, adding homology drastically improved top 10 success rates (see Table 3-3 and Table 3-4), even when coMSAs were down-sampled to a maximum of 10 homologous sequences. The Rosetta homology-enriched version, ISC-h10, had outstanding performance, but it was also the most time-consuming score, about 750 times slower than SOAP-PP or InterEvScore. A first compromise between computation time and performance was to run ISC-h10 on a pre-selection of 150 decoys defined by the top 50 decoys of SPP-h40/10k, IES-h40/10k, and FRODOCK2.1 (see Table 3-5). This score had the same top 10 success rate (36%) as a much faster consensus score involving the same top 150 decoys. Taking further advantage of this complementarity, different four- and five-way consensus calculations reached top 10 success rates from 36.7% to 40.4% at runtimes ranging from 45 minutes to 34.5 hours on four CPUs (Table 3-6).
Our homology-enriched scoring scheme is robust to changes in the definition of near-natives (supplementary Table 5-6) and in evaluation metrics (supplementary Table 5-7). Using a more stringent definition of near-natives (at least Medium quality according to CAPRI criteria) still allows homology enrichment to boost the predictive performance of scoring functions. However, consensus scores become less efficient than the best individual scoring functions, probably because, when grouping decoys with a relatively loose similarity criterion (see methods section 2.2.1), we do not manage to selectively up-rank Medium quality decoys (supplementary Table 5-6).
We further tried to understand the origin of the large performance improvements obtained through homology enrichment. Scoring performance improves when near-natives are recognised better (positive selection) or when wrong decoys are down-ranked (negative selection). In the homology-enriched scores described in this work, correct decoys could be up-weighted by conserved interfaces in the homologous decoys and, at the same time, incorrect decoys could be discredited by statistically incompatible, clashing, or incomplete homologous decoys (since insertions relative to the query structures were not modelled). We first explored the simplest explanations, namely deletions and/or clashes at the interface of homologs that would pull down the average score of the incorrect decoys. However, this does not seem to be the main driving force of ISC-h10’s success over ISC, as neither the number of gaps nor the number of clashes (defined as heteroatom contacts under 1.5 Å) at the interface of homologous decoys strongly correlates with the given scores. Additionally, ranking using only the repulsive van der Waals component of the Rosetta score (fa_rep) performs very poorly in comparison to other scoring schemes, with at most 34 out of 752 cases with correctly identified near-natives in the top 10 (supplementary Table 5-8). Finally, IES-h, SPP-h, or ISC-h variants where only the worst homologous decoys are taken into account when scoring each query decoy showed systematically worse performance than using the full range of homologous decoys for each query decoy (supplementary Table 5-8). This means that the performance of the homology-enriched scores is positively driven by the recognition of correct decoys rather than the exclusion of incorrect decoys through the presence of clashes or gaps.
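The interface clash count used in this analysis can be sketched as follows. This is a minimal illustration under our reading of the criterion (pairs of atoms from the two partners closer than 1.5 Å); the function name and the plain coordinate-list representation are ours.

```python
def count_interface_clashes(coords_a, coords_b, cutoff=1.5):
    """Count inter-chain atom pairs closer than the clash cutoff (in Å).

    coords_a, coords_b: lists of (x, y, z) atom coordinates, one per partner.
    A brute-force pair scan is enough for a sketch; real pipelines would use
    a spatial index for speed.
    """
    c2 = cutoff * cutoff  # compare squared distances to avoid sqrt
    return sum(
        1
        for xa, ya, za in coords_a
        for xb, yb, zb in coords_b
        if (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2 < c2
    )

a = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
b = [(1.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
print(count_interface_clashes(a, b))  # 1
```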
Improvement of the SOAP-PP and Rosetta ISC scoring functions by homology enrichment is significant (supplementary Figure 5-2) and consistent over difficulty categories (supplementary Table 5-9). When splitting results over PPI4DOCK difficulty categories, we observe that the strongest relative gain for the SPP-h and ISC-h homology-enriched scores compared to their versions without homology occurs on “very_easy” cases, followed by “easy” cases (supplementary Table 5-9). A few cases are gained in the “hard” category, but the “very_hard” category remains largely inaccessible to the tested scores, even though our benchmark is limited to cases where at least one near-native decoy was sampled in the top 10,000 FRODOCK2.1 decoys (there are only 16 such “very_hard” cases). Consensus scoring also consistently improves results over the “very_easy”, “easy” and “hard” categories, in order of decreasing improvement. We hypothesise that correct ranking of very_easy and easy decoys is mainly dependent on the ability to score positively native-like models while more difficult categories would also require integration of flexibility, an ongoing challenge of protein docking (Desta, et al., 2020; Torchala, et al., 2013).
In this work, we developed a strategy to enrich scoring functions with evolutionary information by including atomic-level models for as few as ten homologs. This strategy improves the performance of several scores with different properties: InterEvScore (supplementary Table 5-10), SOAP-PP, and Rosetta ISC. This means that homology enrichment can in principle be applied to any scoring function with at most a ten-fold increase in runtime. This enrichment works with a very small number of sequences compared e.g. to the large MSAs needed by covariation methods to pick up coevolutionary signal, highlighting complementarity between the two approaches, which may be exploited by using additional DCA-derived constraints, e.g. in intermediate cases with a few hundred homologous sequences (Cong, et al., 2019; Simkovic, et al., 2017). The docking success boost also opens interesting perspectives regarding the large-scale application of structural prediction to interaction networks. Finally, with the rise of machine learning techniques in computational biology, one can expect interesting future developments using these approaches to further enhance the extraction of (co)evolutionary signal from coMSAs.
5 SUPPLEMENTARY INFORMATION
5.1 Supplementary methods
5.1.1 Docking parameters
In the docking pipeline based on FRODOCK2.1, all parameters were set to default except the following. Docking with the frodock executable used the “-t O” option for “other” complexes (neither enzyme nor antibody-antigen). Clustering with frodockcluster was run with the -d 4 option, i.e. setting an L-RMSD threshold of 4 Å for clustering.
5.1.2 Alternative evaluation metrics: DockQ and nDCG
A more recent evaluation of decoy quality is given by the continuous DockQ score (Basu and Wallner, 2016), a metric going from 0 (bad quality) to 1 (high quality), which closely reflects the already-existing CAPRI quality categories. To integrate the DockQ score into a general performance measurement over our benchmark, we made use of the discounted cumulative gain (DCG) as in (Geng, et al., 2019). The DCG for each case is calculated as follows:

\mathrm{DCG} = \sum_{rank=1}^{N} \frac{DockQ_{rank}}{rank}

where rank is the rank of the decoy, DockQ_rank is the DockQ score of the decoy with that rank and N is the number of top-ranked decoys taken into account for this measurement. The 1/rank factor gives more importance to the quality of the top-scoring decoys. An ideal DCG (iDCG) is also calculated in order to normalise the DCG, by reordering all decoys by decreasing DockQ score. A final normalised value (nDCG) for each case is obtained by dividing the DCG by the iDCG; a single benchmark-wide value is then obtained by averaging the nDCG over all cases. Note that to speed up computations, decoys with a fraction of native contacts under 0.1 were given a default DockQ score of 0.
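The nDCG computation can be sketched in a few lines. This is a minimal illustration with hypothetical names: `dockq_by_rank` is the list of per-decoy DockQ scores for one case, in the order given by the scoring function under test.

```python
def dcg(dockq_by_rank, n):
    """Discounted cumulative gain over the top n decoys."""
    # The 1/rank weighting gives more importance to top-scoring decoys.
    return sum(q / rank for rank, q in enumerate(dockq_by_rank[:n], start=1))

def ndcg(dockq_by_rank, n=10):
    """DCG normalised by the ideal DCG (decoys reordered by decreasing DockQ)."""
    ideal = dcg(sorted(dockq_by_rank, reverse=True), n)
    return dcg(dockq_by_rank, n) / ideal if ideal > 0 else 0.0

print(round(ndcg([0.0, 0.8, 0.4], n=3), 3))  # 0.533
```

A perfect ranking gives nDCG = 1; the benchmark-wide figure is simply the mean nDCG over all cases.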
5.1.3 Scoring functions
We employed an in-house implementation of SOAP-PP that enables much more efficient scoring, since decoy coordinates do not need to be explicitly generated. Only a slight reduction in performance on the 752 benchmark cases, compared to the original SOAP-PP implementation, was observed (supplementary Table 5-11).
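The efficiency gain comes from the fact that a rigid-body decoy is fully described by a rotation and translation of one partner, so a distance-dependent pair potential can be evaluated by transforming coordinates in memory rather than writing each decoy to disk. A hedged illustration of this idea follows; the lookup-table layout, atom typing, cutoff and bin width are placeholders, not the actual SOAP-PP parameters:

```python
import numpy as np

def score_decoy(rec_xyz, lig_xyz, rec_types, lig_types,
                R, t, potential, cutoff=7.0, bin_width=0.5):
    """Score one rigid-body decoy without generating its coordinates on disk.

    potential[ti, tj, b] is an assumed SOAP-PP-like lookup table indexed by
    the two atom types and a distance bin; R (3x3) and t (3,) describe the
    rigid-body transform placing the ligand partner for this decoy.
    """
    lig = lig_xyz @ R.T + t  # place the ligand in memory only
    dist = np.linalg.norm(rec_xyz[:, None, :] - lig[None, :, :], axis=-1)
    score = 0.0
    for i, j in zip(*np.nonzero(dist < cutoff)):
        b = int(dist[i, j] / bin_width)  # distance bin of this atom pair
        score += potential[rec_types[i], lig_types[j], b]
    return score
```

Scoring many decoys then amounts to looping over (R, t) pairs with the receptor and ligand coordinates loaded once.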
We also re-implemented InterEvScore for efficiency reasons. We introduced two variations compared to the best original InterEvScore settings (Andreani, et al., 2013): we defined interface contacts through distance thresholds (“distance mode”) instead of tessellation (“alpha mode” in InterEvScore), and we took evolutionary information into account for all interface residues instead of apolar patches only (the so-called “standard mode” of the original implementation). InterEvScore outputs several scoring variants; here, we used a 3-body-based variant and 2Bbest (Andreani, et al., 2013). In both, each interface residue contributes to the final score through the potential of its best 2- or 3-body contact and the potential of its equivalents in the homolog sequences. The 3-body variant was found to perform best when scoring with homolog sequences (InterEvScore with implicit homology) (Andreani, et al., 2013) and was thus used in this context. 2Bbest was used when scoring the explicitly modelled side-chains of our homologs (InterEvScore with explicit homology, IES-h). Indeed, we found that 3-body potentials are less discriminative than 2-body potentials in the context of explicitly modelled decoys (supplementary Table 5-12).
We used Rosetta 3.8 (version 2017.08.59291) and the beta_nov15 Rosetta score function. Before scoring with Rosetta ISC, we performed high-resolution interface side-chain optimisation using the ‘use_input_sc’ and ‘docking_local_refine’ options of Rosetta’s docking_protocol executable. We also tried adding the ‘dock_min’ option (for even more conservative modelling and shorter scoring runtimes), but scoring results were degraded.
5.1.4 Details on coMSA calculation
Compared to the original PPI4DOCK database (Yu and Guerois, 2016), coMSAs were slightly adjusted by realigning the first sequence (query) with all other sequences (considered as a block) using MAFFT (Katoh and Standley, 2013).
When building reduced coMSA40 from the readjusted PPI4DOCK coMSAs, coMSAs that already had under 40 sequences before the hhfilter step were not filtered.
The 10 sequences in coMSA10 were selected from coMSA40 as follows. Let q and r be the quotient and remainder of the Euclidean division by 10 of the number of sequences in the coMSA40 (including the query). Starting from the first sequence, the next sequence was selected every q+1 sequences for the first r steps, then every q sequences until the end, including the last sequence. This results in 11 selected sequences: the query followed by the 10 homolog sequences.
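The selection scheme above can be sketched as follows. Since the exact index arithmetic is not fully specified, this sketch divides n-1 (rather than n) by 10 so that the last sequence is always reached; this is one consistent reading of the description, and the function name is hypothetical:

```python
def select_comsa10(seqs):
    """Pick the query plus 10 evenly spread homologs from a coMSA40.

    seqs: sequences of the coMSA40, query first. Assumed reading of the
    selection rule: divide n-1 by 10 (quotient q, remainder r), take r
    steps of size q+1 and then steps of size q, ending on the last one.
    """
    n = len(seqs)
    if n <= 11:            # nothing to reduce
        return list(seqs)
    q, r = divmod(n - 1, 10)
    idx = [0]              # the query
    for step in range(10):
        idx.append(idx[-1] + (q + 1 if step < r else q))
    return [seqs[i] for i in idx]
```

For instance, with 25 sequences (q = 2, r = 4) the indices 0, 3, 6, 9, 12, 14, 16, 18, 20, 22 and 24 are kept, the last one being the final sequence of the coMSA40.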
5.1.5 Threading models
The PPI4DOCK benchmark contains docking targets based on unbound homology models of pairs of binding partners for which an experimental complex structure is available. Using homology models as unbound partners makes it possible to expand the benchmark, since it alleviates the need to identify complexes for which experimental structures of both the interface and the exact two binding partners have been solved. This makes the benchmark larger but, as a counterpart, the unbound structures used for docking in PPI4DOCK are themselves homology models.
In a docking context where we know the structures of the unbound partners, we would build homology models for all sequences in the coMSA by using the two query structures as modelling templates. However, since in PPI4DOCK the unbound query structures are themselves homology models, this would mean building a model by using a homology model as a template, and we felt this succession of modelling steps would lead to a loss in model precision. Therefore, the templates used for threading coMSA sequences were the unbound templates used to build the PPI4DOCK unbound models.
Template protein sequences were directly extracted from their structures and aligned onto the coMSAs with MAFFT sequence-profile alignments (Katoh and Standley, 2013), from which the pairwise homolog-template alignments were directly extracted. coMSAs were stripped down to the positions covered by the query sequence. To ensure that each template structure exactly matched the template sequence in the stripped pairwise alignments, both template sequences were re-aligned using clustalw (Larkin, et al., 2007), and irrelevant residues thus identified in the template structures were removed.
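Stripping a coMSA to the positions covered by the query can be sketched as follows (hypothetical helper; gaps denoted '-'):

```python
def strip_to_query(msa):
    """Keep only the alignment columns where the query (first sequence)
    has a residue, i.e. drop every column where the query shows a gap.

    msa: list of aligned sequences of equal length, query first.
    """
    keep = [i for i, c in enumerate(msa[0]) if c != '-']
    return ["".join(seq[i] for i in keep) for seq in msa]
```

For example, strip_to_query(["A-CD", "AB-D"]) returns ["ACD", "A-D"]: the second column is dropped because the query has a gap there.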
Threading implies that the side-chains of our homologs are mapped very conservatively onto the query template structure.
5.2 Supplementary results
5.2.1 Supplementary tables
We try to improve the baseline consensus performance by incorporating Rosetta’s physics-based interface score (ISC) (section 2.2). As Rosetta scoring is much more computationally expensive than the other two scores (about 750 times slower than SOAP-PP and InterEvScore calculations), we score only the top 1,000 decoys (as ranked by FRODOCK2.1) with ISC. This score is denoted ISC/1k, as opposed to IES/10k and SPP/10k. As a result, ISC individually performs less well than the other scores in terms of top 10 success rate, even when InterEvScore and SOAP-PP are computed only on the top 1,000 FRODOCK2.1 decoys (supplementary Table 5-10). However, the top 50 success rate is higher for ISC/1k than for any other individual score except SOAP-PP calculated on 10,000 decoys (supplementary Table 5-10). Despite this, integrating the top 50 decoys ranked by ISC/1k with the top 50 decoys of the other three scores into a four-way consensus, denoted Cons4, slightly degrades performance compared to Cons3 (supplementary Table 5-10) while strongly increasing computation time.
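The exact consensus scheme is described in the main text (section 2.2). As one hedged illustration only, a simple way to combine the top-50 lists of several scores is to rank decoys by how many lists they appear in, breaking ties by the best rank a decoy reaches; this voting rule is an assumption for illustration, not necessarily the published scheme:

```python
from collections import Counter

def consensus_rank(rankings, top=50):
    """rankings: score name -> list of decoy ids, ordered best-first.

    Assumed voting rule: decoys present in more top-`top` lists come
    first; ties are broken by the best (smallest) position a decoy
    reaches in any of the lists.
    """
    votes, best = Counter(), {}
    for ranked in rankings.values():
        for pos, decoy in enumerate(ranked[:top]):
            votes[decoy] += 1
            best[decoy] = min(best.get(decoy, pos), pos)
    return sorted(votes, key=lambda d: (-votes[d], best[d]))
```

With this rule, a decoy highly ranked by all four scores always precedes one that only a single score favours, which captures the intuition behind the Cons3 and Cons4 consensus variants.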
5.2.2 Supplementary figures
Acknowledgements
Benchmarking was done partly through granted access to the HPC resources of CCRT under the allocations 2018-7078 and 2019-7078 by GENCI (Grand Equipement National de Calcul Intensif). We thank Arnaud Martel for his help with setting up the data web page.
Footnotes
Funding This work was supported by Agence Nationale de la Recherche [ANR‐15‐CE11‐0008 to R.G., ANR-18-CE45-0005 to J.A.]; IDEX Paris-Saclay [IDI 2017 to C.Q.]; MINECO [BFU2016-76220-P to P.C.]; and AEI/FEDER, UE [PID2019-109041GB-C21 to P.C.].