Abstract
Protein-protein interactions play a ubiquitous role in biological function. Knowledge of the three-dimensional (3D) structures of the complexes they form is essential for understanding the structural basis of those interactions and how they orchestrate key cellular processes. Computational docking has become an indispensable alternative to the expensive and time-consuming experimental approaches for determining the 3D structures of protein complexes. Despite recent progress, identifying near-native models from a large set of conformations sampled by docking - the so-called scoring problem - still has considerable room for improvement.
We present here MetaScore, a new machine-learning based approach to improve the scoring of docked conformations. MetaScore utilizes a random forest (RF) classifier trained to distinguish near-native from non-native conformations using a rich set of features extracted from the respective protein-protein interfaces. These include physico-chemical properties, energy terms, interaction propensity-based features, geometric properties, interface topology features, evolutionary conservation, and also scores produced by traditional scoring functions (SFs). MetaScore scores docked conformations by simply averaging the score produced by the RF classifier with that produced by any traditional SF. We demonstrate that (i) MetaScore consistently outperforms each of the nine traditional SFs included in this work in terms of success rate and hit rate evaluated over the top 10 predicted conformations; and (ii) an ensemble method, MetaScore-Ensemble, that combines 10 variants of MetaScore obtained by combining the RF score with each of the traditional SFs outperforms each of the MetaScore variants. We conclude that the performance of traditional SFs can be improved upon by judiciously leveraging machine learning.
1. Introduction
Proteins are among the most abundant, structurally diverse and functionally versatile biological macromolecules. They come in many sizes and shapes and perform a wide range of structural, enzymatic, transport, and signaling functions in cells[1]. But proteins rarely act alone, as their functions are typically mediated by interactions with other molecules, including, in particular, other proteins. Alterations in protein-protein interfaces leading to abnormal interactions with endogenous proteins, proteins from pathogens, or both, are associated with many human diseases[2]. Protein interfaces have therefore become some of the most popular targets for rational drug design[3–5]. However, the development of effective therapeutic agents[6–9] to inhibit aberrant protein interactions requires a detailed understanding of the structural, biophysical, and biochemical characteristics of protein-protein interfaces. The most reliable sources of such information are X-ray crystallography[10] and nuclear magnetic resonance (NMR), which identify interfaces at the atomic level; alanine scanning mutagenesis, which identifies interfaces at the residue level; mass spectrometry-based approaches, e.g., chemical cross-linking and hydrogen/deuterium (H/D) exchange, which identify individual interfacial residues[11, 12]; NMR-based approaches[13], e.g., chemical shift perturbations, cross-saturation, and H/D exchange, which determine interfaces at the residue or atomic level[14]; and cryo-electron microscopy (cryo-EM), which can directly image large macromolecular complexes in their native hydrated state[15]. However, because of the technical challenges and the high costs and effort involved, there is still a large gap between the number of known protein-protein interactions and the availability of 3D structures for them[16]. Therefore, there is an urgent need for reliable computational approaches for predicting protein-protein interfaces and complexes.
Against this background, computational docking has emerged as a powerful tool for modelling 3D structures of protein-protein complexes[17]. Given 3D structures or models of putative protein-protein interaction partners, docking aims to generate 3D models of their complex. Docking involves two key steps: sampling of the interaction space between the protein molecules to generate docked models, and scoring of the docked conformations to single out near-native conformations among the sampled ones. There has been much recent progress on both sampling and scoring[18, 19].
The scoring functions that have been developed for protein-protein docking can be broadly grouped into several categories[20]: 1) physics-based scoring functions, which typically consist of a linear combination of energy terms; examples include those used in HADDOCK[21], pyDOCK[22], RosettaDock[23], ZRANK[24], IRAD[25], DFIRE[26], DFIRE2[27], PISA[28], and SWARMDOCK[29]; 2) statistical potential-based scoring functions, such as 3D-Dock[30], DFIRE[26, 27], DECK[31], SIPPER[32], and MJ3H[33], which typically convert distance-dependent pairwise atom-atom or residue-residue contact distributions into potentials; 3) complementarity-based scoring functions, e.g., of shape, energy, or physico-chemical characteristics[34–38]; 4) interface connectivity-based scoring functions[39, 40]; 5) evolutionary conservation-based scoring functions, e.g., InterEvScore[41]; and 6) machine learning-based scoring functions, which combine a wide range of features including residue propensities of interfaces, contact frequencies of residue pairs, evolutionary conservation, shape complementarity, energy terms, atom pair distance distributions, etc.[42–52] However, as evident from the results of recent CAPRI competitions[53], there is considerable room for improvement in both sampling and scoring[17, 54–56].
Here we introduce MetaScore, an approach to scoring docked conformations that combines any existing scoring function with a random forest[57] (RF) classifier trained to discriminate between near-native and non-native structures. The RF classifier utilizes a variety of features of the interface between the proteins in a docked conformation, including interaction propensity-based, physico-chemical, energy-based, geometric, connectivity-based, and evolutionary conservation features. We report results of experiments on a standard benchmark, the protein-protein docking benchmark version 5.0[58] (BM5), showing that MetaScore outperforms the original scoring function when the two are compared using the area under the curve of the success rate (ASR) and the area under the curve of the hit rate (AHR) for the top 10 predicted conformations. We further describe an ensemble method, MetaScore-Ensemble, which combines the score of an RF classifier, trained on interface features together with the scores of several traditional scoring methods, with the averaged score of those traditional methods; this ensemble approach outperforms MetaScore built on any single scoring method. We conclude that machine learning methods can complement traditional approaches to scoring docked conformations.
2. Materials and Methods
2.1. Training data set and preprocessing
We used the protein-protein docking benchmark version 4.0 (BM4)[59], which provides both the bound and unbound structures of protein-protein complexes, for training in our experiments, excluding antigen-antibody complexes and non-dimers. For each of the remaining cases, decoy models were generated by HADDOCK running in ab initio mode using center-of-mass restraints, following its standard three-stage docking protocol: rigid-body docking, semi-flexible refinement, and water refinement[60]. We then selected cases and their water-refined decoys using the following criteria: (1) the case has at least one decoy of acceptable or better quality (i.e., the interface root mean squared deviation (i-RMSD) of the decoy is less than or equal to 4Å)1; (2) the number of interface residues in a conformation is greater than or equal to 10. Interfacial residues were determined using an alpha carbon-alpha carbon (CA-CA) distance cutoff of 8Å between two residues belonging to the two different proteins of the conformation (a decoy or a bound form). Of the 176 cases in BM4, 63 cases with HADDOCK-generated decoys and 45 cases with only bound structures remained. We labeled a decoy near-native if its i-RMSD relative to the bound form is less than or equal to 4Å; otherwise, the decoy was labeled non-native. This process yielded 1,221 near-native and 35,957 non-native conformations, a set we refer to as the BM4 decoy set. Because the proportion of near-native to non-native conformations is highly unbalanced, we further under-sampled the non-native conformations of each case to a near-native to non-native ratio of 1:1 (after testing ratios of 1:1, 1:2, 1:4 and 1:8 using 10-fold case-wise cross-validation on the BM4 decoy set; data not shown). For training, we chose non-native decoys whose i-RMSDs are greater than 14Å (after testing cutoffs of 4, 8, 14, and 18Å; data not shown). Our final training set consists of 1,221 near-native models (i-RMSD ≤ 4Å) and 1,221 non-native models (i-RMSD > 14Å) from 108 cases.
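To make the interface definition and labeling criteria above concrete, here is a minimal Python sketch (our own hypothetical helpers, not the code used in this work):

```python
import numpy as np

def interfacial_residue_pairs(ca_a, ca_b, cutoff=8.0):
    """Identify interfacial residue pairs between two chains using the
    CA-CA distance criterion of Section 2.1.

    ca_a, ca_b: (n, 3) and (m, 3) arrays of alpha-carbon coordinates."""
    # Pairwise CA-CA distances between all residues of the two chains.
    dists = np.linalg.norm(ca_a[:, None, :] - ca_b[None, :, :], axis=-1)
    i, j = np.where(dists <= cutoff)
    return list(zip(i.tolist(), j.tolist()))

def label_decoy(i_rmsd):
    """Label a decoy following the criteria above: near-native if
    i-RMSD <= 4 A; usable as a non-native training example only if
    i-RMSD > 14 A; everything in between is excluded from training."""
    if i_rmsd <= 4.0:
        return "near-native"
    if i_rmsd > 14.0:
        return "non-native (training)"
    return "excluded from training"
```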
2.2. Test data set and preprocessing
For independent testing, we used sets of decoys generated by HADDOCK for the 55 docking cases newly added to BM5[58] (BM5 decoy set), and sets of decoys from the CAPRI competitions between CAPRI 10 and CAPRI 30, excluding non-dimers (CAPRI score set)[53]. The CAPRI score set consists of decoys generated by different docking programs, making it an ideal set for validating scoring functions independently of any particular docking program. The decoys and cases from the BM5 decoy set and the CAPRI score set were filtered using the same process as applied to the training data (the BM4 decoy set). This left 9 cases for the BM5 decoy set and 17 for the CAPRI score set, with 216 near-native and 3,384 non-native conformations, and 1,115 near-native and 3,485 non-native conformations, respectively.
2.3. Comparison with State-of-the-Art Scoring Methods
We used 10 different state-of-the-art scoring functions to test the MetaScore approach: HADDOCK[21], iScore[52], DFIRE[26], DFIRE2[27], MJ3H[33], PISA[28], pyDOCK[22], SIPPER[32], SWARMDOCK[29], and TOBI's method (TOBI)[61]. Among them, HADDOCK, DFIRE2, PISA, pyDOCK, SWARMDOCK, and TOBI are physicochemical energy-based scoring functions; SIPPER and MJ3H are statistical potential-based functions; DFIRE is based on both physicochemical energy and statistical potentials; and iScore is a machine learning-based scoring function using a random walk graph kernel.
Both iScore and MetaScore rely on machine learning. However, unlike MetaScore, which uses various features of the interfaces of near-native and non-native protein-protein conformations to train classifiers that discriminate between them, iScore utilizes node-labeled graphs to capture the details of the interfaces. Furthermore, MetaScore is an ensemble technique that can be applied to any combination of scoring functions, including iScore.
2.4. Evaluation Metrics
The performance of a scoring method in correctly ranking decoys based on i-RMSD was evaluated using two metrics: the success rate (the percentage of cases that have at least one near-native conformation among the top N conformations) and the hit rate (the overall percentage of near-native conformations that are included among the top N conformations). Both were calculated for an increasing number of predictions N varying between 1 and 400. For easier comparison, the area under the curve of the success rate (ASR) and the area under the curve of the hit rate (AHR) were computed from the corresponding success rate and hit rate curves for N between 1 and 400 predictions. We focus on ASRs and AHRs for the top 10 and top 400 predictions because, from a biologist's perspective, the top 10 decoys are the ones typically considered for further analysis[44] (CAPRI[56] competitions also allow 10 models to be submitted for evaluation), and because 400 is the number of decoys HADDOCK generally generates at its final stage for a case. All metrics are normalized between 0 and 1.
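As an illustration, the following sketch shows one way the success rate, hit rate, and their normalized areas could be computed from per-case rankings (function and variable names are our own, not from the paper):

```python
import numpy as np

def success_and_hit_rates(cases, max_n=400):
    """cases: one boolean array per docking case, holding the near-native
    flags of that case's decoys sorted by predicted rank (best first).
    Returns arrays success_rate[n-1] and hit_rate[n-1] for n = 1..max_n."""
    success = np.zeros(max_n)
    hits_found = np.zeros(max_n)
    total_hits = sum(int(flags.sum()) for flags in cases)
    for flags in cases:
        cum = np.cumsum(flags[:max_n]).astype(float)
        if len(cum) < max_n:                      # pad short decoy lists
            cum = np.pad(cum, (0, max_n - len(cum)), mode="edge")
        success += (cum > 0)                      # >= 1 hit in the top n
        hits_found += cum                         # number of hits in the top n
    return success / len(cases), hits_found / total_hits

def area_under(rate_curve):
    """Normalized area under a success/hit rate curve (ASR or AHR)."""
    return float(np.mean(rate_curve))
```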
2.5. MetaScore, a novel approach combining scores from a machine learning classifier-based scoring function with scores from a traditional scoring function
MetaScore is an approach that combines the random forest (RF)-based score produced by our RF classifier, trained using several feature types, with the score from a traditional scoring function.
2.5.1. The RF classifier
We trained an RF classifier, using a diverse set of features of the interfaces between the interacting partners in the decoys of our training data set, to discriminate between near-native and non-native conformations. Random forest (RF) is an ensemble of tree-structured classifiers suited to data sets with large numbers of training data points and input features[57]. A random forest has two hyperparameters: ntrees (the number of trees to grow) and mtry (the number of features randomly selected as candidates at each split in a tree). They were optimized using a grid search; ntrees was varied from 10 to 500 with a step of 10, and mtry from 1 to 28 with a step of 3. This hyperparameter optimization was repeated for every RF model trained in this work, e.g., models trained with different feature sets or combined with different traditional scoring methods. The trained RF classifier outputs the probability of a decoy being non-native; the lower the RF score of a decoy, the more likely it is to be near-native according to the RF classifier.
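The paper does not specify the implementation; below is a minimal scikit-learn sketch of such a grid search (the generic cv=5 folds stand in for the case-wise protocol used in this work):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid from Section 2.5.1: ntrees in 10..500 (step 10), mtry in 1..28 (step 3).
param_grid = {
    "n_estimators": list(range(10, 501, 10)),  # ntrees
    "max_features": list(range(1, 29, 3)),     # mtry
}

def fit_rf(X_train, y_train):
    """X_train: (n_decoys, n_features); y_train: 1 = non-native, 0 = near-native."""
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    return search.best_estimator_

def rf_score(model, X):
    """RF score of each decoy: predicted probability of being non-native,
    so lower scores indicate more likely near-native decoys."""
    return model.predict_proba(X)[:, 1]
```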
2.5.2. The Min-Max normalization within each case
Before combining the scores from different scoring functions, including the RF score, we normalized the scores of the decoys of each case from each scoring function using Min-Max normalization. Min-Max normalization scales a list of data to the range 0 to 1: the minimum value in the data is mapped to 0 and the maximum to 1. The strength of this method is that all relationships among the data values are preserved exactly and no potential bias is introduced into the data[62]. However, Min-Max normalization is vulnerable to outliers in the original data, e.g., scores of decoys that have clashes; the resulting normalized values can fluctuate when outliers are present. Before applying Min-Max normalization, we therefore defined values that fall more than two standard deviations from the mean of the data (here, the scores of the decoys of a case from one scoring method) as outliers. Outliers on the upper side of the data were assigned a normalized value of 1 and those on the lower side a value of 0. We then applied Min-Max normalization to the remaining data.
A normalized value $z$ for a score $x$ in the set $X$ of decoy scores for a case is calculated as

$$z = \frac{x - \min(X)}{\max(X) - \min(X)}$$

where $\min(X)$ and $\max(X)$ are the minimum and maximum values of $X$ after excluding outliers.
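A short Python sketch of this outlier-aware Min-Max normalization, as we read the procedure above (our own illustration):

```python
import numpy as np

def minmax_normalize(scores, n_std=2.0):
    """Min-Max normalize one case's decoy scores to [0, 1]. Values more
    than n_std standard deviations from the mean are treated as outliers:
    upper outliers are clamped to 1, lower outliers to 0, and the rest are
    scaled using the min/max of the inlier range."""
    scores = np.asarray(scores, dtype=float)
    mean, std = scores.mean(), scores.std()
    inliers = scores[np.abs(scores - mean) <= n_std * std]
    lo, hi = inliers.min(), inliers.max()
    if hi == lo:                       # degenerate case: all inliers equal
        return np.zeros_like(scores)
    z = (scores - lo) / (hi - lo)      # z = (x - min(X)) / (max(X) - min(X))
    return np.clip(z, 0.0, 1.0)       # force outliers to 0 or 1
```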
2.5.3. The final score of MetaScore
The final score is obtained by simply averaging the normalized scores of a decoy from the different scoring methods.
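As a one-line sketch (hypothetical names):

```python
def metascore(rf_norm, sf_norm):
    """Final MetaScore of a decoy: the mean of its Min-Max-normalized RF
    score and its Min-Max-normalized traditional scoring-function score."""
    return 0.5 * (rf_norm + sf_norm)
```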
2.6. Features of MetaScore
We used seven types of features to encode protein-protein interfaces, each of which has been shown to be useful for characterizing properties of protein-protein interface residues[63, 64]. We extracted the following features for the binding site formed by the interacting partners in each decoy: i) Raw and normalized scores from each scoring function (Score features), ii) Evolutionary features, iii) Interaction propensity based features (Statistical features), iv) Hydrophobicity (Physicochemical feature), v) Energy-based features, vi) Geometric features, and vii) Connectivity features (see below for detail). A decoy is represented by a feature vector formed by its corresponding features.
2.6.1. Raw and normalized scores from each scoring function (Score features)
We included both the raw scores and the normalized scores from each scoring function as part of the MetaScore features, called the Score features. Because different methods produce scores in different ranges, and even the scores assigned by a single method to decoys from different docking cases are in general incomparable, the scores need to be normalized; we applied the Min-Max normalization method to the scores of the decoys in each case for each scoring method. In contrast to the normalized scores, the raw scores from a classical scoring function carry additional valuable information, such as the size of the interface region[65] and the scoring function's own way of combining its multiple binding-related features. Raw and normalized scores are therefore expected to complement each other when training a model, and we decided to use both.
2.6.2. Evolutionary features
Binding sites tend to be highly conserved across species[64, 66, 67]. A scoring function that ranks decoys based on the degree to which their binding sites match the known or predicted binding sites of the target complex tends to place near-native conformations above non-native ones[52, 68]. Therefore, evolutionary conservation scores of the interfacial residues in the binding sites are expected to contribute to classifying decoys as near-native or non-native.
We used Position-Specific Scoring Matrix Information Contents (PSSM-ICs) of interfacial residues as conservation scores. PSSM-IC measures the information content of a residue position in a PSSM based on Shannon's uncertainty, using the prior residue probability and the relative frequency of the residue at that position in the protein sequence[69]. The higher the PSSM-IC of a residue, the more conserved the residue is. The PSSM-ICs are calculated from the multiple sequence alignments produced by PSI-BLAST[70]. We ran PSI-BLAST from BLAST 2.7.1+ against the NCBI nr database (as of February 4, 2018) to retrieve the sequence homologs of each protein sequence, using 3 iterations with an e-value cutoff of 0.0001. Based on the length of the protein sequence, we automatically set "query length-specific" parameters, e.g., the BLAST substitution matrix, word size, gap open cost and gap extend cost, according to the guidelines in the NCBI BLAST user manual (https://www.ncbi.nlm.nih.gov/books/NBK279684/) (see Supplementary Table S1). We collected the PSSM-ICs of the interfacial residues of each decoy and aggregated them into three types of representative values: the average, minimum, and maximum of the PSSM-ICs for each of the two proteins and for both combined. In total, 9 features were generated.
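Assuming per-residue PSSM-IC values have already been computed, the aggregation step could look like this minimal sketch (our own illustration):

```python
import numpy as np

def pssm_ic_features(ic_a, ic_b):
    """Aggregate the PSSM-ICs of interfacial residues into the 9
    evolutionary features: average, minimum, and maximum over chain A,
    over chain B, and over both chains combined.

    ic_a, ic_b: 1-D arrays of PSSM-IC values for the interfacial residues
    of proteins A and B."""
    feats = []
    for vals in (ic_a, ic_b, np.concatenate([ic_a, ic_b])):
        feats += [float(np.mean(vals)), float(np.min(vals)), float(np.max(vals))]
    return feats  # 9 values
```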
2.6.3. Interaction propensity-based features (Statistical features)
Previous studies[30, 31, 71–73] have shown that pair-wise amino acid interaction propensities provide useful information about the interaction patterns of amino acids in complexes. We utilized the interaction propensities of amino acid pairs in the interfacial regions of protein-protein complexes precomputed by InterEvScore[41]; the pre-calculated values can be found in a supplementary table of the InterEvScore paper[41]. The interaction propensity of residue types x and y, IP(x, y), is defined as the ratio between the observed frequency of the pair in protein-protein complexes and its expected frequency, i.e., the random probability of picking the interaction pair (x, y).
Also, we assumed that interaction propensities weighted by conservation scores and/or by the distances between interfacial residue pairs could be promising features, since they fold evolutionary closeness and geometric tightness into the interaction propensity. We therefore generated two additional interaction propensity-based features, weighted by conservation scores only (IPPSSM) and by both conservation scores and inter-residue distances (IPPSSM,Dist). For each interfacial residue pair (x, y) in a decoy consisting of proteins A and B, IPPSSM and IPPSSM,Dist are defined as

$$\mathrm{IP}_{\mathrm{PSSM}}(x, y) = \sum_{m=1}^{20}\sum_{n=1}^{20} \mathrm{PSSM}_A(x, m)\,\mathrm{PSSM}_B(y, n)\,\mathrm{IP}(m, n)$$

$$\mathrm{IP}_{\mathrm{PSSM,Dist}}(x, y) = \frac{\mathrm{IP}_{\mathrm{PSSM}}(x, y)}{\mathrm{Dist}(x, y)}$$

where Dist(x, y) is the distance between residue x in protein A and residue y in protein B, IP(m, n) is the interaction propensity value provided by InterEvScore for amino acid types m and n, and PSSM_A(x, m) is the position-specific score for the m-th amino acid in the 20-element PSSM vector of interfacial residue x in protein A. All PSSM values were normalized by the sigmoid function.
Because interface sizes vary across decoys, we summarized the list of values for each type of interaction propensity-based score (IP, IPPSSM, and IPPSSM,Dist) over all interfacial residue pairs in a decoy by their sum and average, which results in 6 features.
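A sketch of the three propensity scores and their summary, following the reconstructed definitions above (our own illustration; ip_table stands for the 20x20 InterEvScore propensity table):

```python
import numpy as np

def propensity_features(pairs, ip_table, pssm_a, pssm_b, dists):
    """Compute IP, IP_PSSM and IP_PSSM,Dist for every interfacial residue
    pair and summarize each list by its sum and average (6 features).

    pairs:    list of (i, j, aa_i, aa_j): residue indices in chains A/B and
              their amino acid type indices (0..19).
    ip_table: (20, 20) interaction propensity table (from InterEvScore).
    pssm_a:   (len_A, 20) sigmoid-normalized PSSM of protein A (pssm_b likewise).
    dists:    dict mapping (i, j) to the distance of that residue pair."""
    ip, ip_pssm, ip_pssm_dist = [], [], []
    for i, j, aa_i, aa_j in pairs:
        ip.append(ip_table[aa_i, aa_j])
        # Double sum over amino acid types, weighted by the two PSSM rows.
        w = float(pssm_a[i] @ ip_table @ pssm_b[j])
        ip_pssm.append(w)
        ip_pssm_dist.append(w / dists[(i, j)])
    feats = []
    for vals in (ip, ip_pssm, ip_pssm_dist):
        feats += [float(np.sum(vals)), float(np.mean(vals))]
    return feats  # 6 values
```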
2.6.4. Hydrophobicity (Physicochemical feature)
Macromolecules' physicochemical properties play important roles in the forces of attraction and repulsion between them. Among these properties, hydrophobicity has been widely used not only in scoring docked conformations but also in predicting binding sites[74–77]. Additionally, the role of hydrophobicity in protein folding/unfolding and interactions is well known[78–80]. We assigned hydrophobicity values of amino acids from the AAindex[81] database to all interfacial residues of both proteins in a decoy and averaged them to obtain a single feature.
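A minimal sketch of this feature, using the Kyte-Doolittle scale as an example AAindex entry (the specific scale is our assumption; the paper only cites AAindex):

```python
# Kyte-Doolittle hydropathy scale (AAindex entry KYTJ820101), shown purely
# as an example; the paper does not state which AAindex scale it used.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def hydrophobicity_feature(interface_residues):
    """Average hydrophobicity over all interfacial residues of both chains
    (residues given as one-letter amino acid codes)."""
    return sum(KD[aa] for aa in interface_residues) / len(interface_residues)
```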
2.6.5. Energy-based features
We used the van der Waals, electrostatic, and empirical desolvation energies calculated by HADDOCK for a decoy[82]. We adopted both normalized and raw values of these energy-based features. Using only raw values for training the RF model would be unfair because the values assigned to decoys from different docking cases are incomparable. On the other hand, using only normalized values can lose valuable information, such as the size of the interface and the absolute net energy of the interface of each decoy. For each normalized energy feature, we applied the same Min-Max normalization method.
2.6.6. Geometric features
2.6.6.1. Shortest distances of interfacial residue pairs
We assumed that a near-native decoy should be a tightly bound form of the two proteins, and that decoys in which the two proteins form a tight complex would have short and uniform distances between the interfacial residues of the two proteins. Hence, we used the shortest distances of interfacial residue pairs as features reflecting the principle of shape complementarity. We computed the distances between the alpha-carbon atoms of all interfacial residue pairs in a decoy and selected the top 10 shortest distances. The lower these values, the more compact the decoy.
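A minimal sketch of this feature (our own illustration):

```python
import numpy as np

def shortest_distance_features(ca_a, ca_b, k=10):
    """Return the k shortest CA-CA distances between the two chains, sorted
    ascending; lower values indicate a more compact interface.

    ca_a, ca_b: (n, 3) and (m, 3) arrays of alpha-carbon coordinates."""
    dists = np.linalg.norm(ca_a[:, None, :] - ca_b[None, :, :], axis=-1)
    return np.sort(dists.ravel())[:k]
```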
2.6.6.2. Convexity-to-concavity ratio
The CX value measures the ratio between the volume occupied by atoms within a sphere of radius 10Å and the volume of empty space in the sphere[83]. It has been widely used in previous studies as a protrusion index[63, 84]. The smaller the CX value, the more the atom and its 10Å neighborhood protrude. We assumed that in a compact decoy, if the alpha-carbon atoms of interfacial residues in one protein protrude, those of their partner interfacial residues in the other protein should be dented, and vice versa. In this light, a higher convexity-to-concavity ratio of the CX values of a pair of interfacial residues indicates that one residue protrudes while the other is dented. With this in mind, we generated a feature, CXratio(x, y), by modifying the equation to compute the ratio of the CX values of the alpha-carbon atoms of each interfacial residue pair (x, y).
Let $I_{A1}, I_{A2}, \ldots, I_{An}$ denote the set of interfacial residues of protein A in a decoy, where $I_{Ai}$ ($1 \le i \le n$) is an interfacial residue and $n$ is the number of interfacial residues in protein A. For each interfacial residue pair $(I_{Ai}, I_{Bj})$ of a decoy consisting of proteins A and B, $\mathrm{CX}_{\mathrm{ratio}}(I_{Ai}, I_{Bj})$ is defined as

$$\mathrm{CX}_{\mathrm{ratio}}(I_{Ai}, I_{Bj}) = \frac{\max(\mathrm{CX}_{Ai}, \mathrm{CX}_{Bj})}{\min(\mathrm{CX}_{Ai}, \mathrm{CX}_{Bj})}$$

where $\mathrm{CX}_{Ai}$ and $\mathrm{CX}_{Bj}$ are the CX values calculated by centering the 10Å sphere on the alpha-carbon atoms of $I_{Ai}$ and $I_{Bj}$, respectively.
CXratio(IAi, IBj) is always greater than or equal to 1. A high value of CXratio(IAi, IBj) indicates that the alpha-carbon atom of one of IAi and IBj protrudes while that of the other is dented; a low value indicates that both alpha-carbon atoms protrude or both are dented. One such value is obtained for each interfacial residue pair in the decoy; we summarize them by their average and standard deviation, yielding two features.
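A sketch of the CX-ratio summary using the max/min form reconstructed above (our own illustration; the CX values are assumed precomputed):

```python
import numpy as np

def cx_ratio_features(pairs, cx_a, cx_b):
    """For each interfacial residue pair, take the ratio of the larger to
    the smaller CX value (always >= 1), then summarize all ratios by their
    average and standard deviation (2 features).

    pairs: list of (i, j) interfacial residue index pairs.
    cx_a, cx_b: CX values of the interfacial alpha-carbons of chains A, B."""
    ratios = [max(cx_a[i], cx_b[j]) / min(cx_a[i], cx_b[j]) for i, j in pairs]
    return float(np.mean(ratios)), float(np.std(ratios))
```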
2.6.6.3. Buried surface area
The buried surface area (BSA)[82] is one of the HADDOCK-derived features. The BSA estimates the size of the interface between the two proteins in a protein-protein complex. It is obtained by calculating the difference between the total solvent-accessible surface area of the two unbound proteins and that of the decoy. We used this value as one of the geometric features for training our model. Because the range of BSA values differs between cases, we normalized them by applying the Min-Max normalization method described above, excluding outliers.
2.6.6.4. Relative accessible surface area
The relative accessible surface area (rASA) of each interfacial residue was calculated from its solvent-accessible surface area, obtained using STRIDE[85], and the reference surface area of its residue type[86]. The average rASA over the interfacial residues was used as a feature for a decoy.
2.6.6.5. Secondary structure
It is well known that particular secondary structures are preferred at protein interfaces[87, 88]. To capture the tendency of protein secondary structures to occur in interface regions, we counted how many times different secondary structures appear among the interfacial residues of a decoy. We used 7 secondary structure categories: alpha helix, 3-10 helix, pi-helix, extended conformation, isolated bridge, turn, and coil. Using STRIDE[85], we counted the occurrences of each secondary structure and normalized each count by the number of interfacial residues. In total, 7 secondary structure features were generated for each decoy.
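Assuming per-residue STRIDE assignments are available, a minimal sketch of these composition features (our own illustration; the code-to-category mapping follows STRIDE's conventions):

```python
from collections import Counter

# STRIDE one-letter codes for the 7 categories: alpha helix, 3-10 helix,
# pi-helix, extended conformation, isolated bridge, turn, coil.
SS_CODES = ["H", "G", "I", "E", "B", "T", "C"]

def ss_features(interface_ss):
    """interface_ss: STRIDE secondary-structure code of each interfacial
    residue. Returns 7 occurrence counts, each normalized by the number of
    interfacial residues."""
    counts = Counter(interface_ss)
    return [counts.get(code, 0) / len(interface_ss) for code in SS_CODES]
```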
2.6.7. Connectivity features
To capture the connectivity of interfacial residues and the size of the interface, we added three features: the number of interfacial residue pairs, the total number of interfacial residues, and the link density. The link density was implemented as defined in Basu et al.[89]: the number of interfacial residue pairs divided by the maximum possible number of links between the interfacial residues of the two proteins.
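A minimal sketch of the three connectivity features (our own illustration, with link density as defined in Basu et al.[89]):

```python
def connectivity_features(pairs):
    """pairs: list of (i, j) interfacial residue index pairs between chains
    A and B. Returns the number of pairs, the total number of interfacial
    residues, and the link density."""
    res_a = {i for i, _ in pairs}
    res_b = {j for _, j in pairs}
    # Link density: observed links divided by the maximum possible number
    # of links between the interfacial residues of the two chains.
    link_density = len(pairs) / (len(res_a) * len(res_b))
    return len(pairs), len(res_a) + len(res_b), link_density
```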
3. Results
3.1. Combination of scores from the RF classifier and scores from HADDOCK can improve the performance of HADDOCK scoring
To test our hypothesis that combining a machine learning model trained on potent interaction features with an existing scoring function can improve the performance of the original scoring function, we first chose HADDOCK as a representative of traditional scoring methods. We compared three scoring methods, HADDOCK, our RF classifier, and our MetaScore approach combining scores from HADDOCK and the RF classifier (MetaScore-HADDOCK), using 10-fold case-wise cross-validation on the training set derived from BM4[59] (BM4 decoy set) and independent tests on the sets of decoys from the newly added BM5 cases[58] (BM5 decoy set) and the CAPRI score set[53]. In 10-fold case-wise cross-validation, the set of cases is randomly partitioned into 10 subsets. All decoys of the cases in one subset are held out as test data and scored by a scoring method trained on the decoys of the cases in the remaining subsets; this is repeated so that each subset serves once as the test data.
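A minimal sketch of case-wise cross-validation using scikit-learn's GroupKFold (our own illustration; the paper does not state which toolkit it used):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold

def casewise_cv_scores(X, y, case_ids, n_splits=10):
    """X: (n_decoys, n_features); y: labels; case_ids: the docking case of
    each decoy. GroupKFold keeps all decoys of a case in the same fold, so
    a model is never tested on a case seen during training."""
    scores = np.empty(len(y))
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, case_ids):
        model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        # RF score: predicted probability of being non-native (lower = better).
        scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return scores
```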
Table 1 and Figs. 1–3 show that MetaScore-HADDOCK performs better than, or at least comparably to, the original method, HADDOCK, on all four performance metrics across all data sets we tested. The RF classifier by itself, however, does not outperform HADDOCK on every data set and evaluation metric. Based on these observations, we conclude that combining the scores of the RF classifier and HADDOCK can improve scoring performance.
3.2. Evaluation of feature importance
To train our RF classifier, we used various types of features of protein-protein interfaces that describe the interaction characteristics between a pair of proteins. We evaluated their impact on the performance of the RF classifier using 10-fold case-wise cross-validation, excluding each of the seven feature types in turn (Table 2).
For the RF classifier, we found that the ASRs for the top 10 and top 400 predictions decreased whenever a feature type was removed. Based on the ASR for the top 10 predictions, the more relevant evaluation metric for scoring methods, all feature types contribute to the performance of the RF classifier. Among them, the Connectivity features contribute the most and the Evolutionary features the least. Although the AHRs for the top 10 and top 400 predictions are not the best for the RF classifier using all features, the differences in AHR across most of the exclusion tests are insignificant considering their standard deviations. We therefore decided to use the RF classifier trained on all features as our machine learning based model.
To see whether the choice of features used to train the machine learning model also affects the performance of MetaScore-HADDOCK, we evaluated MetaScore-HADDOCK while excluding each feature type in turn when training the RF classifier. Table 2 shows that changing the feature combination has relatively little impact on MetaScore-HADDOCK compared to the RF classifier alone: the standard deviations of the four performance measures for MetaScore-HADDOCK are smaller than those for the RF classifier. Based on these results, we conjecture that combining the scores of the two different scoring methods, the RF classifier and HADDOCK, helps buffer the performance against changes in the feature subset used by the RF classifier. Although MetaScore-HADDOCK using all features does not show the best performance, we chose it as the final model because 1) the difference between the best-performing variant, MetaScore-HADDOCK trained without Evolutionary features, and MetaScore-HADDOCK using all features is not statistically significant within one standard deviation, and 2) the RF classifier trained on all features has the best ASR, the more relevant evaluation metric for scoring functions, and we conjecture that a better-performing RF classifier has a higher chance of yielding a better MetaScore.
3.3. Combination of RF classifier scores and scores from other scoring methods can improve the performance of each method
To test whether the MetaScore approach is applicable to methods other than HADDOCK, we repeated the same procedure with 9 previously published scoring functions: iScore[52], DFIRE[26], DFIRE2[27], MJ3H[33], PISA[28], pyDOCK[22], SIPPER[32], SWARMDOCK[29], and TOBI's method[61]. We obtained scores from these 9 methods for the decoys in our two data sets, the BM4 decoy set and the BM5 decoy set. For each scoring method, we replaced the normalized and raw HADDOCK scores with the normalized and raw scores of the respective method and retrained our model with each set of scores. The resulting combined methods are called MetaScore-iScore, MetaScore-DFIRE, MetaScore-DFIRE2, MetaScore-MJ3H, MetaScore-PISA, MetaScore-pyDOCK, MetaScore-SIPPER, MetaScore-SWARMDOCK, and MetaScore-TOBI, respectively.
The results in Table 3 show that, for most original scoring methods, our MetaScore approach improves their performance in terms of ASR and AHR evaluated over the top 10 ranked decoys, both on the BM4 decoy set (our training set, using 10-fold case-wise cross-validation) and on the BM5 decoy set (the test set), with the exception of the AHR of DFIRE on the BM5 decoy set. In terms of ASR and AHR evaluated over the top 400 ranked decoys, three methods (iScore, PISA and MJ3H) on the BM4 decoy set and five methods (HADDOCK, DFIRE, DFIRE2, MJ3H, and PISA) on the BM5 decoy set show no improvement under MetaScore; even so, the performances of MetaScore and the original methods remain comparable, with only a marginal decrease (less than 2.56%) in the independent test on the BM5 decoy set (Supplementary Figures S1–18).
These results indicate that our proposed method, MetaScore, using a combination of an RF classifier and an existing original scoring method is likely to improve the performance of the original method.
3.4. Many heads are better than one
Ensembles of multiple predictive models are known to often outperform individual models[90–92]. We incorporated this ensemble approach into MetaScore to obtain MetaScore-Ensemble, which combines several previously published methods: HADDOCK[21], iScore[52], DFIRE[26], DFIRE2[27], MJ3H[33], PISA[28], pyDOCK[22], SIPPER[32], SWARMDOCK[29], and TOBI's method[61]; this set of methods is called the "Expert Committee." To examine how the performance of MetaScore-Ensemble varies with the performance of the ensemble members, we used three scoring method groups (Groups in Table 4): the higher performing group (ExpertsHigh), the lower performing group (ExpertsLow), and all members of the Expert Committee (Experts). ExpertsHigh and ExpertsLow were chosen based on the ASR and AHR for the top 10 predictions obtained by 10-fold case-wise cross-validation on the BM4 decoy set, our training set. ExpertsHigh consists of HADDOCK, iScore, DFIRE, MJ3H, and PISA, and ExpertsLow consists of the others. In addition, we used five ways of aggregating multiple scores (Approaches in Table 4) to examine the effect of the combination scheme on MetaScore-Ensemble for each scoring method group (a sketch of two of these schemes follows the list below):
RF(Group), which is the RF classifier trained using only the raw scores and the normalized scores of members in a scoring method group (Group),
RF(Group + Features), which is the RF classifier trained using our feature set of the protein-protein interfaces including the raw scores and the normalized scores of members in a Group,
Avg(Group), which is a method averaging the normalized scores of members in a Group,
Semi-MetaScore-Group, which is a method combining the score from the RF classifier trained using only the raw scores and the normalized scores of members in a Group with the averaged score of the normalized scores of members in the Group,
MetaScore-Group, which combines the score from the RF classifier trained using our feature set of the protein-protein interfaces, including the raw scores and the normalized scores of members in a Group, with the averaged score of the normalized scores of members in the Group.
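To make two of these schemes concrete, a minimal sketch of Avg(Group) and MetaScore-Group (our own illustration; names are hypothetical):

```python
import numpy as np

def avg_group(norm_scores):
    """Avg(Group): mean of the Min-Max-normalized scores of the group
    members. norm_scores: (n_decoys, n_members) array."""
    return norm_scores.mean(axis=1)

def metascore_group(rf_norm, norm_scores):
    """MetaScore-Group: average the normalized score of the RF classifier
    (trained on interface features plus the members' raw and normalized
    scores) with the members' average normalized score."""
    return 0.5 * (rf_norm + avg_group(norm_scores))
```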
We tested fifteen MetaScore-Ensemble methods in total, using the combinations of the three Groups and five Approaches (Table 4). For example, MetaScore-ExpertsHigh denotes the MetaScore-Ensemble method that combines the score of the RF classifier, trained using the interaction features extracted from the protein-protein interfaces together with the raw and normalized scores of the members of the ExpertsHigh Group, with the averaged normalized score of the members of the ExpertsHigh Group.
The comparison results in terms of ASR and AHR on our independent test set, the BM5 decoy set, are shown in Table 5; the corresponding success rate and hit rate curves are shown in Fig. 4. Most of the MetaScore-Ensemble methods perform better than the other scoring functions, including the single traditional methods and the MetaScore variants, and MetaScore-Experts, the MetaScore-Ensemble method applying the MetaScore-Group Approach to the Experts Group, performs best in both ASR and AHR for the top 10 predictions.
Moreover, Avg(Group) applied to the three Groups (ExpertsHigh, ExpertsLow, and Experts) outperforms each member of each group. Regardless of which Group is used, Avg(Group) is in turn outperformed by RF(Group + Features), Semi-MetaScore-Group, and MetaScore-Group. MetaScore-Group outperforms not only Semi-MetaScore-Group for every Group but also each of the MetaScore variants built on the individual members of the corresponding Group. In addition, RF(Group + Features), which incorporates the interface features when training the RF classifier, outperforms RF(Group), which does not. Taken together, we conclude that combination methods using any of the Approaches we tested except RF(Group) outperform the individual methods, and that a machine learning model trained with additional features of the interfacial regions outperforms both a simple averaging method and a machine learning model that does not use interfacial features of the decoys.
Additionally, regardless of which of the five Approaches is used, Approaches using the ExpertsHigh Group outperform those using the ExpertsLow Group. For the Avg(Group), Semi-MetaScore-Group and MetaScore-Group Approaches, using the Experts Group outperforms using either the ExpertsHigh or the ExpertsLow Group. Except for RF(Group) and Avg(Group), the Approaches using all members of the Experts Group ranked among the top 5 methods in Table 5. As expected, MetaScore-Ensemble methods that use better performing members outperform those that use lower performing members, and MetaScore-Ensemble methods using more members can perform better than those using fewer members, except for the MetaScore-Ensemble methods that use only an RF classifier.
4. Discussion
We have proposed a new approach, MetaScore, to rank docking models. The approach takes advantage of a machine learning-based classifier, trained with widely used interaction features of interfacial regions, to distinguish near-native conformations from non-native decoys. We re-score the given models by simply averaging the score from the RF classifier with the score from a traditional scoring method. By testing our approach on previously published scoring methods, we showed that the performance of the traditional scoring methods is improved.
When combining scores from two scoring methods, three scenarios can occur. First, if both scoring methods rank decoys well, combining their scores is likely to improve the ranking further. Second, when the scores of a well-performing and a poorly performing scoring method are combined, incorrect ranking positions assigned by the latter can be shifted toward correct positions by the former. Third, if both scores order the decoys incorrectly, the combined score remains incorrect. Of these three cases, the first two are beneficial. We therefore conclude that the main improvement of our approach comes from this synergistic/complementary effect.
Still, there is room for improvement of MetaScore from multiple perspectives. (1) Better performing machine learning-based classifiers could help MetaScore perform better. Better classifiers might be trained using better combinations of machine learning algorithms and/or feature sets. Moreover, one of the attractive capabilities of machine learning algorithms is that they can manage a growing set of training data efficiently. Even if the training set contains low-quality data, several algorithms are able to handle the associated noise. Also, because a larger training set tends to improve the predictive power of a model, MetaScore is expected to evolve easily with the increasing amount of data from a variety of docking software packages. (2) More effective combination methods could bring further improvement. We showed here that even the simple averaging of the scores of the RF classifier and a traditional scoring method in MetaScore improved the performance over the traditional scoring method. More sophisticated combination methods, such as a linear combination with weighted terms, might further improve the results. Various combination methods developed in other research fields could be applied to the scoring problem for protein-protein docking[90, 92].
5. Conclusions
We have shown that MetaScore, a strategy combining an RF classifier with an original scoring method, improves upon the original scoring method. We conducted experiments 1) to establish the feature sets for training the RF classifier, 2) to confirm that MetaScore can improve the performance of the original scoring method, and 3) to see whether applying the MetaScore strategy to a group of several published scoring methods can lead to significant improvement. Our results highlight that MetaScore consistently outperforms each of the traditional scoring functions we tested, and that the consensus model built by MetaScore-Ensemble performs better than not only each of the original scoring methods but also MetaScore combined with any single method, in terms of success rate and hit rate evaluated over the conformations ranked among the top 10 predictions. We believe our approach will be useful not only to boost the performance of an existing single scoring method but also to build a powerful scoring method by applying our strategy to a group of best performing scoring methods.
CRediT authorship contribution statement
Yong Jung: Investigation, Methodology, Formal analysis, Validation, Visualization, Writing - original draft. Cunliang Geng: Investigation, Data curation, Writing - review & editing. Alexandre M. J. J. Bonvin: Funding acquisition, Resources, Writing - review & editing. Li C. Xue: Funding acquisition, Supervision, Writing - review & editing. Vasant G. Honavar: Funding acquisition, Resources, Project administration, Writing - review & editing, Supervision.
Declarations of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
Y.J. was supported in part by a research assistantship funded by the Center for Big Data Analytics and Discovery Informatics at Pennsylvania State University. The work of V.H. was supported in part by the National Center for Advancing Translational Sciences, National Institutes of Health, through grants UL1 TR000127 and TR002014; by the National Science Foundation through grants 1518732, 1640834 and 1636795; by the Pennsylvania State University's Institute for Cyberscience and the Center for Big Data Analytics and Discovery Informatics; by the Edward Frymoyer Endowed Professorship in Information Sciences and Technology at Pennsylvania State University; and by the Sudha Murty Distinguished Visiting Chair in Neurocomputing and Data Science funded by the Pratiksha Trust at the Indian Institute of Science. This work was also supported in part by the European H2020 e-Infrastructure grant BioExcel (grant nos. 675728 and 823830) (A.M.J.J.B.). Financial support from the Netherlands Organisation for Scientific Research through an Accelerating Scientific Discovery (ASDI) grant from the Netherlands eScience Center (grant no. 027016G04) (L.X. and A.M.J.J.B.) and a Veni grant (grant no. 722.014.005) (L.X.) is acknowledged.
Footnotes
1. Even if a case has no decoy of acceptable quality, its bound structure is still used for training; such a case, however, cannot be used for evaluating the scoring method.