Loss functions matter: on optimizing score functions for the estimation of protein model accuracy

Motivation: Methods for protein structure prediction (PSP) generate multiple alternative structural models (aka decoys). Thus, supervised learning methods for the evaluation and ranking of these models are crucial elements of PSP. Supervised learning involves the optimization of loss functions, but their influence on performance is typically overlooked. Here we put the loss functions in the spotlight and study their effect on prediction performance.
Results: We report the performances of three variants of MESHI-score, a supervised learning method for the estimation of model accuracy (EMA). Each variant was trained with a different loss function and showed better performance in different aspects of the EMA problem. Most importantly, better discrimination between models of the same target is gained by target-centered loss functions.
Availability: All data is available at http://meshi1.cs.bgu.ac.il/SidiAndKeasar2018Data_download/. The MESHI-package (version 9.412) is available at https://github.com/meshiprot/meshi/releases.
Contact: chen.keasar@gmail.com


Introduction
Knowledge of protein structures is essential to the understanding of biological processes, and to the design of experimental work in molecular and structural biology. The ultimate sources of this knowledge are experimental methods for structure determination, most notably X-ray crystallography, NMR, and electron microscopy. These methods, however, are expensive and time consuming, which necessitates complementary computational protein structure prediction (PSP) methods.
PSP methods typically generate many alternative structures, henceforth referred to as decoys (Gront et al., 2012; Jaroszewski et al., 2011; Keasar and Levitt, 2003; Park and Levitt, 1996; Rohl et al., 2004; Söding et al., 2005). Thus, in order to provide the most promising decoys, PSP methods must include some mechanism for the estimation of model accuracy (EMA; aka quality assessment, QA). EMA was recognized as a key aspect of PSP early on (Sippl, 1993), and is still the focus of much research (Elofsson et al., 2017). Since 2008, a dedicated track in the CASP series of PSP experiments measures progress in EMA, and the last rounds have witnessed quite a few impressive results (Kryshtafovych et al., 2017). Yet, the CASP results also indicate that the problem is still open, and none of the current methods is able to consistently pick the best decoy, or accurately predict decoy qualities.
From a methodological perspective, current EMA methods fall into three broad categories: knowledge-based potentials that provide statistical inference from known native structures of proteins (Olechnovič and Venclovas, 2017); unsupervised learning methods that seek consensus within groups of decoys, implicitly assuming random distribution around the native state (McGuffin, 2008; Skwark and Elofsson, 2013; Wallner and Elofsson, 2005); and supervised methods that train on sets of quality-labeled decoys. To this end, supervised methods represent either protein residues or whole proteins as feature vectors and optimize a statistical model to predict protein quality (Manavalan et al., 2014; Mirzaei et al., 2016; Ray et al., 2012).
The supervised approach is the most general one, as any of the other quality estimates may serve as features. Further, it can re-train the statistical model when new sets of labeled decoys become available (e.g., at the end of a CASP experiment). Unfortunately, the adaptation of supervised learning to the EMA problem is non-trivial. Typical supervised learning techniques assume that the data are (at least approximately) independent and identically distributed (iid). EMA techniques cannot assume that, since the EMA problem is inherently group-centered. The estimation of decoy accuracies is meaningful only in the context of their target. Specifically, biologists studying a particular protein of unknown structure aim to select the best among decoys of that protein. All these decoys share the same sequence, which distinguishes them from decoys of other proteins (Figure 1). This deviation from iid distribution of both training and testing data of EMA cannot be ignored. A naïve approach of pooling all data and blindly applying off-the-shelf tools results in severe overfitting (data not shown). Indeed, all contemporary methods address this challenge by splitting the data into target-disjoint training and test sets, which is compatible with the natural partition of solved and unsolved proteins. The developers of the ProQ series of methods (Ray et al., 2012; Uziela et al., 2016) took two steps further: (a) they reduced biases in their training database by randomly choosing the same number of decoys from each target; and (b) they smoothed the composition differences between targets by applying their machine learning at the residue level, which may be closer to iid.
A major implication of the deviation from iid in EMA is the incompatibility between minimizing the error of accuracy predictions over the entire dataset and per-target measures such as loss, the price of missing the best decoy (see below). The two coincide only at the apparently unreachable limit of zero error. Yet, methods that use generic supervised learning tools can optimize per-target performance only indirectly, by minimizing global error. Qiu and co-workers addressed this difficulty by associating each target of the training set with a binary feature (Larsson et al., 2008). This allowed an arbitrary uniform shift in all the decoy scores of a target during training. Test targets were not associated with such features, resulting in a score that only relates to relative quality. We are not aware of later studies that followed this interesting approach. MESHI-Score (Elofsson et al., 2017; Keasar et al., 2018; Mirzaei et al., 2016) is a supervised learning method for EMA that trains its statistical model by Monte-Carlo Simulated Annealing (MCSA) optimization (Brooks and Morgan, 1995; Kirkpatrick, 1984; Metropolis and Ulam, 1949). MCSA is a very flexible computational approach and is particularly permissive regarding the nature of the loss function that guides the optimization. In the context of EMA, it allows either global loss functions, which consider the entire set of decoys, or target-based loss functions, which explicitly account for the group-centric nature of EMA.
Aiming at both absolute and relative accuracy prediction, earlier versions of MESHI-Score used loss functions that combined a penalty on the per-target mean error with a reward for higher Pearson correlation. The utility of the resulting score functions has been demonstrated in the CASP11 and CASP12 experiments (Elofsson et al., 2017; Keasar et al., 2018). The current study relaxes the original goal and seeks specialized score functions for either absolute or relative accuracy. Here we report the performances of three MESHI-Score versions. One of them was trained with a global loss function and provides the best absolute accuracy estimates. The other two were trained with target-based loss functions and provide the best relative accuracy estimates. For the sake of completeness, we also demonstrate that MESHI-Score is comparable to the state-of-the-art EMA method ProQ3.

General
MESHI-Score is an ensemble learning technique for EMA, which aims at predicting the quality of decoy structures (measured by gdt_ts; Zemla, 2003). The overall pipeline of MESHI-score was presented in previous publications (Elofsson et al., 2017; Mirzaei et al., 2016). In a nutshell, it includes four consecutive steps (Figure 2).

Database
We extracted and curated a new non-redundant database of decoys from the CASP repository (http://predictioncenter.org), slightly modifying our previous procedure by removing the older CASP8 targets, adding newer CASP12 targets, and adding a repacking step (see below). The database includes 305 full-chain targets from CASP9-12, with a total of 73,605 decoys. Redundant targets, which are homologous to more recent ones, were removed. To this end, we applied an all-against-all BLAST (Johnson et al., 2008) comparison to all targets. We considered targets as homologous if either (a) E-value < 10E-6; or (b) E-value is between 10E-6 and 10E-4 and their structures are similar by visual inspection. A standardization procedure reduces structural noise by side-chain repacking with scwrl4 (Krivov et al., 2009), followed by tethered energy minimization (average absolute gdt_ts difference of 0.007), using the OPTIMIZE program of the MESHI-package (version 9.412).

At the end of the tethered minimization step, the OPTIMIZE program calculates a vector of 129 features that characterize the decoy. The set of feature vectors constitutes the feature database used to train and test MESHI-score. These features include: (1) pairwise potentials adopted from the literature: GOAP (Zhou and Skolnick, 2011), KB-PMF (Summa and Levitt, 2007), ramp (Samudrala and Moult, 1998), and the scwrl4 energy; (2) in-house developed structural features: torsion angle terms (Amir et al., 2008), quadratic bond and angle terms, radius of gyration, atom environment, and hydrogen bonds (Levy-Moonshine et al., 2009); and (3) compatibility with sequence-based predictions of secondary structure and solvent accessibility. These predictions were performed using PSI-PRED (McGuffin et al., 2000) and DeepCNF (Wang et al., 2016) with uniref90 (last updated on 30/7/2017) (Suzek et al., 2015).
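The redundancy-filtering rule above can be summarized in a few lines of code. The following is only a minimal sketch: the helper structures_similar is a hypothetical stand-in for the manual visual inspection step, and the E-value thresholds are taken from the text (interpreting 10E-6 and 10E-4 as 1e-6 and 1e-4).

```python
def is_homologous(evalue, target_a, target_b, structures_similar):
    """Sketch of the redundancy rule described above.

    evalue             -- E-value of the best BLAST hit between the two target sequences
    structures_similar -- hypothetical callable standing in for the visual structural comparison
    """
    if evalue < 1e-6:                       # (a) clearly homologous by sequence alone
        return True
    if 1e-6 <= evalue < 1e-4:               # (b) borderline: defer to structural similarity
        return structures_similar(target_a, target_b)
    return False
```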

Performance measures
All performance measures presented in this manuscript relate to the ability of EMA scores to reproduce decoy accuracies as represented by gdt_ts. Gdt_ts values range between zero (unrelated structures) and one (very similar). Specifically, we use several per-target measures (Mirzaei et al., 2016), and one global measure (i.e., applied to all the database decoys):
(1) MRMSE - Median per-target root mean square of the prediction error, which is the absolute difference between a decoy's predicted quality and its gdt_ts score.
(2) MC_P and MC_S - Median per-target Pearson and Spearman correlations, respectively, between the predicted and observed qualities.
(3) ML - Median per-target loss, the absolute difference between the gdt_ts scores of the top-scoring decoy, which is picked by the method, and the actual best decoy in the group.
(4) ME10 and ME5 - Median per-target enrichment of the 10% and 5% highest-accuracy decoys within the 10% and 5% top-scoring ones, respectively. Any ME value above one entails better-than-random performance, and an ideal predictor reaches ME10 and ME5 values of up to 10 and 20, respectively.
(5) GC - Global Spearman correlation between observed and predicted qualities of all decoys.
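As a concrete illustration of the per-target measures listed above, the sketch below computes them for a single target with numpy/scipy; the medians over all targets then give MRMSE, MC_P, MC_S, ML, and the ME values. The exact enrichment bookkeeping used by MESHI-score may differ; here it is assumed to be the fraction of truly top decoys among the top-scoring ones, normalized by the random expectation, so the ideal values are 10 and 20 for ME10 and ME5, as stated above.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def per_target_measures(gdt, pred, top_frac=0.10):
    """Per-target performance measures for one group of decoys (illustration only)."""
    gdt, pred = np.asarray(gdt, float), np.asarray(pred, float)
    rmse = np.sqrt(np.mean((pred - gdt) ** 2))          # root mean square of the prediction error
    r_p = pearsonr(pred, gdt)[0]                        # Pearson correlation
    r_s = spearmanr(pred, gdt)[0]                       # Spearman correlation
    loss = gdt.max() - gdt[np.argmax(pred)]             # price of missing the best decoy
    k = max(1, int(round(top_frac * len(gdt))))         # size of the top 10% (or 5%) slice
    top_true = set(np.argsort(-gdt)[:k])                # truly most accurate decoys
    top_pred = set(np.argsort(-pred)[:k])               # top-scoring decoys
    enrichment = (len(top_true & top_pred) / k) / top_frac
    return {"rmse": rmse, "pearson": r_p, "spearman": r_s,
            "loss": loss, "enrichment": enrichment}
```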

Cross-validation and statistical analysis
To compare the performances of different scores, we apply five rounds of 5-fold cross-validation. In each round the dataset is randomly partitioned into five subsets, each of which serves as the test set once. Overall we get 5X5=25 values per performance measure, and the medians of these values are reported.
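A minimal sketch of this protocol, assuming the partition is done by target so that decoys of a test target never appear in training (the target-disjoint setting discussed in the Introduction):

```python
import numpy as np

def five_by_five_cv(targets, seed=0):
    """Yield (train_targets, test_targets) pairs for five rounds of 5-fold cross-validation."""
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets)
    for _ in range(5):                          # five independent rounds
        folds = np.array_split(rng.permutation(len(targets)), 5)
        for i, test_idx in enumerate(folds):    # each fold serves as the test set once
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            yield targets[train_idx], targets[test_idx]
```

This produces the 25 train/test pairs over which the reported medians are computed.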
While the essence of this manuscript is the comparison of different versions of MESHI-score, we also calculated the EMA scores of the state-of-the-art method ProQ3 (Uziela et al., 2016). ProQ3 is publicly available, and we downloaded and ran it using default parameters. Note, however, that the ProQ3 values were not generated within the cross-validation scheme, and thus their performance is not directly comparable to those of MESHI-score.

Individual predictors
The individual predictors are the only components in MESHI-Score that undergo training. A single predictor is a non-linear combination of features (nlfc, Eq. 1). To ensure diversity of the predictors, a key ingredient of ensemble learning (Brown et al., 2005; Sollich and Krogh, 1996), each predictor is associated with a fixed sigmoid function S (Eq. 2) and is trained to predict the sigmoid transform of the decoys' gdt_ts. Thus, each predictor learns to predict a slightly different objective. To provide a meaningful prediction, the predictor applies the inverse function S^-1. The weight of the prediction, W (Eq. 4), indicates the sensitivity of the predictor near the predicted value. The sigmoid parameters α and β, such that 1 ≤ α ≤ 9 and 0 ≤ β ≤ 1, are randomly sampled with even probability.
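The exact forms of Eqs. 1-4 are given in Mirzaei et al. (2016); the sketch below only illustrates the transform-and-invert idea. The logistic form of S, the linear stand-in for the non-linear feature combination, and the slope-based weight are assumptions made for the sake of the example, not the published equations.

```python
import numpy as np

class SigmoidPredictor:
    """Illustration of a predictor that learns a sigmoid-transformed gdt_ts (not the published Eqs. 1-4)."""

    def __init__(self, alpha, beta, weights):
        self.alpha, self.beta = alpha, beta     # fixed, randomly sampled sigmoid parameters
        self.weights = np.asarray(weights)      # learned feature coefficients

    def S(self, x):                             # assumed logistic sigmoid (stand-in for Eq. 2)
        return 1.0 / (1.0 + np.exp(-self.alpha * (x - self.beta)))

    def S_inv(self, y):                         # inverse transform, back to the gdt_ts scale
        y = np.clip(y, 1e-6, 1.0 - 1e-6)
        return self.beta + np.log(y / (1.0 - y)) / self.alpha

    def predict(self, features):
        y = float(self.weights @ np.asarray(features))   # linear stand-in for the nlfc of Eq. 1
        prediction = self.S_inv(y)
        # weight: slope of S at the prediction, a simple proxy for the sensitivity mentioned above
        weight = self.alpha * self.S(prediction) * (1.0 - self.S(prediction))
        return prediction, weight
```

For instance, α and β could be drawn as np.random.uniform(1, 9) and np.random.uniform(0, 1), matching the ranges stated above, so that each predictor in the ensemble sees a slightly different objective.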

Training of score functions
In MESHI-score, individual predictors learn the parameters of their statistical model through Monte-Carlo Simulated Annealing (MCSA) optimization of a loss function. Stochastic optimization methods like MCSA may optimize any arbitrary loss function, resulting in diverse scores. Here we compare the score functions that resulted from the optimization of three loss functions: a global-correlation loss, a median per-target correlation loss, and a median per-target enrichment loss (see Results and discussion). Given a training set of targets T, and Θ the set of allowed θ combinations (see Individual predictors above), the state of a predictor is (L, θ), where L ⊂ T and θ ∈ Θ. The weight vector is derived from the predictor's state by averaging the per-target coefficients of linear regression models. The θ and weight vectors are then fed to the score function nlfc(f) (see Individual predictors above), and the state's performance over the entire training set T is evaluated using the loss function. Monte-Carlo steps replace a predictor state by a neighboring one (i.e., replacing one of L's targets).
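A rough sketch of this training loop follows. The helper fit_state is hypothetical and stands in for the per-target regression and averaging step described above, and the geometric cooling schedule is an assumption the text does not specify.

```python
import math
import random

def mcsa_train(targets, theta, loss_fn, fit_state, subset_size=20,
               n_steps=10000, t_start=1.0, t_end=0.01, seed=0):
    """Monte-Carlo Simulated Annealing over predictor states (L, theta) -- illustration only.

    fit_state(L, theta)          -- hypothetical helper: derives the predictor (weight vector)
                                    from the state, e.g. by averaging per-target regression coefficients
    loss_fn(predictor, targets)  -- any loss function evaluated over the full training set
    """
    rng = random.Random(seed)
    L = rng.sample(targets, subset_size)
    current_loss = loss_fn(fit_state(L, theta), targets)
    best = (list(L), current_loss)
    for step in range(n_steps):
        t = t_start * (t_end / t_start) ** (step / n_steps)       # geometric cooling (assumed)
        new_L = list(L)
        new_L[rng.randrange(subset_size)] = rng.choice(targets)   # neighbor: replace one of L's targets
        new_loss = loss_fn(fit_state(new_L, theta), targets)
        # Metropolis criterion: always accept improvements, occasionally accept worse states
        if new_loss < current_loss or rng.random() < math.exp((current_loss - new_loss) / t):
            L, current_loss = new_L, new_loss
            if current_loss < best[1]:
                best = (list(L), current_loss)
    return best
```

Because loss_fn is an arbitrary callable, the same loop accommodates global loss functions as well as target-centered ones.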

The ensemble score
In this study, we used ensembles of a thousand predictors. Given a feature vector f, each predictor outputs a pair: a predicted quality and its weight (see Individual predictors above). The ensemble score is the weighted median of all the thousand predictions.
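A weighted median is straightforward to compute; the sketch below assumes the weights are the per-predictor W values described above.

```python
import numpy as np

def ensemble_score(predictions, weights):
    """Weighted median of the individual predictions (the ensemble score)."""
    predictions = np.asarray(predictions, float)
    weights = np.asarray(weights, float)
    order = np.argsort(predictions)                 # sort predictions, carry the weights along
    cumulative = np.cumsum(weights[order])
    # the weighted median is the first prediction whose cumulative weight reaches half the total
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return predictions[order[idx]]
```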

Results and discussion
MESHI-Score trains its individual predictors by MCSA optimization of a loss-function, which provides much flexibility in score development.
To study the effect of the loss function on different performance aspects, we experimented with global and target-based loss functions (data not shown). Here, we systematically compare the three best-performing (to date) versions of MESHI-Score. One of them, MESHI-Score-global-correlation (GC), was trained using a global loss function (see Training of score functions). The other two, MESHI-Score-median-correlation (MC) and MESHI-Score-median-enrichment (ME), were trained using target-centered loss functions (see Training of score functions). In order to compare the three scores, we performed a 5X5-fold cross-validation test. In each round of the test, 4/5 of the targets served as the training set, and the three trained scores predicted the decoy accuracies of the remaining fifth. Table 1 presents the average per-target performances of the scores over six performance measures. Figure 3 provides a more detailed view of the per-target distributions of these performances.

Figure 3. GC, MC, and ME refer to MESHI-Score-global-correlation, MESHI-Score-median-correlation, and MESHI-Score-median-enrichment, respectively. MC_S, MC_P, MRMSE, ME5, ME10, and ML refer to the median Spearman correlation, Pearson correlation, root mean square of error, 5% enrichment, 10% enrichment, and loss, respectively. The two segments of each box indicate the second and third quartiles of the population.

Overall, the three scores have similar performances, which are on par with the current state of the art (Table 2). Yet, the differences between the global score and the two target-centered scores are statistically significant and telling. Most notably, the different measures are not consistent with one another. The best prediction of mean absolute accuracy is achieved by the global score (GC), yet it performs worse in estimating the relative accuracies of decoys within their group. The target-centered scores, on the other hand, are more likely to pick one of the best decoys available, and less likely to pick one of the least accurate decoys (Figure 4 and Figure 5).

Figure 4. The distribution of differences in loss values between MESHI-Score-median-enrichment (ME) and MESHI-Score-global-correlation (GC). Five GC/ME pairs are generated by the 5X5 cross-validation test for each database target. The histogram depicts the distribution of their loss value differences. The apparent skew towards positive (red) differences indicates that ME is more likely to result in lower losses than GC. Similar trends are observed for the other measures of Table 1, and a Wilcoxon signed-rank test indicates that their skews are statistically significant (P < 0.01).
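The paired test referenced in the caption can be reproduced along these lines; this is only a sketch, and the loss arrays themselves would come from the cross-validation runs.

```python
from scipy.stats import wilcoxon

def paired_wilcoxon(loss_me, loss_gc):
    """Wilcoxon signed-rank test on paired per-target loss values of two scores."""
    statistic, p_value = wilcoxon(loss_me, loss_gc)   # paired test on the differences
    return statistic, p_value
```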

Conclusions
The EMA problem is a long-standing, hard computational problem, which is further complicated by its group-centered nature. Moreover, it is a multi-objective problem, as we seek both absolute and relative estimates of accuracy. Our results suggest that the loss function used to optimize the parameters of an EMA score has a profound impact on the score's performance, and that different loss functions may excel at different objectives. Specifically, target-based loss functions may be beneficial for target-based objectives (e.g., loss). One may speculate that this insight is applicable to other machine-learning tasks, especially ones in which the standard assumption of iid sampling of the data does not hold.