Detection of native and mirror protein structures based on Ramachandran plot analysis by interpretable machine learning models

In this contribution, the discrimination between native and mirror models of proteins according to their chirality is tackled based on structural protein information. This information is contained in the Ramachandran plots of the protein models. We provide an approach to classify these plots by means of an interpretable machine learning classifier, the Generalized Matrix Learning Vector Quantizer. Applying this tool, we are able to distinguish with high accuracy between mirror and native structures by evaluating only the Ramachandran plots. The classifier model provides additional information regarding the importance of regions, e.g. α-helices and β-strands, for discriminating the structures precisely. This importance weighting differs across the considered protein classes.


Introduction
Kurczynska and Kotulska tried in their work [28] to distinguish between native and mirror models in an automated way (Automated Mirror-Native Discrimination, AMND). For this purpose, they used the energy terms obtained by PyRosetta, a Python-based interface to Rosetta for molecular modeling [9]. The discrimination results of 69% on average are respectable but far from satisfying. Further, it was not possible for them to find one uniform rule based on these energy terms for all the protein classes A-G of SCOPe [18,31], which were studied in their investigation. Moreover, in [2] it becomes apparent that the calculation of the energy terms is rather complex. Alternatively, they also considered the differences within Ramachandran plots (R-plots), and thus the distribution of the dihedral angles of a protein's backbone depicted in a toroidal plot, because these plots are known to provide structural information [43]. However, they concluded that these plots are not feasible for automated mirror/native discrimination.

Nonetheless, there are three things an R-plot can definitely give information about: a protein's secondary structure elements, the favored/allowed regions of those, and the handedness of helices (right-handed and left-handed) [3,48]. Following the International Union of Pure and Applied Chemistry's (IUPAC) definition of handedness being the same as chirality [37], the distinctive feature for differentiating between native and mirror models would be the handedness of helices. Since this handedness can also be found as a characteristic of R-plots, they represent a simple but important tool for finding structural differences in proteins [3]. Assuming that native and mirror proteins display different structures, the resulting differences in the R-plots should be observable.
To tackle the AMND problem, appropriate mathematical and statistical tools are required, which reflect the mathematical structure of the R-plots and can evaluate them automatically with sufficient certainty. Therefore, machine learning tools like (deep) artificial neural networks come into play [13], which are powerful tools to analyze complex data like R-plots. However, these networks and their inferences are frequently not explainable [45]. An alternative to deep networks is provided by interpretable machine learning approaches like prototype-based models [5,56], which rely on the nearest prototype principle for data representation [38]. Frequently, these networks make use of specific data properties [58] and, hence, their results are easier to interpret and may provide additional knowledge drawn from the training data.

In this context, a key observation regarding R-plots is that we can take them as approximated density plots or histograms of dihedral angle pairs (Φ, Ψ) in the two-dimensional plane. Thus, the data analysis and statistical evaluation have to deal with the comparison of densities and their discrete representations. Accordingly, we apply an adaptive vector quantization based classifier model, the so-called Generalized Matrix Learning Vector Quantizer (GMLVQ) [19,50]. After training, this classifier provides structural knowledge regarding the data, which supports the classification decision. In particular, a classification correlation matrix is delivered, which describes correlations between data features [5,56]. GMLVQ is known to be robust and easy to interpret [46,58]. In a biomedical context, it was successfully applied to analyse flow cytometry data and to detect early folding residues during protein folding [4,7].

The structure of this paper is as follows: First, the data set in use and the corresponding data preprocessing are described in more detail. Afterwards, we give a brief introduction to R-plots as well as to the machine learning data analysis tool GMLVQ, which is an interpretable artificial neural network. Subsequently, we state the general workflow for the AMND to distinguish between native and mirror samples based on R-plot analysis by means of GMLVQ.

Finally, we present the numerical results for classification performance and the extracted knowledge provided by the interpretable model together with its direct biological explanation.

Ramachandran plots (R-plots) display the dihedral angles Φ and Ψ of a protein's backbone to visualize their distribution [43]. R-plots provide an easily inspectable tool to detect underlying properties of the secondary structure of a protein [3]. However, it has to be kept in mind that the measured angles describe the current state, at the time of measurement, of the respective dynamic atoms in the backbone [33]. Discretizing the (Φ, Ψ)-plane into an N × N grid of cells, each protein model yields a vector x = (x_1, …, x_n)^T ∈ R^n with n = N^2, where x_k ≥ 0 is the (relative) number of dihedral pairs in the cell with the coordinates (j, l) and k = (j − 1) · N + l. Thus, x is a relative histogram vector, also denoted as a probability vector, i.e. Σ_k x_k = 1. For example, cells (1,4) and (2,4) as well as cells (4,1) and (4,2) characterize the favored regions for right-handed α-helices (α) and left-handed α-helices (L_α), respectively. The cells (1,2) and (2,1) constitute the favored regions for β-strands (β).

The Generalized Learning Vector Quantizer (GLVQ, [47]) is a prototype-based machine learning method for classification derived from a heuristic approach proposed by T. Kohonen [22]. The mathematically well-justified model optimizes the classification performance by means of stochastic gradient descent learning (SGDL, [14,44]) and is known to be robust and interpretable. The underlying cost function to be minimized during training approximates the overall classification error for a given training data set T = {(x_j, c(x_j))}, where c(x_j) ∈ C = {1, …, C} is a training data class label out of a set of available class labels [19]. For this purpose, the GLVQ model assumes a set of prototypes W = {w_k ∈ R^n, k = 1 … M} with class labels c(w_k) ∈ C such that at least one prototype is assigned to each class. Thus, a partition of the prototype set into class-wise subsets is obtained. Further, a dissimilarity measure d(x, w) is supposed to evaluate the similarity between data and prototypes. This measure is required to be differentiable at least with respect to the second argument to ensure SGDL.

For a given GLVQ-configuration, i.e. a fixed prototype set W, a new data point x is classified by the assignment x ↦ c(w_ω) with ω = argmin_k d(x, w_k), known as winner-takes-all in nearest prototype classification [38]. The prototype w_ω is denoted as the winner of the competition.
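The winner-takes-all rule can be sketched in a few lines. The prototypes, labels, and query point below are toy values chosen purely for illustration, not taken from the paper's models:

```python
import numpy as np

# Minimal sketch of nearest-prototype classification: a query x receives
# the class label of its closest prototype (winner-takes-all).
prototypes = np.array([[0.0, 0.0],   # prototype of class 0 (e.g. "native")
                       [1.0, 1.0]])  # prototype of class 1 (e.g. "mirror")
labels = np.array([0, 1])

def classify(x, W=prototypes, c=labels):
    """Assign x to the class of the nearest prototype (squared Euclidean)."""
    d = np.sum((W - x) ** 2, axis=1)  # d(x, w_k) for every prototype
    winner = np.argmin(d)             # index omega of the winning prototype
    return c[winner]

print(classify(np.array([0.1, 0.2])))  # -> 0, closest to the class-0 prototype
```

In the full GLVQ model several prototypes per class compete simultaneously; the rule above is unchanged, only the prototype set grows.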

During learning, GLVQ adapts the prototypes by minimizing the overall cost function E_GLVQ(W) = (1/#T) · Σ_k E(x_k, W) via SGDL with respect to the prototypes, realizing a competitive learning scheme [1]. Here, E(x_k, W) is the local classification error and #T denotes the cardinality of the training data set. The local error E(x_k, W) = f(μ(x_k)) depends on the choice of the monotonically increasing squashing function f and the classifier function μ(x) = (d⁺(x) − d⁻(x)) / (d⁺(x) + d⁻(x)), where d⁺(x) = d(x, w⁺) is the distance to the so-called best matching correct prototype w⁺, i.e. the closest prototype with c(w⁺) = c(x), and d⁻(x) = d(x, w⁻) is the corresponding distance to the best matching incorrect prototype w⁻. For a detailed consideration we refer to [20]. 
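The classifier function and local error above can be sketched directly. The logistic squashing function is one common choice for f; the distance values are illustrative only:

```python
import numpy as np

# Sketch of the GLVQ classifier function mu(x) = (d+ - d-) / (d+ + d-).
# mu(x) < 0 indicates a correct classification (d+ < d-), mu(x) > 0 an
# incorrect one; mu is bounded in (-1, 1).
def mu(d_plus, d_minus):
    return (d_plus - d_minus) / (d_plus + d_minus)

# Local error E(x, W) = f(mu(x)) with a logistic squashing function f.
def local_error(d_plus, d_minus, f=lambda t: 1.0 / (1.0 + np.exp(-t))):
    return f(mu(d_plus, d_minus))

# Correctly classified sample: d+ < d-, so mu is negative and the
# squashed local error stays below 0.5.
print(mu(0.2, 0.8))           # negative: correct classification
print(local_error(0.2, 0.8))  # below 0.5
```

Minimizing the average of these local errors over the training set by SGDL then yields the prototype updates described below.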

In standard GLVQ, the squared Euclidean distance d(x, w) = ‖x − w‖² is used. Another popular choice is the squared Euclidean mapping distance d_Ω(x, w) = ‖Ω(x − w)‖² proposed in [50], depending on the mapping matrix Ω ∈ R^{m×n}. Here, m is the mapping dimension, usually chosen as m ≤ n [8]. Thus, the data are first mapped linearly by Ω and afterwards the Euclidean distance is calculated in the mapping space R^m.

The SGDL for training performs an attraction-repelling scheme in the data space as the common feature of all LVQ schemes [23]: For a given training datum x_j with known class label c(x_j), the best matching correct prototype w⁺ is moved slightly towards x_j, whereas the best matching incorrect prototype w⁻ is slightly repelled. Iterative application of that scheme with random training data selection realizes SGDL for the prototypes. Interestingly, the mapping matrix Ω can also be optimized by SGDL to achieve a better separation of the classes in the mapping space [51]. Note that SGDL for Ω-optimization usually requires a careful regularization technique [49]. The respective algorithm is known as Generalized Matrix LVQ (GMLVQ) [50].

Further, this approach allows to detect outliers due to the bounded minimum distance principle, in contrast to class assignments via decision hyperplanes [12]. Accordingly, outliers are rejected because the model validity cannot be guaranteed for these data (outlier-reject). Further, data x with an uncertain classification decision can be rejected by evaluating the difference between the distance d(x, w_ω(W)) to the overall winner w_ω(W) and the distance d(x, w_ω(W′)) with W′ = W \ W_{c(w_ω(W))}, i.e. the winner among the prototypes of all other classes. In case of a significant deviation, which corresponds to an uncertain decision, the query regarding x is rejected (classification-reject).
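One attraction-repelling step for the prototypes can be sketched as follows. The plain, unsquashed update and the fixed learning rate are simplifications for illustration; the actual GLVQ/GMLVQ gradients additionally carry factors from the squashing function f and the classifier function μ:

```python
import numpy as np

# Sketch of one LVQ attraction-repelling step: the best matching correct
# prototype w_plus is pulled towards the training sample x, the best
# matching incorrect prototype w_minus is pushed away.
def lvq_step(x, w_plus, w_minus, lr=0.1):
    w_plus = w_plus + lr * (x - w_plus)     # attraction
    w_minus = w_minus - lr * (x - w_minus)  # repulsion
    return w_plus, w_minus

x = np.array([1.0, 0.0])
w_plus, w_minus = np.array([0.0, 0.0]), np.array([2.0, 0.0])
w_plus, w_minus = lvq_step(x, w_plus, w_minus)
print(w_plus)   # moved towards x
print(w_minus)  # moved away from x
```

Repeating this step over randomly drawn training samples, together with analogous gradient steps for Ω, realizes the SGDL scheme of GMLVQ.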

The Data Set

We used exactly the same data set as given in [28], which was generated and analyzed there to detect native and mirror structures. Therefore, we give only a short summary of the data generation and preprocessing; for a more detailed description we refer to [28]: In order to emulate the necessary procedures in structure modeling from contact maps, the authors derived these maps [17] from a set of 1,305 representative domains of SCOPe superfamilies. Those maps were in turn the base for reconstructing full-atom proteins of different configurations [24,25,29,54], approximately 50 native and 50 mirror models for each domain. Consequently, the whole data set comprises 130,500 models, from which the dihedral angles were calculated [10].

Further, all of the models in the data set can be grouped into one of the classes A-G of SCOPe, representing seven distinctive data sets. Class A (all α) describes proteins that are predominantly made of α-helices, whereas class B (all β) mainly consists of β-strands. Each of the aforementioned data sets determines a specific learning task to distinguish mirror and native samples. To do so, a relative R-plot histogram vector x ∈ R^n with n = N × N and the grid resolution N = 6 was extracted for each sample. These data vectors served as training data for GMLVQ according to the considered tasks.
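The extraction of such a histogram vector can be sketched with NumPy. The dihedral angles below are random stand-ins for a real model's backbone dihedrals, used only to show the vectorization step:

```python
import numpy as np

# Sketch: turn the (Phi, Psi) pairs of one protein model into the relative
# R-plot histogram vector x in R^36 described above (N x N grid, N = 6,
# cell counts normalized to sum to one).
N = 6
rng = np.random.default_rng(0)
phi = rng.uniform(-180.0, 180.0, size=200)  # stand-in backbone dihedrals
psi = rng.uniform(-180.0, 180.0, size=200)

counts, _, _ = np.histogram2d(phi, psi, bins=N,
                              range=[[-180, 180], [-180, 180]])
x = counts.flatten() / counts.sum()  # row-wise: k = (j - 1) * N + l

print(x.shape)             # (36,)
print(round(x.sum(), 10))  # 1.0 -> a probability vector
```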

For each learning task A-G and ALL, we trained a separate GMLVQ model with three prototypes per class (mirror/native). In addition to the prototypes, we also adapted the mapping matrix Ω with the mapping dimension m = 36. The reported classification results are the averaged test performances obtained from 50 independent runs. Each run was done as a five-fold cross-validation procedure.
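The evaluation protocol of repeated cross-validation with averaged test accuracies can be sketched as follows. A scikit-learn NearestCentroid classifier stands in for GMLVQ here (both are prototype-based, but NearestCentroid is a much simpler model), synthetic two-class data replaces the R-plot histogram vectors, and only 5 of the 50 runs are shown to keep the sketch fast:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestCentroid

# Synthetic stand-in data: two well-separated classes of 36-dimensional
# vectors (the dimensionality of the R-plot histogram vectors, n = 36).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 36)),
               rng.normal(2.0, 1.0, (50, 36))])
y = np.array([0] * 50 + [1] * 50)   # 0 = native, 1 = mirror (stand-ins)

# Repeated five-fold cross-validation; all fold accuracies are collected
# and averaged, mirroring the paper's protocol.
accuracies = []
for run in range(5):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=run)
    for train_idx, test_idx in cv.split(X, y):
        model = NearestCentroid().fit(X[train_idx], y[train_idx])
        accuracies.append(model.score(X[test_idx], y[test_idx]))

print(len(accuracies))            # 25 fold results (5 runs x 5 folds)
print(np.mean(accuracies) > 0.9)  # True for these well-separated classes
```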

Further, for visual inspection and evaluation, we generated a summarized R-plot, which collects all pairs (Φ, Ψ) of dihedral angles for all samples of the considered task in a single R-plot, but separately for native and mirror samples. Thus, these R-plots can be seen as estimated dihedral angle densities in the (Φ, Ψ)-plane for mirror and native samples.

Previous publications suggest [28,55] that native and mirror conformations cannot be reliably separated for all-β proteins. Yet, we obtain a high accuracy for class B, which obviously collides with the statement of those native and mirror models being indistinguishable [28,55] and, furthermore, even exceeds the accuracy of class A. The relevant features for class discrimination are those corresponding to β-strands in the R-plot (see S3 Fig). In detail, the important underlying secondary structures in this case might be the right-handed triple helices (collagen) and parallel β-strands [26].

However, the confirmation of the actual underlying secondary structure as well as its 231 relation to chirality are still pending.

232
As protein classes C and D structurally show a combination of the two aforementioned classes all-α and all-β, the relevant features for class discrimination do so as well. Class C shows the best of all investigated accuracies. This result concurs with the findings in [28]. Among the protein classes which were not categorized according to their secondary structure, the multi-domain class E shows by far the best accuracy with 91.8%, whereas classes F and G do not exceed 80%. Even though class E achieves such a high accuracy, it has to be treated with care, since there has not been enough data for this protein class. The relatively poor accuracy for class F is most likely due to the fact that membrane proteins pose some difficulties in structure elucidation [34,42,64], which results in low resolutions [36]. A poor resolution in turn may lead to inaccurate atom coordinates. That means the calculations of the dihedral angles cannot be correct either, which complicates the classification. As for class G, the obtained low accuracy is probably due to the fact that small proteins do not have many amino acids, fewer than 100 [35,53], and therefore show fewer α-helices or β-strands than other proteins.

The R-plots for classes E-G together with the corresponding relevance profiles are depicted in S3 Fig. In order to assess the model's suitability for a more general problem, we considered all protein classes for training as well as for testing and achieved an overall accuracy of 88.09% with this more general approach, although such a discrimination was claimed not to be feasible [55]. Hence, we were able to show that a discrimination of native and mirror models using structural features is indeed possible.

The GMLVQ classifier achieves high separation accuracies for all protein classes except classes F and G. At least for the latter, acceptable results are obtained. Overall, the resulting accuracies show that a distinction of mirror and native structures by means of R-plots is possible with high precision and sensitivity for most protein classes.

The interpretable model offers additional insights: In particular, the relevance profiles, which weight regions of the R-plots like α-helices and β-strands for mirror-native discrimination, differ for the considered protein classes. The obtained relevance profiles are in good agreement with the respective biological knowledge about protein structure chirality, at least for the considered data set.

Thus, the presented approach offers a successful alternative to the statistical approach based on energy levels as proposed in [28] and emphasizes the importance of R-plots for the structural analysis of proteins, as already mentioned in [3]. Along this line, the data processing is also easier in the present approach compared to the complex calculations of the energy levels [2].