Discovering Secondary Protein Structures via Local Euler Curvature

Protein structure analysis and classification, which is fundamental for predicting protein function, still poses formidable challenges in the fields of molecular biology, mathematics, physics and computer science. In the present work we exploit recent advances in computational topology to define a new intrinsic unsupervised topological fingerprint for proteins. These fingerprints, computed via Local Euler Curvature (LEC), identify secondary protein structures, such as Helices and Sheets, by capturing their distinctive topological signatures. Using an extensive protein residue database, the proposed computational framework not only distinguishes between structural classes via unsupervised clustering but also achieves remarkable accuracy in classifying protein structures through a supervised machine learning classifier. We also show that the internal structure of LEC space embeds the information about the secondary structure of proteins. Beyond its immediate implications for the advancement of critical application areas such as drug design and biotechnology, our approach opens a fascinating avenue towards characterizing the multiscale structures of diverse biopolymers based solely on their geometric and topological attributes.

and Lasker basic medical research awards for predicting (with circa 95% accuracy) proteins' 3D structures from their amino-acid sequences.6][17][18] Consequently, a parallel line of research has been exploring more mathematical approaches based on geometry and topology.
This has been enabled by recent advances in computational topology that provide efficient algorithms to synthesise highly complex nonlinear high-dimensional data as geometrical objects ('shapes'), which satisfy properties (called topological invariants) that remain unchanged under continuous deformations (e.g. stretching, twisting, or bending) [19][20][21][22][23][24]. With this approach, data is represented via Simplicial Complexes (SC) 25, which generalize the notion of triangulation of a surface and are constructed by gluing together simplices (i.e. points, line segments, triangles and their higher dimensional counterparts). The gluing satisfies certain nested conditions (i.e. higher-dimensional simplices must contain their lower-dimensional faces). The gluing or reconstruction of the data typically follows a combinatorial process (e.g. Simplicial Filtration, sometimes referred to as filtration cutoff) endowed with a distance function, which then constructs a family (indexed by distance) of simplicial complexes (i.e. growing sequences of meshes).
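The filtration idea described above can be illustrated with a minimal sketch. It assumes a Vietoris-Rips (clique) construction on toy 3D coordinates; the helper names `rips_complex` and `filtration` are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations
from math import dist


def rips_complex(points, cutoff, max_dim=2):
    """Simplices (sorted index tuples) of a Vietoris-Rips complex.

    A set of points forms a simplex when every pair of its points
    lies within `cutoff` of each other. Brute force, toy sizes only.
    """
    n = len(points)
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= cutoff}
    simplices = [(i,) for i in range(n)]          # 0-simplices
    for k in range(2, max_dim + 2):               # k = number of vertices
        for verts in combinations(range(n), k):
            if all(pair in close for pair in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


def filtration(points, cutoffs, max_dim=2):
    """Family of complexes indexed by the distance cutoff."""
    return {c: rips_complex(points, c, max_dim) for c in cutoffs}


# Toy data (an assumption): corners of a unit square in the plane z = 0.
points = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
family = filtration(points, [0.5, 1.0, 1.5])
counts = {c: len(s) for c, s in family.items()}
print(counts)
```

Each complex in the family contains the previous one, which is the nested (growing) behaviour the text refers to. The brute-force enumeration is only meant for small examples.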
Every simplicial complex in this sequence is then studied via topological invariants (e.g. persistent barcodes, Euler-Poincaré characteristics, etc.). Applications of computational topology to the study of proteins include protein structure analysis [26][27][28], molecular interactions in the context of molecular simulations [29][30][31], and hybrid approaches combining machine learning methods with computational topology [32][33][34]. Notably, some hybrid approaches have achieved 85% accuracy in protein classification 32 and the MathDL hybrid algorithm won the D3R Grand Challenge 4 35. Despite these advances, a direct classification of protein structures from their topology is still an open problem. Indeed, to date, most computational topological approaches employ the barcode as the main topological invariant since it is complete 36 and is endowed with stability conditions 37.
However, the space of all barcodes is not a vector space and, as a consequence, is not directly amenable to analysis (e.g. statistical analysis), although there are ongoing theoretical developments in this direction (e.g. persistence landscapes 38, barcode stratification via Coxeter complexes 39). In short, barcodes contain excessive information and do not lend themselves to analysis and interpretation of results, which leads to difficulties in protein classification.

To advance, we take advantage of a key mathematical property of the Euler characteristic (EC), which is that it can be decomposed (or partitioned), leading to the concept of Local Euler Curvature (LEC), in this work sometimes also called Local Euler Characteristic [40][41][42]. LEC lends itself to protein analysis since local properties of proteins are crucial. Overall, this leads to a more efficient compartmentalised analysis and interpretation, while still connecting to the global topological properties of proteins. Specifically, we will show that the basic secondary structures of proteins are naturally described by LEC. To validate this, we analysed a representative database containing millions of protein residues, which were subsequently modelled and scored using the LEC representation. Importantly, we show for the first time that the structures of proteins can be precisely classified based only on their topology, using a simple methodology based on the LEC representation. We organise the remainder of this work as follows: we first reconstruct a sequence of simplicial complexes (i.e. via filtration) and compute their corresponding LEC for a specific protein, which provides an overview of the computational pipeline. We then apply the LEC in the context of an ensemble of proteins to maximise the mapping of secondary structures and compare with previous classification schemes. These fundamental findings are then used in an unsupervised learning framework to show how secondary structures can be mapped and classified via LEC. Moreover, we employ a Random Forest Classifier model to predict the proteins' secondary structures relying solely on their reduced backbone and respective topology. Finally, we conclude this study with a discussion section, which also outlines future perspectives.
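To make the decomposition of the Euler characteristic concrete, the sketch below uses one standard vertex-wise partition, in which each simplex contributes its signed weight equally to its vertices. This is an assumption for illustration: Eqs. (1) and (4) are not reproduced in this section, so the formula and the helper names here may differ from the authors' definitions.

```python
from fractions import Fraction
from itertools import combinations


def clique_complex(n_vertices, edges):
    """All cliques of a small graph, as sorted vertex tuples (brute force)."""
    edge_set = {frozenset(e) for e in edges}
    simplices = []
    for k in range(1, n_vertices + 1):
        for verts in combinations(range(n_vertices), k):
            if all(frozenset(p) in edge_set for p in combinations(verts, 2)):
                simplices.append(verts)
    return simplices


def euler_characteristic(simplices):
    """Alternating sum over simplices: (-1)^dim, with dim = len - 1."""
    return sum((-1) ** (len(s) - 1) for s in simplices)


def local_curvature(simplices, n_vertices):
    """Share each simplex's signed weight equally among its vertices."""
    kappa = [Fraction(0)] * n_vertices
    for s in simplices:
        share = Fraction((-1) ** (len(s) - 1), len(s))
        for v in s:
            kappa[v] += share
    return kappa


# Toy graph (an assumption): a filled triangle 0-1-2 plus a pendant edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
simplices = clique_complex(4, edges)
kappa = local_curvature(simplices, 4)
chi = euler_characteristic(simplices)
print(kappa, chi)
```

By construction the local values sum to the global Euler characteristic, which is the link between local and global topology mentioned above. A per-residue value could then be obtained by summing over the residue's atoms.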

As a brief summary of what results are to be expected: we will first show in Fig. 1 that the LEC of protein residues naturally describes the traditional secondary structure assignment from standard methodologies. Then, these qualitative results will be generalized to an ensemble of protein structures (see Fig. 2), and the 7 secondary structure classes will be characterized by their individual LEC profiles. Once the robust statistics of our aggregation of classes into different LEC filtrations have been established, we will reinterpret the filtration process as a topological feature space, which will be used for learning the protein structures in an unsupervised manner. Additionally, we will train a Random Forest Classifier model, and its final scores in comparison to reference methodologies will be analysed in Table 1.

Local Euler Curvature of Proteins
We start our exploratory work by studying a specific example, namely the protein 2GB1, which will guide the remainder of our study on an ensemble of proteins. We first illustrate, in Fig. 1a, the conventional process of

4.6 Å, and 6.7 Å, respectively, which also illustrates the combinatorial explosion of edges that emerge as a function of increasing cutoff. Moreover, for every generated SC, we compute its corresponding Euler Characteristic and Local Euler Curvature (see Methods). The LEC can be computed per atom of the protein; however, this leads to a fine-grained computation, which is unnecessary for residue-based protein structural classification. Thus, we restrict to the backbone network, comprising PDB atoms CA, C, N of each protein residue. Hereafter, the network used to compute the simplicial complex is formed by the restricted backbone, and the LEC of a residue is given by its PDB atoms CA, C and N (see Methods). Such computations are shown in Fig. 1c, which depicts that along the filtration process (within the cutoff range 1.5 to 7.5 Å), for every generated SC, the corresponding LEC associ-

our approach from the preceding section, we computed

To delve deeper, we employed standard software to categorize residues into these seven classes, computing the individual averages for each. To this end, we exclusively considered residues with matching DSSP and STRIDE assignments, denoted as the consensus classification.
The results presented in Figs. 2c-2f show the LEC averages and standard deviations after the consensus classification of CATH residues. Fig. 2c displays the average LEC across the seven secondary structure categories within the [1.5, 5.5] Å cutoff range. One key finding is a distinctive gap between extended structures (β-Bridge, β-Strand, and Coils) and highly turned structures (α), including α-Helix, 3₁₀-Helix, π-Helix, and Turns, which emerges just beyond the Cα(i)-Cα(i+1) distance at around 4.6 Å. The differentiation between all classes is further evident in the vicinity of 3.1 Å, notably highlighting the distinctiveness of α-Helix and β-Strand structures, as previously seen in Figs. 1d-1f. Extending the analysis up to a 7.5 Å cutoff (Figs. 2d-2f) yields a finer differentiation across all seven categories.
In particular, the β structures display a LEC plateau after 4.0 Å in Fig. 2d. The similarity between β-Strand and β-Bridge persists until 5.0 Å, after which β-Bridge exhibits smaller oscillations than the characteristic β-Strand oscillations, as indicated by the reduced standard deviation, including a gap around 6.7 Å.
In contrast, the α structures in Fig. 2e showcase various patterns. While the three helix types share similar LEC profiles, they do not feature the plateau characteristic of β structures. Divergence among α structures begins at 5.0 Å, where 3₁₀-Helix and π-Helix demonstrate predominantly positive LEC values between 5.0 and 6.0 Å, while α-Helix exhibits negative values. Subsequently, beyond 6.0 Å, α-Helix acquires positive values, while the other two exhibit negative values. Notably, the relatively small standard deviation of the LEC demonstrates remarkable homogeneity across a wide range of cutoff distances.
The behaviour of the remaining classes, Turns and Coils, is illustrated in Fig. 2f. Interestingly, they exhibit intermediate characteristics, with Turns resembling 3₁₀-Helix trajectories, while Coils mostly resemble β-Bridge structures. However, both categories display minimal oscillations beyond 6.0 Å, distinguishing them from what is shown in Figs. 2d and 2e.
The correlation between typical distances, such as the α-Helix pitch and β-Strand interchain distances (Fig. 1), and the trends described in the previous sections highlights the utility of our methodology in capturing plateaus and oscillations within distinct ranges, enhancing its versatility.
Up to this point, our results suggest that we can effectively identify α and β structures in the scenario depicted in Fig. 1. Furthermore, we can provide detailed and nuanced descriptions for all seven structural categories, as illustrated in Fig. 2. Henceforth, we shall show that the observed LEC features are sufficient to classify protein

Accuracy of LEC representation
The final findings we report here focus on evaluating how well our Random Forest Classifier (denoted rbLEC)

First, it is important to gauge the accuracy of DSSP and STRIDE themselves, considering that they assign different classifications. We find that they agree with each other with an accuracy of 76% for the CATH dataset (including non-consensus residues) and 72% for the test set. In contrast, our rbLEC model was trained exclusively on the consensus classification, omitting residues for which DSSP and STRIDE do not agree. Thus, comparable scores between our rbLEC classifier and DSSP/STRIDE are to be expected. Indeed, Table 1 shows that rbLEC achieves a test set accuracy of 79% and 74% with respect to DSSP and STRIDE, respectively, while the CATH dataset yields 81% and 71%. Moreover, our rbLEC model provides class probabilities for a given feature. This allows us to compute the Top K = 2 accuracy, in which a prediction is considered successful if the first or second most probable class matches the reference classification label. We obtained scores of 90% and 87% for the test and CATH datasets, respectively. This remarkable result highlights the potential effectiveness of rbLEC and our training approach in providing a robust residue classifier for seven classes, based solely on the LEC of a given protein chain's three backbone atoms. Notably, our methodology requires no ad-hoc determination of hydrogen bond networks or protein backbone geometries.
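The Top K = 2 criterion described above can be sketched as follows. The class names, probabilities and the helper `top_k_accuracy` are illustrative assumptions, not values or code from the study.

```python
def top_k_accuracy(probabilities, labels, classes, k=2):
    """Fraction of samples whose true label is among the k most probable classes."""
    hits = 0
    for probs, label in zip(probabilities, labels):
        # Indices of classes sorted from most to least probable.
        ranked = sorted(range(len(classes)), key=lambda i: probs[i], reverse=True)
        top = {classes[i] for i in ranked[:k]}
        hits += label in top
    return hits / len(labels)


# Hypothetical class probabilities for four residues and three classes.
classes = ["Helix", "Strand", "Coil"]
probabilities = [
    [0.7, 0.2, 0.1],  # true Helix: correct at rank 1
    [0.5, 0.4, 0.1],  # true Strand: correct only at rank 2
    [0.6, 0.3, 0.1],  # true Coil: missed even at rank 2
    [0.1, 0.2, 0.7],  # true Coil: correct at rank 1
]
labels = ["Helix", "Strand", "Coil", "Coil"]

top1 = top_k_accuracy(probabilities, labels, classes, k=1)
top2 = top_k_accuracy(probabilities, labels, classes, k=2)
print(top1, top2)
```

In this toy case the Top 2 score is higher than the plain accuracy because the second residue is recovered by its second most probable class.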
Additionally, Table 1 provides the f1-score for the four primary classes (encompassing over 95% of available residues): β-Strand, α-Helix, Turns, and Coils. This score provides more information about the precision and recall balance of predicted classes. Interestingly, our rbLEC classifier yields f1-scores comparable to those of DSSP/STRIDE, corroborating our findings in the preceding sections.
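As a small worked example of the per-class f1-score, the harmonic mean of precision and recall, the sketch below uses made-up labels (H, E, C) that are purely illustrative. The helper `f1_per_class` is hypothetical.

```python
def f1_per_class(y_true, y_pred, cls):
    """f1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative reference and predicted labels for six residues.
y_true = ["H", "H", "H", "E", "E", "C"]
y_pred = ["H", "H", "E", "E", "C", "C"]

scores = {cls: f1_per_class(y_true, y_pred, cls) for cls in ("H", "E", "C")}
print(scores)
```

Here class H has perfect precision but imperfect recall, while class C has the opposite imbalance, which is the kind of information that accuracy alone does not expose.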

Discussion
Our model's accuracy, as well as its performance in identifying the top two candidates, as shown in Table 1, is truly encouraging. To gain further insights and understand potential limitations, we turn to the confusion matrix presented in Table SI 2. This matrix, which predominantly features diagonal elements, suggests that our model excels in correctly classifying structures. However, it is worth noting that β-Bridge and π-Helix are the primary sources of relative errors.
Next, we dive into the interpretability of our methodology. We focus on β-Bridge structures, which are often misclassified as β-Strand or Coils. From a structural and geometrical standpoint, these misclassifications are quite understandable. A closer examination of Figs. 2d and 2f reveals significant similarities in their LEC profiles, with noticeable differences only emerging after the 5.0 Å mark. In reality, finding perfectly flawless β-Bridge conformations, as seen in isolated structures, is a rare occurrence in densely packed proteins.

residue counts. This innovative approach highlights the unique aspects of our method compared to traditional classification methods.

What truly sets our methodology apart is its ability
From our analysis, it is evident that Coils and Turns structures exhibit transitional features in their LEC profiles between helices and sheets, see

We anticipate that a detailed examination of these categories will provide novel insights about protein secondary structures, shedding light onto cases where the classification systems do not align and potentially revealing interesting subtypes.
Assessing π-Helix structures poses its own set of challenges due to limited consensus classifications in our test set (only 5 samples, see Table SI

In summary, the Euler characteristic (a global topological invariant) and its associated Local Euler Curvature (a local geometric measure) provide a powerful mathematical and computational tool to dissect in a consistent way the local and global properties associated with the topological representation of proteins. Specifically, we find that the Local Euler Curvature provides a novel tool to classify proteins' secondary structures, which matches the state of the art without making strong geometrical assumptions. We envisage that our methodology will be useful to understand not only protein structures but also protein interactions in the context of disease models. For instance, this could be applied to Alzheimer's Disease, where it has been found that amyloid-beta peptides have an anti-microbial role and may interact with glycoproteins of various pathogens [44][45][46][47][48]. Moreover, we envisage that extensions of our proposed methodology will in the long term contribute to the ambitious goal of predicting the 3D structure-function relationship of proteins.

For the testing dataset, we acquired 22 proteins with a total of 17,875 residues from RCSB PDB.50 To ensure fairness, we selected a diverse range of complete proteins, encompassing both small (161 residues) and large (3797 residues) structures. This set encompasses prevalent tertiary structure groups such as α + β, α/β, all-α, and all-β, including notable representatives like beta barrels, membrane proteins, and enzymes. Those are detailed in Table SI 1.

Before processing and performing the analysis, we

Associating simplicial complexes (and indeed a topological space) to a protein structure is a fundamental step. There are numerous ways to achieve this, but herein we employ the simplicial filtration method to construct the so-called Vietoris-Rips complex 52,53 (or equivalently clique complex). For completeness, we provide the basic concepts and specifically how we implement it. A simplicial complex (SC), K, is a subset of the power set 2^V of the set of nodes (vertices) V (in the sense of graph theory), with elements of K denoted faces (f) and K endowed with the hereditary property: given an element f ∈ K and another element f′ ⊆ f, then f′ ∈ K. Thus, given

We chose to project the data to the first 5 PCs as the gains in explained variance ratio rapidly decreased after this point. For clustering, we fit a series of Gaussian Mixture Models (GMMs). GMMs represent a dataset as a mixture of multiple Gaussian distributions with (possibly) different parameters, allowing the model to capture more complex patterns than, for example, a k-means based method. We fit GMMs with n clusters, n ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10}, directly to the 5-dimensional projection effected by PCA.
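The projection and clustering steps described above might look roughly like the following sketch with Scikit-learn. The synthetic matrix standing in for per-residue LEC profiles, its shape, and the random seeds are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for LEC profiles: 300 "residues" x 20 cutoff values,
# drawn from two well separated groups (not real protein data).
X = np.vstack([
    rng.normal(-1.0, 0.3, size=(150, 20)),
    rng.normal(1.0, 0.3, size=(150, 20)),
])

# Project onto the first 5 principal components.
pca = PCA(n_components=5)
Z = pca.fit_transform(X)

# Fit one Gaussian Mixture Model per candidate number of clusters, 2 to 10.
models = {
    n: GaussianMixture(n_components=n, random_state=0).fit(Z)
    for n in range(2, 11)
}

# Cluster assignments from the two-component model.
labels2 = models[2].predict(Z)
print(pca.explained_variance_ratio_.round(3), np.bincount(labels2))
```

Inspecting `explained_variance_ratio_` is one way to see where the gains in explained variance level off, which is the stated reason for keeping five components.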
We tested the robustness of the clustering using a variety of measures.The Jensen-Shannon metric 55 quantifies the dissimilarity or divergence between two probability distributions.As GMMs are essentially probability distributions, the Jensen-Shannon metric can be used to measure the similarity of GMMs.By fitting differing GMMs on subsets of the data and measuring the similarity of the models, we can test the stability of the clustering.
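For intuition, the Jensen-Shannon metric can be written down directly for discrete distributions, as in the sketch below. This is a simplification: for continuous GMMs the divergence has no closed form and would have to be estimated, for example by sampling, and how that was done in this study is not stated here. The function names and example distributions are illustrative.

```python
from math import log2, sqrt


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base 2, range 0 to 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return sqrt(max(jsd, 0.0))  # guard against tiny negative rounding


# Two similar distributions, e.g. cluster occupancies from two fitted models.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

d_same = jensen_shannon_distance(p, p)
d_pq = jensen_shannon_distance(p, q)
d_disjoint = jensen_shannon_distance([1.0, 0.0], [0.0, 1.0])
print(d_same, d_pq, d_disjoint)
```

Small distances between models fitted on different subsets would indicate a stable clustering, while values close to 1 would indicate that the models disagree.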
Similarly, we can measure the homogeneity and completeness of GMMs trained on different subsets of the data. Homogeneity measures the degree to which each cluster contains only data points that are members of a single class. A high homogeneity score indicates that the clusters are consistent with the data categories, which is desirable in clustering. Completeness quantifies the degree to which all data points that are members of a particular class are assigned to the same cluster. In essence, it evaluates whether all data points of a given class are correctly clustered together. A high completeness score indicates that the clustering algorithm successfully captures all data points from the same category within a single cluster. By comparing the homogeneity and completeness of clusters of one GMM to the classes assigned by another, we can test how much the cluster predictions would change when starting with a different sample class. See

Random Forest Classifier
Our ensemble of random trees was created using Scikit-learn 54 (sklearn) with 100 trees. The trees were fitted using a balanced class weight, and an out-of-bag strategy was used to assess the accuracy of the predictors. The minimum number of samples required to split a leaf was exhaustively explored using a nested resampling strategy. The cross-validation used an inner and an outer resampling of 5 folds each; both resamplings were stratified and shuffled. For each fit, the accuracy, balanced accuracy, f1-weighted and f1-macro scores were collected, and they are shown in Fig. SI 13a. Subsequently, the classification model was trained using the CATH data set of consensus classification, using a minimum of 100 samples to split each leaf. The final validation scores, from the test data set, are also
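The training setup described in this section could be sketched as below with Scikit-learn. Several points are assumptions: the data are synthetic, the candidate grid is arbitrary, the mapping of "minimum samples to split a leaf" onto the `min_samples_split` parameter is a guess, and only balanced accuracy is scored for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for LEC features and secondary structure labels.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# 100 trees, balanced class weights, out-of-bag estimate enabled.
forest = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                oob_score=True, random_state=0)

# Nested resampling: 5 inner and 5 outer folds, stratified and shuffled.
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: exhaustive search over a hypothetical grid of split sizes.
search = GridSearchCV(forest, {"min_samples_split": [2, 50, 100]},
                      cv=inner, scoring="balanced_accuracy")

# Outer loop: score the tuned model on folds it was not tuned on.
nested_scores = cross_val_score(search, X, y, cv=outer,
                                scoring="balanced_accuracy")

# Final model with the split size quoted in the text (100 samples).
final = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               oob_score=True, min_samples_split=100,
                               random_state=0).fit(X, y)
oob = final.oob_score_
print(nested_scores.mean(), oob)
```

Keeping the hyperparameter search inside the outer folds avoids tuning and evaluating on the same samples, which is the point of the nested strategy.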

Fig. 1. Representations of protein PDB 2GB1. (a) We highlight the relevant distances. Typical approximate distances are 3.6 Å for the (53C-54C) backbone length, and 5.2 Å (5CA-16CA) and 5.6 Å (31CB-34CB) for the β-Strand interchain separation and α-Helix, respectively. The colors purple/orange/grey represent the standard DSSP assignments of β-Strand, α-Helix and Turns/Coils, respectively. (b) Graph representation showing the nodes in blue and the edges in grey. (c) LEC representation, where grey dots are the computed Local Euler Curvature as a function of the cutoff distance of all residues from 2GB1 and the blue line is its average. (d)-(f) Representation at different important filtration cutoffs 3.1, 4.6 and 6.7 Å (also indicated in Fig. 3b) of the 0-simplices and 1-simplices, where the colors represent the LEC (Eq. 4) of each residue according to the respective colorbars. (g) LEC profiles for all residues of 2GB1, where we also indicate by solid horizontal lines the 3 cutoffs 3.1, 4.6, 6.7 Å. The LEC profiles were re-scaled to [-1, 1] to improve the contrast.

assigning secondary structures to the protein 2GB1 using the DSSP algorithm. We then proceed via a computational combinatorial topology approach where a discrete topological space is associated to the 2GB1 protein structure via a Simplicial Filtration approach (see Methods). Such a construction leads to the simplicial complex (SC) depicted in Fig. 1b, for a fixed filtration cutoff parameter. However, Fig. 1b can be pruned to accommodate a specific maximum or minimum edge distance, hence yielding distinct SCs (i.e. filtration). For instance, Figs. 1d-1f showcase the respective SC for cutoff distances of 3.1 Å,

the simplices and their corresponding local curvatures (specifically Eq. (1)), which were then aggregated to yield the LEC (Eq. (4)) for each individual residue. Our data aggregation encompassed a total of 2,215,533 residues. Proceeding with the Simplicial Filtration approach by varying the cutoff distance from 1.5 to 7.5 Å, we observe a combinatorial explosion in the emergence of simplices, as depicted in Fig. 2a. Indeed, note the exponential growth, with an average of approximately 10^9 simplices in dimensions 4 and 5 at a 7.5 Å cutoff. Within this cutoff range, the highest dimensional simplices were of dimension 14 and the most frequently occurring ones were of dimensions 4, 5, and 6.

Fig. 1c and those in Fig. 2b is discernible. Despite the

Fig. 2. Topology of the consensus residues from CATH database. (a) Average number of simplices (#simplices) of all molecules vs filtration cutoff. Higher than 13-simplices were not found in this range. (b)-(f) LEC computed using Eq. (4) from all consensus residues. The average and standard deviations are represented by the solid line and filled curves, respectively. In (b) we show the average over all consensus residues, while the indicated colours in (c) to (f) represent different consensus secondary structures. Vertical lines show relevant distances, where the bold line represents the typical Cα(i)-Cα(i+1) distance, and dotted lines are the distances highlighted in Fig. 3b.

(see Methods) performs when compared to traditional methods for assigning protein secondary structures. Table 1 displays the results in terms of f1-score and accuracy when compared to DSSP and STRIDE classifications.
Fig. SI 11b. Moreover, helices and sheets form distinct clusters when Coils and Turns are not considered. For other numbers of clusters, however, some clustering metrics markedly decrease and the Gaussians assigned by the GMM indicate higher variability. This may be due to the fact that a high number of new distinct cluster structures emerge in the data (see Fig. SI 12) that do not combine in a reliable way. Nevertheless, these clearly defined clusters open new interesting avenues of investigation.
1). Adding to the complexity are significant differences between the DSSP and STRIDE methodologies. From a geometric perspective, π-Helix structures represent helices with an expanded radius, often nested within α-Helix motifs. This arrangement introduces interruptions in the continuity of helices, potentially carrying significant functional implications for the protein. To gain deeper insights, future studies should explore these intertwined α-Helix structures using a more extensive dataset to establish a more robust consensus on π-Helix classifications. Interestingly, our classifier agrees more consistently with DSSP than with STRIDE, suggesting a potentially more uniform approach by the DSSP authors in their parameter choices.

Our dataset comprises 3D protein structures. The training set includes structures sourced from the latest version of the CATH database,43,49 which categorizes protein structures from the Protein Data Bank into non-redundant domains. This compilation of non-redundant protein domains contains 14,938 files, representing 2,200,595 residues.

Fig. SI 9 for results. The consensus categories and full CATH categories gave broadly similar results. The metrics and clustering of the H set were markedly differentiated; see Fig. SI 10.

Article Discovering Secondary Protein Structures Rodrigo A. Moreira et al.

Table 1. Final scores of our data sets. The number of residues used was 17,875 for the final assessment test set and 2,215,533 for our CATH data set. The f1-score is defined by f1 = 2(precision × recall)/(precision + recall).