Flexibility-aware graph-based algorithm improves antigen epitopes identification

Epitopes of an antigen are the surface residues in the spatial proximity that can be recognized by antibodies. Identifying such residues has shown promising potentiality in vaccine design, drug development and chemotherapy, thus attracting extensive endeavors. Although great efforts have been made, the epitope prediction performance is still unsatisfactory. One possible issue accounting to this poor performance could be the ignorance of structural flexibility of antigens. Flexibility is a natural characteristic of antigens, which has been widely reported. However, this property has never been used by existing models. To this end, we propose a novel flexibility-aware graph-based computational model to identify epitopes. Unlike existing graph-based approaches that take the static structures of antigens as input, we consider all possible variations of the side chains in graph construction. These flexibility-aware graphs, of which the edges are highly enriched, are further partitioned into subgraphs by using a graph clustering algorithm. These clusters are subsequently expanded into larger graphs for detecting overlapping residues between epitopes if exist. Finally, the expanded graphs are classified as epitopes or non-epitopes via a newly designed graph convolutional network. Experimental results show that our flexibility-aware model markedly outperforms existing approaches and promotes the F1-score to 0.656. Comparing to the state-of-the-art, our approach makes an increment of F1-score by 16.3%. Further in-depth analysis demonstrates that the flexibility-aware strategy contributes the most to the improvement. The source codes of the proposed model is freely available at https://github.com/lzhlab/epitope. Author summary Epitope prediction is helpful to many biomedical applications so that dozens of models have been proposed aiming at improving prediction efficiency and accuracy. However, the performances are still unsatisfactory due to its complicated nature, particularly the noteworthy flexible structures, which makes the precise prediction even more challenging. The existing approaches have overlooked the flexibility during model construction. To this end, we propose a graph model with flexibility heavily involved. Our model is mainly composed of three parts: i) flexibility-aware graph construction; ii) overlapping subgraph clustering; iii) graph convolutional network-based subgraph classification. Experimental results show that our newly proposed model markedly outperforms the existing best ones, making an increment of F1-score by 16.3%.


Introduction 1
A B-cell epitope is a specific region at the surface of an antigen that can be neutralized 2 by antibodies, and this neutralization can consequently elicit crucial immune 3 response [1]. Identification of epitopes can be useful to design vaccines, drugs, reagents 4 and so on [2][3][4][5]. Hence, intensive efforts have been made to develop epitope prediction 5 models, including experimental-based and computational-based approaches. 6 Experimental methods, such as X-ray crystallography, nuclear magnetic resonance and 7 phage display [6], are accurate but labor intensive as well as costly, while computational 8 models are more efficient and economical, but suffer from lower accuracy. Due to the 9 difficulty of improving experimental approaches and the fast-evolving computational 10 techniques, massive efforts have been made from the computational perspective. 11 A general protocol of the computational-based epitope prediction models is: i) 12 collecting antigen-antibody interaction complexes to obtain epitopes; ii) engineering 13 features of epitopes from chemo-physical and statistical perspectives, such as residue's 14 polarity [7], hydrophobicity [8], protrusion index [9], relative frequency [10], et al.;iii) 15 building computational models based on the features, such as machine learning 16 models [11], graph models [12], statistical models [13], et al.; iv) identifying new 17 epitopes via the well trained models. Although dozens of methods have been proposed, 18 the prediction accuracy still has a big room to improve. For instance, the F1-score of 19 the state-of-the-art epitope prediction model is still less than 0.6 [6]. One possible 20 reason could be the ignorance of structural flexibility of antigens. To our best 21 knowledge, all existing approaches, either experimental or computational, are based on 22 the static structures of antibody-antigen interacting complexes, that is, the structures of 23 antigens and antibodies are fixed when epitopes are determined. However, the structure 24 of an antigen can be flexible [14,15], particularly the side chain of the surface residues. 25 Therefore, incorporating flexibility into epitope prediction models could be helpful and 26 promising. 27 Protein flexibility has been widely reported, and there are three types of flexibility, 28 i.e., local flexibility (at atom/residue level) [16], regional flexibility (at 29 intra-domain/multi-residue level) [17], and global flexibility (at multi-domain level) [18]. 30 The local and regional flexibility are mainly caused by the movement of chemical bonds 31 and bond angles [19], while the global flexibility mostly comes from the hinges, 32 helical-to-extended conformations and side chains [20][21][22]. Taking as an example shown 33 in Fig 1, the two structures (having the Protein Data Bank (PDB) [23] identifier of 34 1A14 and 7NN9, respectively) have exactly the same sequence of residues, but notably 35 different structures after aligned. The 1A14 is a bonded structure interacting with the 36 anti-influenza virus neuraminidase, while the 7NN9 is an unbound structure. After 37 alignment, the averaged RMSD (root mean square deviation) of the atoms between the 38 two structures is 0.434 ± 0.636Å, while the RMSD of the atoms at the epitope region is 39 0.576 ± 0.748Å, which is 1.33 times larger than the former one, indicating a markedly 40 movement of conformation upon binding. More interestingly, the side chain RMSD of 41 the epitope residues is 0.853 ± 0.993Å, showing a big fluctuation; cf. Fig 1(A) and (C). 42 These observations indicate that using bonded epitope conformations as training data to 43 May 8, 2021 2/15 figure out epitopes from antigens that have unknown binding partners, to some extent, 44 is inaccurate. This has also been argued by Zhao et al [6]. Unfortunately, all existing 45 models, both experimental and computational, have overlooked this property. Panel B shows the connections between the residue asparagine (on chain N of 1A14 at position 400) and its neighbors. The solid lines are the connections without considering flexibility, while the dash lines are the connections after flexibility is incorporated. The number on the lines are the distance between the linked atoms measured in angstrom (Å) based on the complex of 1A14. Panel C illustrates the spatial discrepancy between the epitopes on 1A14 and 7NN9, and Panel D shows the superimposed structure between the two in cartoon without side chains. Structures are downloaded from the Protein Data Bank [23] and the figures are generated using the Chimera [24].
To overcome this problem, we propose a graph model that fully considers the local These edge-enriched graphs are subsequently partitioned into non-overlapping subgraphs 50 by a message flowing algorithm [25] and expanded into overlapping subgraphs via a 51 community detection algorithm [26]. Finally, a graph convolutional network (GCN) is 52 built to discriminate the expanded subgraphs into epitopes or non-epitopes. 53 Experimental results show that the proposed model elevates the F1-score to 0.656, 54 making an increment of 16.3% compared to the state-of-the-art. In addition, this model 55 is able to identify all single, multiple and overlapping epitopes simultaneously.  The antigen-antibody interacting complexes are obtained from published data [6], in 59 which all redundant antigens as well as epitopes are removed, and single, multiple, even 60 overlapping epitopes have been well recorded. This dataset contains 258 61 antigen-antibody complexes, in which 163 antigens have only single epitope (each 62 antigen has only one epitope), 42 have multiple epitopes (each antigen has more than 63 one separated epitopes) and 53 have overlapping epitopes (each antigen has at least one 64 pair of epitopes that are overlapped with each other).

Flexibility calculation 66
The conformation of the composing residues of an antigen are not fixed, rather they are 67 quite flexible. This flexibility mainly comes from the rotation of substructures around 68 the covalent bonds [27], particularly the local flexibility (at atom/residue level) [16] and 69 regional flexibility (at intra-domain/multi-residue level) [17], which can be quantified by 70 the torsion angles of residues [28][29][30]. See Table S1 for the detailed torsion angles of the 71 twenty standard amino acids.

72
A residue can have several torsion angles based on the number of atoms located at 73 the side chain. For instance, Alanine has no torsion angle, while Proline has five torsion 74 angles; cf. Table S1. A torsion angle is determined by a tuple of four consecutive atoms 75 within a residue [31], in which the angle is the dihedral angle between the two planes 76 formed by three continuously connected atoms. Take the tuple (C, C ↵ , C , C 2 ) in  To compute the oscillation of each atom due to distortion, we build a tree structure in which the root node is the C ↵ atom, the leaf node is the farthest atom, and the intermediate nodes are the atoms between the two. Note that, there can have multiple leaf nodes; cf. Fig 3. The content contained in each node is the trajectory of the atom after distorted. Suppose the vector of an arbitrary atom is v = hx, y, zi and the unit vector of the rotation axis is k, then the rotated vector v rot of v is where ✓ is the angle of the rotation. By this means, we are able to determine the 81 possible spatial locations of each atom from its static structure after distortion.

82
Note that, the flexibility of a buried atom is markedly reduced, even to none, after 83 protein folding. Thus, we only calculate the flexibility of the surface atoms in this study. 84 That is, the root node is replaced by the most superficially buried atom that is 85 connecting with the exposed atom during flexibility determination; cf.   details of each step are as follows.

90
Step 1: flexibility-aware graph construction 91 A graph G = (V, E) is generated from the surface atoms of an antigen, where V is the 92 set of nodes (surface atoms) and E is the set of edges (connections) between atoms in V . 93 An atom is deemed as a surface atom if its accessible surface area is no less than 94 10Å 2 [12] calculated by NACCESS [32] with the default probe size, while an edge is 95 generated if the Euclidean distance between any two nodes is less than 8Å [12]. For instance, after enrichment on edges of the graph generated from PDB 1A14, the 101 number of edges is increased from 635 to 936. To accelerate the determination of edges, 102 the KD-tree data structure is borrowed to calculate the distance between atoms.

103
The atom-level graphs are upgraded into residue-level graphs by removing redundant 104 and self-contained edges. An edge is deemed as redundant if there exist more than one 105 edge connecting the two nodes, while the self-contained edges are the ones generated 106 from the atoms of the same residue. Step 2: overlapping subgraph clustering 108 The residue-level graph is clustered into overlapping subgraphs via two steps: partition 109 and expansion. Partition splits the whole graph into non-overlapping subgraphs by 110 using the Markov clustering algorithm [25], while the expansion enlarges separated 111 subgraphs into overlapping subgraphs by using the local community detection algorithm 112 DMF-R [26].

113
Graph partition is conducted on the weighted graph by using MCL [25] with default parameters, in which the weight w i,j of an edge e i,j is calculated as non-epitopic residues. An edge e c i,j is deemed as a boundary edge if one node is epitopic 120 and the other is non-epitopic.

121
Subgraph expansion is carried out on the partitioned subgraphs by using the 122 DMF-R algorithm. Unlike existing community detection algorithms that require global 123 information, DMF-R only takes the local community as input and considers its 124 characteristics as well. Hence, it is suitable to our subgraph expansion problem, which 125 is in line with the nature of spatially proximity of epitopes. This step is critical to 126 identify overlapping epitopes as they have been widely reported [33,34] while graph 127 partition cannot solve this problem intrinsically. To our knowledge, there is only one 128 study aiming at overlapping epitope identification [6].

129
Note that, the quality of graph partition and expansion are highly related to the way 130 of graph construction, i.e., flexibility-aware or flexibility-agnostic. With flexibility-aware 131 graph construction, hubs may appear; otherwise, all nodes will have evenly distributed 132 degree. See results later.

133
Step 3: graph classification 134 Expanded overlapping subgraphs are further classified as epitopes or non-epitopes by a 135 newly designed GCN-based algorithm [35,36], which is composed of two graph 136 convolutional layers and two fully connected layers. The graph convolutional layer 137 accounts for graph embedding, while the fully connected layer is for feature learning.

138
Let G = (F, A) be an expanded subgraph having F 2 R n⇥d be the feature matrix and A 2 R n⇥n be the weighted adjacency matrix, where n is the number of nodes and d is the number of dimensions. The i-th graph convolutional layer is calculated as where W (i) 2 R d (i 1) ⇥d (i) is a trainable weight matrix to be optimized and (·) is the activation function realized as ReLU [37]. Regarding the output of the j-th fully connected layer, it is computed as where W (j) 2 R d (j 1) ⇥d (j) is the parameter matrix to be learned, b (j) 2 R n⇥1 is the bias 139 and (·) is ReLU as well. All the weights W (.) are optimized by the stochastic gradient 140 descent algorithm.

141
Due to the imbalanced nature of epitopic and non-epitopic subgraphs, we use the focal loss [38] to optimize the training process, which is defined as  The F1-score, recall and precision are used to quantify the performance of our model as well as existing models, which are defined as: Positive) is the number of non-epitopes predicted as epitopes incorrectly, and FN (False 146 Negative) is the number of epitopes deemed as non-epitopes. Among these metrics, 147 F1-score is more robust and meaningful as the number of non-epitopes is far larger than 148 that of epitopes.

149
Performance qualification 150 Our proposed model is evaluated by using leave-one-out cross validation on the 258 151 complexes. The averaged F1-score, recall and precision are 0.656 ± 0.138, 0.587 ± 0.191 152 and 0.818 ± 0.150, respectively. Comparing to the exiting best mode Glep [6], the 153 proposed model has made remarkable advancement on epitope prediction, achieving an 154 increment of F1-score by 16.3%. On the same dataset, the F1-score achieved by existing 155 approach ElliPro [39], DiscoTope 2.0 [40], EpiPred [41] and Glep [6]  The key contribution of this study is the incorporation of flexibility into the epitope 165 prediction model. Thus, we carefully examine the impact of flexibility by keeping all 166 other steps unchanged but toggling flexibility only.

167
On average, the F1-score is 0.656 ± 0.138 and 0.599 ± 0.157 for the flexibility-aware 168 and flexibility-agnostic model, respectively. Comparing to the later, the former one lifts 169 the F1-score by 9.5%. Without flexibility the F1-score of the proposed model is only 170 increased by 0.035 compared to the state-of-the-art models [6]. However, this value is  It is obvious that flexibility has great impact on epitope prediction. We speculate 174 that the impact is mainly from the denser connections of the graphs when flexibility is 175 considered. Based on the data, we found that the averaged degree of each node (residue) 176 was increased from 6.39 ± 1.71 to 9.04 ± 2.58, achieving an increment of 41.5%; see epitopes either separated or overlapped [44,45]. These epitopes together, however, do 194 not form a big epitope, rather a set of epitopic residues. To solve this problem, we 195 expand subgraphs that are generated from graph partition into larger ones so that 196 shared epitopic residues can be included by more than one epitope, which facilitates the 197 identification of overlapping epitopes.

198
Among the dataset, there have 53 overlapping epitopes. By using our approach, we 199 can identify 96.2% of them, achieving an averaged F1-score of 0.678 at residue level.

200
Compared to the unexpanded version, the averaged F1-score is increased by 31.6%.

201
Expansion is also helpful to improve the performance of single and multiple epitopes 202 identification; see  After two rounds of learning rate decrement, the model is converged to a stable 216 minimum. At this state the best classification accuracy of the test set is 83.1%, and the 217 corresponding F1-score is 0.704.

219
Epitopes of an antigen can be neutralized by antibodies, identification of these epitopes 220 can be helpful to various applications, e.g., vaccine design and drug development.

221
Hence, intensive efforts have been made to solve this thorny problem, from both 222 experimental [46,47] and computational perspectives [48,49], particularly the later one. 223 Although chemo-physical properties of antigens, such as hydropathy index [50], 224 protrusion index [39] and statistical proximity [51], have been well explored to identify 225 epitopes, all of them are determined from the static structure of antigens, either bonded 226 or unbound. However, the structure of an antigen is not fixed, particularly the side 227 chain of the surface residues [52]. To incorporate such important information, we have 228 proposed a novel graph-based model that is flexibility-aware. Our model starts with partitioned into subgraphs by using MCL [25], and are expanded into overlapping 232 subgraphs by DMF-R [26]. Finally, these expanded graphs are classified into epitopes or 233 non-epitopes by a newly built graph convolutional network. Experimental results show 234 that the proposed approach outperforms all existing models, achieving a high F1-score 235 of 0.656. This score makes a 16.3% increment compared to that of the state-of-the-art 236 models. Besides, our model can identify single, multiple and overlapping epitopes 237 simultaneously. Although there still has notably room to improve the performance, this 238 study pioneers a new direction for improving epitopes identification.   Note, the side chain torsion angle must be formed from at least two heavy atoms, thus the GLY and ALA have no torsion angle.