Protein structure search to support the development of protein structure prediction methods

Protein structure prediction is a long-standing unsolved problem in molecular biology that has seen renewed interest with the recent success of deep learning with AlphaFold at CASP13. While developing and evaluating protein structure prediction methods, researchers may want to identify the most similar known structures to their predicted structures. These predicted structures often have low sequence and structure similarity to known structures. We show how RUPEE, a purely geometric protein structure search, is able to identify the structures most similar to structure predictions, regardless of how they vary from known structures, something existing protein structure searches struggle with. RUPEE accomplishes this through the use of a novel linear encoding of protein structures as a sequence of residue descriptors. Using a fast Needleman-Wunsch algorithm, RUPEE is able to perform alignments on the sequences of residue descriptors for every available structure. This is followed by a series of increasingly accurate structure alignments from TM-align alignments initialized with the Needleman-Wunsch residue descriptor alignments to standard TM-align alignments of the final results. By using alignment normalization effectively at each stage, RUPEE also can execute containment searches in addition to full-length searches to identify structural motifs within proteins. We compare the results of RUPEE to mTM-align, SSM, CATHEDRAL and VAST using a benchmark derived from the protein structure predictions submitted to CASP13. RUPEE identifies better alignments on average with respect to RMSD and TM-score as well as Q-score and SSAP-score, scores specific to SSM and CATHEDRAL, respectively. Finally, we show a sample of the top-scoring alignments that RUPEE identified that none of the other protein structure searches we compared to were able to identify. The RUPEE protein structure search is available at https://ayoubresearch.com. 
Code and data are available at https://github.com/rayoub/rupee.


Introduction
with no dependence on sequences or clustering. We compared our results with the mTM-align structure search [8], the secondary structure matching (SSM) search [9], and the CATHEDRAL structural scan [10], and found RUPEE is equal to or better than some of the best available structure searches on a benchmark of known protein structures. Additionally, we showed RUPEE, on average, returns results faster than mTM-align and CATHEDRAL.

Since the release of RUPEE [7], we have observed that RUPEE has been used to upload protein structures that were the output of a protein structure prediction method in order to identify the most similar known structure to the predicted structures. For the most part, these uploaded protein structures have had low sequence and structure similarity to known structures in the PDB. This low similarity to known structures is to be expected given the limited accuracy of current protein structure prediction methods.

When searching for structures with low sequence and structure similarity to known structures, the importance of small differences in structure similarity becomes proportionally larger since they comprise a larger percentage of the overall similarity. Protein structure searches that rely on sequences or clustering may miss these small differences in similarity because they do not consider every structure individually. Moreover, with respect to the use of sequences, while high sequence similarity usually indicates high structure similarity [11], high structure similarity has been observed even for structures with low sequence similarity since structure is more conserved [...] in sequences or clustering, it often lacks sufficient sensitivity to find the most similar match for a structure with low structure similarity to known structures. This lack of sensitivity for RUPEE top-aligned search mode with respect to low similarity searches is due to the lower accuracy of its initial structure similarity estimates used to filter candidate matches. Recognizing the need for a structure search with more sensitivity than top-aligned search mode and still having no dependence on sequences or clustering, we have added an additional search mode to RUPEE with increased sensitivity called all-aligned search mode.

Like our previous work on RUPEE [7], we again compare the results of RUPEE against mTM-align [8], SSM [9] and CATHEDRAL [10], but this time we do so for all-aligned search mode. Additionally, this time we also compare to the VAST protein structure search [13]. For comparisons, we use a new benchmark derived from protein structure predictions of free-modeling targets in CASP13 available at the CASP web site [14]. Previously we showed that RUPEE, in top-aligned search mode, is equal to or better than those we compared to for a benchmark of known protein structures. Here, we show that RUPEE, in all-aligned search mode, is better than those we compare to using a benchmark drawn from the output of protein structure prediction methods.

While it is possible to perform a protein structure search by exhaustively comparing to every available structure using pairwise structure alignments, we do not compare RUPEE to pairwise structure alignment tools used in this manner. If an exhaustive search is practical, it will always be optimal with respect to the pairwise structure alignment tool used. However, an exhaustive search may not be practical given the resources needed to achieve a desired response time. To illustrate this point, consider the number of pairwise structure alignments needed to perform a single search, since the pairwise structure alignments dominate the response time for RUPEE. For searching whole PDB chains in the RUPEE database, presently containing roughly 440,000 structures, RUPEE performs only 8000 pairwise structure alignments. Given the same amount of resources, an exhaustive search will take roughly 440,000/8000 = 55 times longer. With the number of structures available in the PDB growing rapidly, this difference will continue to increase. Therefore, we only compare to other protein structure searches which are designed, like RUPEE, to return quality results with much faster response times.

The protein structure searches we compare to represent a good mix of approaches. For structure searches that depend on sequences and clustering, mTM-align [8] is among the best available and is capable of handling uploaded structures with the same response times as for searching on a structure id. SSM [9] is a good example of a fast graph-theoretic structure search with no dependence on sequences and clustering. However, SSM's speed is at the expense of sensitivity since it depends on the spatial orientation and connectivity of secondary structure elements, which fails to capture the complexity of loops. CATHEDRAL [10] also uses a fast graph-theoretic approach that is more accurate than SSM, but still lacks sufficient sensitivity to identify the most similar structure matches for low similarity searches. Moreover, although the CATHEDRAL structural scan does not depend on clustering directly, it only searches on and returns results for representatives of sequence clusters at 35% similarity [15]. The VAST protein structure search [13] is similar to SSM in that it depends on the spatial orientation and connectivity of secondary structure elements. Although VAST is much slower than SSM for uploaded structures, its search database is more recent than that of SSM.

Previously, the RUPEE protein structure search focused on full-length matches. If you wanted to search for structures similar to a domain, you could search one of the protein domain classification databases such as SCOPe [16], CATH [15], or ECOD [17]. On the other hand, if you wanted to search for structures similar to a whole chain, you could search whole chains from the PDB. While the ability of RUPEE to search multiple databases for any given structure is flexible, we have extended this flexibility [...]

Together, all-aligned search mode and containment searches are significant additions to RUPEE that support the development of protein structure prediction methods by allowing researchers to find the structures most similar to the output of their predictions, regardless of how they vary from known protein structures.

We first give a brief outline of our linear encoding of protein structures described in more detail in our previous work on RUPEE [7], which remains at the core of the RUPEE protein structure search. Then, we describe our approach to top-aligned to provide context followed by the addition of all-aligned.

Linear encoding of protein structure

Previously [7], we introduced a linear encoding of protein structures based on torsion angle regions. We determined these regions by plotting a random sampling of torsion angles. Traditionally, the Ramachandran plot would be used for the plotting. However, we found that the discontinuities within the Ramachandran plot due to the cyclical nature of torsion angles made the Ramachandran plot poorly suited for this task. Instead, we plotted the sampled torsion angles on a polar plot we also introduced in [7] to clearly define continuous regions. Figure 1 shows our polar plot alongside its corresponding Ramachandran plot and Figure 2 shows the defined regions for helices, strands, and coil using the polar plot. We assigned an integer, referred to as a residue descriptor, to each defined region. The DSSP [18] secondary structure assignment codes for turns ('T') and bridges ('B') are assigned descriptors 11 and 12 respectively, [...]

As an example of our linear encoding, Figure 3 shows a typical β-turn-β motif annotated with the residue descriptors corresponding to the sequence shown below. The underlined elements in (1) correspond to the underlined elements in (2), (4), and (5) below to help illustrate the subsequent transformations from descriptors to shingles and finally to hashes.

[ 5,5,5,5,5,5,7,5,11,11,5,5,5,5,5,5 ] (1)

Next, we derive a multiset of overlapping 3-grams of residue descriptors, where a 3-gram is three consecutive residue descriptors. This representation is often referred to as shingling, given the likeness to overlapping roofing shingles [19]. The overlap between shingles ensures some of the order information within the original sequence is preserved in the multiset.

By shingling, we obtain a multiset of shingles from an ordered sequence of residue descriptors. As an example, the sequence in (1) becomes the following multiset of shingles.
After hashing, the multiset in (2) becomes the following multiset of integers.
The hashing completes the transformation of an ordered sequence of residue descriptors to a multiset of integers that still retains some of the order information present in the original sequence.
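The shingling and hashing steps can be sketched in a few lines of Python. This is a minimal illustration, not the RUPEE implementation: the function names are hypothetical, and the base-13 positional hash is an assumption, chosen only because it maps the shingle [5,5,5] to 915, the value the text associates with that shingle.

```python
from collections import Counter

# Residue descriptor sequence (1) from the text.
DESCRIPTORS = [5, 5, 5, 5, 5, 5, 7, 5, 11, 11, 5, 5, 5, 5, 5, 5]


def shingles(descriptors, k=3):
    """Return the overlapping k-grams of a descriptor sequence, in order."""
    return [tuple(descriptors[i:i + k]) for i in range(len(descriptors) - k + 1)]


def shingle_hash(shingle, base=13):
    """Hypothetical hash: read the shingle as a number written in `base`.

    base=13 is an assumption; it is used here only because it sends
    (5, 5, 5) to 915.
    """
    value = 0
    for descriptor in shingle:
        value = value * base + descriptor
    return value


grams = shingles(DESCRIPTORS)                      # ordered list of 3-grams
hashes = Counter(shingle_hash(g) for g in grams)   # multiset of integers
```

A sequence of 16 descriptors yields 14 overlapping 3-grams, and the `Counter` records how often each hash occurs, which is the multiset view used in the rest of the method.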
Figure 4. Flowchart for the RUPEE top-aligned and all-aligned search modes. It can be assumed that linear encodings and min-hashes for all structures other than the query structure have been stored via an offline indexing process and are accessible throughout the flowchart.
In (4) the value 915, corresponding to the shingle [5,5,5], occurs frequently, indicating the presence of β-strands. To address this lack of specificity, we introduced a heuristic we call run position encoding (RPE) [7], where a run is a consecutive sequence of identical descriptors. To distinguish between short and long runs, thereby increasing the specificity of the shingles, we add a factor of 10^5 to each shingle hash as a function of the first residue's position in a run. Applying RPE to the multiset of integers in (4) gives the multiset in (5), where the leading zero run factors are shown for clarity.

The pyramidal approach for the run factors used in RPE preserves matches at the boundaries between secondary structure runs and loops that would not otherwise be preserved in the presence of differences in run lengths of one or more.

Before RUPEE can service a search request, an offline process has to be executed in order to index the available protein structures. This index consists of residue coordinates, 3-grams, min-hashes and band-hashes for LSH banding. If a user searches on a structure id, its representation will already be stored in the index. On the other hand, if a user uploads a protein structure, it will be parsed into residue coordinates, 3-grams, min-hashes and band-hashes. Aside from the initial parsing, searches on uploaded structures are identical to searches by structure id. Figure 4 shows a flowchart for the RUPEE top-aligned and all-aligned search modes described below.

Top-aligned search mode

When representing each protein structure using a multiset of integers as shown in (5), we define full-length similarity for a candidate pair of structures a and b using the Jaccard similarity for multisets [20], where i ranges over all possible shingle hashes s_i, and a_i and b_i give the counts of shingle hash s_i in structures a and b, respectively.
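As an illustration of the similarity measure, the sketch below computes a Jaccard similarity for multisets from per-hash counts. It assumes the common min/max form, the sum over i of min(a_i, b_i) divided by the sum over i of max(a_i, b_i); the function name and the convention for two empty multisets are choices made for this example only.

```python
from collections import Counter


def multiset_jaccard(a, b):
    """Jaccard similarity of two multisets given as iterables of hashes.

    Assumes the min/max form: sum_i min(a_i, b_i) / sum_i max(a_i, b_i),
    where a_i and b_i are the counts of hash s_i in a and b.
    """
    counts_a, counts_b = Counter(a), Counter(b)
    keys = counts_a.keys() | counts_b.keys()
    # Counter returns 0 for a hash that is absent from one multiset.
    shared = sum(min(counts_a[k], counts_b[k]) for k in keys)
    total = sum(max(counts_a[k], counts_b[k]) for k in keys)
    # Two empty multisets are treated as identical (a convention, not from the text).
    return shared / total if total else 1.0
```

For example, `multiset_jaccard([915, 915, 920], [915, 920, 920])` shares one copy of each hash out of a maximum of two copies each, giving 2/4 = 0.5.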
For top-aligned, RUPEE uses min-hashing [21] and locality sensitive hashing (LSH) [22] as described in detail in our previous work [7] [...] Jaccard similarity estimates to complete the initial filtering.

Next, regarding the multiset in (5) as an ordered sequence of integers, we obtain more accurate similarity scores for the filtered matches by performing global alignments using the residue descriptor sequences with a simplified Needleman-Wunsch (NW) [23] dynamic programming algorithm where mismatches and gaps are penalized -1 points and matches are awarded +1 points. For containment searches, depending on whether or not the search type is Contained-In or Contains, one of the sides of the dynamic programming matrix is not penalized for the opening gap and end gap. This is often referred to as semi-global sequence alignment [24].
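The scoring scheme above (+1 for a match, -1 for a mismatch or gap) and its semi-global variant can be sketched as follows. This is an illustrative implementation that returns only the score. The function name and the `contained_in` flag are hypothetical, and the choice of which side receives free end gaps for Contained-In is an assumption: here the query is taken to be contained in the target, and a Contains search would swap the two arguments.

```python
def nw_score(query, target, contained_in=False):
    """Needleman-Wunsch score with match +1, mismatch -1, gap -1.

    With contained_in=True, target positions skipped before and after the
    aligned region are not penalized (semi-global alignment).
    """
    n, m = len(query), len(target)
    # Row 0: aligning an empty query prefix against target[:j].
    prev = [0 if contained_in else -j for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: query[:i] aligned against nothing costs i gaps.
        curr = [-i] + [0] * m
        for j in range(1, m + 1):
            match = 1 if query[i - 1] == target[j - 1] else -1
            curr[j] = max(prev[j - 1] + match,  # align the two residues
                          prev[j] - 1,          # gap in target
                          curr[j - 1] - 1)      # gap in query
        prev = curr
    # Semi-global: the alignment may end anywhere along the target.
    return max(prev) if contained_in else prev[m]
```

Aligning `[5, 5, 7]` against `[1, 5, 5, 7, 1]` scores 1 globally (three matches, two penalized gaps) and 3 with `contained_in=True`, since the flanking target residues are then free.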

After the NW algorithm is performed on the top 40000 protein structures from the initial filtering, the top-scoring 8000 structures are obtained for subsequent pairwise structure alignments. Pairwise structure alignment, the most accurate method for comparing protein structures, involves finding a set of spatial rotations and translations for two protein structures that minimizes a distance metric. Traditionally, the root mean squared deviation (RMSD) between α-carbons of aligned residues is minimized. However, the RMSD score does not factor in the distance between unaligned residues nor does it consider the percentage of aligned residues, that is, alignment coverage. RMSD scores also have some dependence on the length of the aligned proteins. On the other hand, the TM-score [25] takes all residues into account and normalizes for both coverage and length of the aligned proteins. For this reason, TM-score is frequently used in scoring protein structure alignments and accordingly, we use it for our alignment scoring.
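The contrast between RMSD and TM-score can be made concrete with a small sketch. It evaluates both scores for one fixed superposition, given the distances between aligned α-carbon pairs. The full TM-score is the maximum over superpositions, which tools such as TM-align search for, so the sketch covers only the scoring formula. The function names are hypothetical, and the d0 expression is the standard published one, 1.24 (L - 15)^(1/3) - 1.8. It is meaningful only for chains well above 15 residues, and the special handling of short chains is not reproduced here.

```python
import math


def tm_d0(norm_length):
    """Distance scale d0 of the TM-score; assumes norm_length well above 15."""
    return 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, norm_length):
    """TM-score of one fixed superposition.

    distances: distances in angstroms between aligned alpha-carbon pairs.
    norm_length: normalizing length, for example the query length or the
    average length of the two structures.
    """
    d0 = tm_d0(norm_length)
    # Unaligned residues contribute nothing but still count in norm_length,
    # which is how alignment coverage enters the score.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_length


def rmsd(distances):
    """RMSD over the aligned pairs only; coverage is ignored."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))
```

With 50 perfectly superposed pairs out of a normalizing length of 100, `rmsd` returns 0.0 while `tm_score` returns 0.5, which shows why RMSD alone can favor short partial alignments.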

RUPEE uses TM-align [26] for performing pairwise structure alignments. TM-align uses a rotation matrix designed to maximize the TM-score rather than minimize the RMSD, along with dynamic programming, to find the best alignment. Similar to how we used global and semi-global sequence alignments with the NW algorithm for full-length and containment search types respectively, we apply the same logic to how [...] greater detail in our previous work on RUPEE [7].

When doing the pairwise alignments, we take the top-scoring 8000 structures from the NW alignment filter and perform structure alignments using TM-align with a reduced number of dynamic programming iterations. Next, we sort these structure alignments by TM-score and obtain the top-scoring 400 structures. Finally, we perform structure alignments using TM-align on the top-scoring 400 structures using the default number of dynamic programming iterations and return the results sorted by TM-score.

The filter sizes of 40000, 8000 and 400 have been chosen based on quality of results and speed. We have found that increasing the size of any of these filters results in [...]

While top-aligned may be sufficient for searching for known protein structures [7], the need for greater sensitivity arises when searching with structure predictions that may only have a maximum TM-score of less than 0.50 when compared against all available structures. Furthermore, for top-aligned, the effectiveness of the containment searches is limited by the initial filtering using Jaccard similarity estimates, which biases the initial filtering toward full-length matches. We address both of these concerns with the addition of all-aligned.
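The top-aligned pipeline described above is a cascade in which each stage applies a more expensive score to a smaller candidate set. The sketch below shows that shape only. All four scoring callables are hypothetical placeholders supplied by the caller, and the first stage simply ranks by a Jaccard estimate rather than reproducing the min-hash and LSH banding used by RUPEE.

```python
import heapq


def top_k(candidates, score, k):
    """Keep the k highest-scoring candidates, best first."""
    return heapq.nlargest(k, candidates, key=score)


def top_aligned_search(database, jaccard_estimate, nw_score, fast_tm, full_tm,
                       sizes=(40000, 8000, 400)):
    """Cascade of filters with the sizes quoted in the text.

    jaccard_estimate, nw_score, fast_tm and full_tm are hypothetical
    callables mapping a candidate structure to a score against the query.
    """
    stage1 = top_k(database, jaccard_estimate, sizes[0])  # initial filtering
    stage2 = top_k(stage1, nw_score, sizes[1])            # descriptor alignment
    stage3 = top_k(stage2, fast_tm, sizes[2])             # reduced-iteration alignment
    return sorted(stage3, key=full_tm, reverse=True)      # default-iteration alignment
```

Because every stage discards candidates using a cheaper score than the final one, a true best match can be lost early when the cheap score underestimates it, which is the sensitivity concern that motivates all-aligned.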

In contrast to top-aligned, all-aligned skips the initial step of using min-hashing and LSH filtering. Instead, all-aligned runs the NW algorithm on all available structures to obtain the residue descriptor sequence alignments rather than the NW alignment scores as in top-aligned. The residue descriptor sequence alignments are then passed into TM-align as the initial alignments and TM-align is set to stick to those initial [...]

Like our previous work on RUPEE [7], we compare the average scores of ranked results to those of mTM-align [8], SSM [9], CATHEDRAL [10] and VAST [13]. However, this time we compare results for both RUPEE all-aligned and top-aligned search modes and instead of using a benchmark of known protein structures we use a benchmark derived from structure predictions submitted to CASP13. We perform pairwise comparisons with each structure search individually to reduce sources of systemic error in our evaluation. Each of these pairwise comparisons to mTM-align, SSM, CATHEDRAL and VAST is discussed in its own section below. We also compare RUPEE all-aligned to RUPEE top-aligned using the benchmark of predicted protein structures and a benchmark of known protein structures. After comparing results, we provide a sample of the top-scoring alignments RUPEE was able to identify that mTM-align, SSM, CATHEDRAL and VAST all failed to identify.

To evaluate the results of RUPEE against mTM-align [8], SSM [9], CATHEDRAL [10] and VAST [13] for the case of providing support for the development of protein structure prediction methods, we derived our initial benchmark from structure predictions submitted to CASP13. To ensure the benchmark was challenging, we only considered predictions submitted for the 25 single-segment free-modeling (FM) target domains in CASP13 [14]. To ensure the benchmark was not too challenging, we only considered the first designated predictions of the top 10 performing groups ranked by the Assessors' formula (GDT_TS + QCS) applied to free-modeling targets. We call this benchmark casp_d250 since it consists of 250 structures, corresponding to 25 target domains for each of the 10 top-performing groups. The top 10 performing CASP13 prediction groups are shown in Table 1. [...] included in the benchmark and appear in the title of each plot below.

All benchmark definitions can be found in S1 Benchmarks. We uniquely identify each benchmark structure using the format 〈target〉TS〈group〉-〈domain〉. For instance, the prediction submitted by the AlphaFold team named A7D and numbered 043 for the second domain in the target T0960 is referred to as T0960TS043-D2.

[...] Full-Length that all depend on the TM-score as described above, we have added the additional search types Q-score [9] and SSAP-score [28] to RUPEE in order to perform comparisons to SSM and CATHEDRAL using their native scores, respectively. We have also added the RMSD search type to RUPEE for its general usefulness and ubiquity. The additional search types demonstrate the pluggable nature of RUPEE. Although we still use TM-align for all internal pairwise alignments after the initial filtering and NW alignments, we are able to easily apply different scores to the resultant alignments provided by TM-align besides the TM-score.

Given that we use a number of different scores for evaluations that may differ from the native scoring used by a structure search, we take their top 100 results based on their native scoring and re-sort them based on the compared score before making comparisons. If we did not re-sort by the compared score, it would allow poor scoring structures based on the compared score to remain at higher ranks and bias the results favorably for RUPEE. Due to this re-sorting, we are careful not to draw conclusions [...] aggregate to be evaluated.
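The re-sorting step can be sketched as below. The field accessors and function names are hypothetical, and the per-rank averaging reflects one plausible reading of "average scores of ranked results", namely the mean of the compared score at each rank across benchmark queries; the text does not spell out the exact aggregation.

```python
def resort_top(results, native_key, compared_key, n=100, higher_is_better=True):
    """Take the top n results by native score, then re-sort by the compared score.

    Set higher_is_better=False for scores such as RMSD where lower is better.
    The native score is assumed to be one where higher is better.
    """
    top = sorted(results, key=native_key, reverse=True)[:n]
    return sorted(top, key=compared_key, reverse=higher_is_better)


def average_by_rank(per_query_scores):
    """Mean compared score at each rank across queries (assumed aggregation)."""
    depth = min(len(scores) for scores in per_query_scores)
    count = len(per_query_scores)
    return [sum(scores[rank] for scores in per_query_scores) / count
            for rank in range(depth)]
```

Note that re-sorting can only reorder the results a search already returned. A structure that the native scoring left outside the top 100 stays excluded even if it would rank well on the compared score.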

For some comparisons below, we compare the RUPEE structure search to a protein structure search using a score that the latter may not have a corresponding search type for. For example, CATHEDRAL does not have an option to search or sort by RMSD. In our comparison to CATHEDRAL using RMSD, we are not justified in saying RUPEE is better than CATHEDRAL because of the RMSD comparison. However, we can say that RUPEE is better than CATHEDRAL with respect to RMSD and insofar as RMSD is a good measure of structure similarity, this comparison is useful. Below, we make comparisons using TM-score and RMSD in all cases. However, where possible, we also compare to other searches using scoring schemes native to those searches.

In the comparisons for the TM-score, we normalize by either the length of the query structure or the average length of structures being compared. For the TM-score plots, on the vertical axis, we use (q) to indicate normalization by the length of the query structure or (avg) to indicate normalization by the average length of the two structures.

Scoring vs. mTM-align

In Figure 5, we compare the average TM-scores [25] and RMSD scores of the top 100 results for RUPEE and mTM-align [8] using whole PDB chains deposited in the PDB as of 2020-01-01. TM-scores and RMSD scores have been calculated using TM-align [26]. For comparing TM-scores, we use the RUPEE Contained-In search type to search by TM-score normalized by the query structure, identical to the scoring used by mTM-align. For comparing RMSD scores, we use the RUPEE RMSD search type. mTM-align does not provide an RMSD search type so when comparing by RMSD we sort the top 100 mTM-align results by RMSD. [...] favorably with RUPEE all-aligned and top-aligned for the first 10 results, staying within 0.01 TM-score points, but then drops off precipitously. In the left plot of Figure 5 we do not cut off the plot after 10 results because that would be unfair to RUPEE to not [...]

In Figure 6, we compare the average Q-scores [9], TM-scores [25] and RMSD scores of the top 100 results for RUPEE and SSM [9] using SCOP v1.73 domains. We calculated TM-scores and Q-scores using TM-align [26] and RMSD scores using the Combinatorial Extension (CE) [29] algorithm. For the TM-score comparison, we normalize by the average length of the compared structures. In addition to TM-score and RMSD, we also compare on Q-score because SSM provides an option to search by Q-score and RMSD. We calculated our own Q-scores and RMSD scores for SSM because we observed the scores they provided are incorrect in many cases. For instance, a large set of results all start with a block of perfect matches with a Q-score of 1.0 and RMSD of 0.0, which is clearly impossible given that we are searching with predicted structures. We did not observe this problem with SSM when searching on known protein structures.

As shown in Figure 6, both RUPEE all-aligned and top-aligned perform better than SSM at all ranks for Q-scores, TM-scores and RMSD scores. The similarity between the plots for Q-scores and TM-scores suggests some correspondence between how Q-scores and TM-scores are calculated. Both the Q-score and the TM-score are intended as good measures of full-length similarity.

For comparing to CATHEDRAL, we filter by CATH s35 cluster representatives since that is all that CATHEDRAL returns. For the TM-score comparison, we use TM-align to calculate the scores, and we normalize by the average length of the compared structures. We use the cath-ssap tool provided in the cath-tools suite [30] to calculate SSAP-scores for comparisons. As we do for SSM above, we calculate RMSD scores using CE. [...] Figure 7, both RUPEE all-aligned and top-aligned perform better than CATHEDRAL at all ranks for SSAP-scores, TM-scores and RMSD scores. It is remarkable that RUPEE performs better than CATHEDRAL using the score that the CATHEDRAL search is based on.

In Figure 8, we compare the average TM-scores and RMSD scores of the top 100 results for RUPEE and VAST. For the TM-score comparison, we use TM-align to calculate the scores, and we normalize by the average length of the compared structures. We were not able to duplicate the VAST-score ourselves for our internal alignment scoring and so do not provide an additional search type as we did for the Q-score and the SSAP-score. As we do for SSM and CATHEDRAL above, we calculate RMSD scores using CE.

We run RUPEE against whole PDB chains deposited in the PDB as of 2020-01-01 [...] VAST at all ranks for TM-scores and RMSD scores. In the left plot of Figure 8, for the TM-score comparison, we use the VAST-score sort provided by VAST and in the right plot, for the RMSD comparison, we use the RMSD score sort provided by VAST when collecting the data and then sort the top 100 results based on the compared score as usual. Since the VAST-score is roughly a full-length score, we use the TM-score [...]

For full-length searches, in Figure 9, we compare the results of RUPEE all-aligned to RUPEE top-aligned for two different benchmarks against the same structure database, SCOP v2.07, consisting of more than 250,000 structures. The easier benchmark, scop_d360, is the benchmark of known protein structures that we used in our previous work on RUPEE [7]. The harder benchmark, casp_d250, is the benchmark of protein structure predictions from above. For scop_d360, the difference between RUPEE all-aligned and top-aligned is only a fraction of a TM-score point, whereas for casp_d250, all-aligned is 0.01 to 0.02 TM-score points better than top-aligned across all ranks except for the first 10. Figure 9 suggests that RUPEE all-aligned is more suitable than RUPEE top-aligned for searching on protein structure predictions. However, for known protein structures, the performance of top-aligned is almost identical to all-aligned.

For RMSD searches, the right plots of Figs 5, 6, 7 and 8 all show RUPEE all-aligned doing significantly better than top-aligned across all ranks. This is notably different from Figure 9. Part of the reason for this larger difference is that all-aligned mode starts filtering results on the compared score much earlier in the pipeline whereas top-aligned mode starts filtering results on the compared score only after the initial filtering and NW alignments.

Sample Alignments

Figure 10 shows the structure alignments of the top-scoring full-length structure matches that RUPEE identified that none of the other protein structure searches we compared to identified. While we recognize that mTM-align, SSM, CATHEDRAL and VAST may have performed equal to or better than RUPEE on some benchmark [...] problem with finding despite the complexity of the loops. We believe it is these types of difficult matches that would be of interest to researchers investigating protein structure prediction.

With the recent successes of coevolutionary methods at CASP11 [5] and deep learning with AlphaFold at CASP13 [3], the long-standing problem of protein structure prediction has seen renewed interest. Despite this renewed interest, the problem of identifying the most similar known protein structures to structure predictions has not been explored and researchers are left navigating a variety of structure searches not specifically designed for this purpose. In addition to matching the performance of some of the best available protein structure searches on a benchmark of known protein structures as shown in our previous work on RUPEE [7], we have now shown that RUPEE effectively addresses the problem of searching on structure predictions and is uniquely suited to support the development of protein structure prediction methods.