Foldseek: fast and accurate protein structure search

Highly accurate structure prediction methods are generating an avalanche of publicly available protein structures. Searching through these structures is becoming the main bottleneck in their analysis. Foldseek enables fast and sensitive comparisons of large structure sets. It reaches sensitivities similar to state-of-the-art structural aligners while being four to five orders of magnitude faster. Foldseek is free open-source software available at foldseek.com and as a webserver at search.foldseek.com.

Foldseek (Fig. 1a) (1) discretizes the query structures into [...]

[Fig. 1 panel labels: (1) Find neighboring residues using virtual center; (2) Extract features; (3) Search 3Di state library; (4) Conversion to 3Di sequence (discretization).]

[Fig. 1 caption, fragment] (2) To describe the interaction geometry of residues i and j, we extract seven angles, the Euclidean Cα distance, and two sequence distance features from the six Cα coordinates of the two backbone fragments (blue, red).

[...] to the first FP (Fig. 2a). Foldseek reaches sensitivities at family and superfamily level below Dali, higher than the structural aligner CE, and similar to TM-align and TM-align-fast. Foldseek is much more sensitive than structural alphabet-based search tools 3D-BLAST and CLE-SW (Fig. 2a-b). Similarly, Foldseek has the second highest area under the precision-recall curve on each of the three levels (Fig. 2c, Supplementary Fig. 5). The
performance is comparable across all six secondary structure classes in SCOPe (Supplementary Fig. 6). On this small SCOPe40 benchmark set, Foldseek is more than 4,000 times faster than TM-align and Dali, and over 21,000 times faster than CE (Fig. 2b). [...]

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

We analyzed the quality of alignments produced by the top five matches per query. We computed the alignment sensitivity as the number of TP residues divided by the query length and the precision as the number of TP residues divided by the alignment length. TP residues are those with a residue-specific LDDT score above 0.6, FP residues are those below 0.25, and residues with other scores are ignored. Fig. 2f shows the av- [...]
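The alignment sensitivity and precision defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `alignment_quality` and its inputs are hypothetical, the per-residue LDDT scores are assumed to be already computed, and the alignment length is taken to be the number of aligned residue pairs.

```python
from typing import Sequence, Tuple

TP_THRESHOLD = 0.6   # per-residue LDDT above this counts as a TP (from the text)
FP_THRESHOLD = 0.25  # per-residue LDDT below this counts as an FP (from the text)


def alignment_quality(residue_lddt: Sequence[float],
                      query_length: int) -> Tuple[float, float, int]:
    """Return (sensitivity, precision, fp_count) for one alignment.

    residue_lddt holds one LDDT score per aligned residue pair.
    Assumption: alignment length = number of aligned pairs.
    """
    alignment_length = len(residue_lddt)
    tp = sum(1 for s in residue_lddt if s > TP_THRESHOLD)
    fp = sum(1 for s in residue_lddt if s < FP_THRESHOLD)
    # Scores between the two thresholds are neither TP nor FP.
    sensitivity = tp / query_length if query_length else 0.0
    precision = tp / alignment_length if alignment_length else 0.0
    return sensitivity, precision, fp
```

For example, an alignment of four residue pairs with scores 0.9, 0.7, 0.5, and 0.1 against a query of length 8 has two TP residues, one FP residue, and one ignored residue.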
[...] Foldseek, which build on the MMseqs2 framework [26], are described in the following paragraphs.

[...] Foldseek, MMseqs2, and 3D-BLAST (Fig. 2d for SCOPe40 and Supplementary Fig. 12 for AlphaFoldDB). We considered a hit as FP if it was in a different fold and had a TM-score lower [...]

Pairwise global structural alignments using TM-align

We also offer the option to use TM-align for pairwise struc- [...]

[...] the Cα atoms $(C_{\alpha,i-1}, C_{\alpha,i}, C_{\alpha,i+1})$ and $(C_{\alpha,j-1}, C_{\alpha,j}, C_{\alpha,j+1})$.
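As a rough illustration of how geometric features can be derived from the six Cα coordinates of two backbone fragments, the sketch below computes the Euclidean Cα distance between residues i and j and the cosines of two angles between unit vectors. The specific directions used here are assumptions chosen for illustration, not the descriptor definition of the paper, and the names `unit` and `pair_features` are hypothetical. The sketch relies only on the fact that the cosine of the angle between two unit vectors equals their dot product.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def pair_features(frag_i, frag_j):
    """Toy features for two backbone fragments.

    frag_i: (3, 3) array with the C-alpha coordinates of residues i-1, i, i+1.
    frag_j: (3, 3) array with the C-alpha coordinates of residues j-1, j, j+1.
    """
    frag_i = np.asarray(frag_i, dtype=float)
    frag_j = np.asarray(frag_j, dtype=float)

    # Illustrative direction choices (assumptions, not the paper's list):
    u_i = unit(frag_i[2] - frag_i[1])    # from residue i to i+1
    u_j = unit(frag_j[2] - frag_j[1])    # from residue j to j+1
    u_ij = unit(frag_j[1] - frag_i[1])   # from residue i to residue j

    return {
        "ca_distance": float(np.linalg.norm(frag_j[1] - frag_i[1])),
        "cos_i_j": float(u_i @ u_j),     # cosine of angle between u_i and u_j
        "cos_i_ij": float(u_i @ u_ij),   # cosine of angle between u_i and u_ij
    }
```

Working with cosines avoids calling `arccos`, which is numerically sensitive near 0 and 180 degrees.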
The descriptors use the five unit vectors along the following directions, [...] We define the angle between $u_k$ and $u_l$ as $\varphi_{kl}$, so $\cos\varphi_{kl} = u_k^{\top} u_l$ [...]

Additionally, we plotted precision-recall curves for each tool (Fig. 2c, Supplementary Fig. 5). After sorting the align- [...]

We measured runtimes on AlphaFoldDB for searches with the same 100 randomly selected queries used for the sensitivity and alignment quality benchmarks (Fig. 2e,f). [...]

[...] This does not properly penalize non-compact models in which each residue has few neighbors within 15 Å. We therefore use as denominator the total number of neighboring residues within 15 Å of i.

For the alignment quality benchmark (Fig. 2f), we classified each aligned residue pair as TP or FP depending on its residue-wise LDDT score, that is, the fraction of distances to its 15 Å neighbors that are within 0.5, 1, 2, and 4 Å of the distance to the corresponding residues in the query, averaged over the four distance thresholds. TP residues are those with a residue-wise LDDT score of at least 0.6 and FPs below 0.25, ignoring matches in-between. For the sensitivity benchmark (Fig. 2e), TP residue-residue matches are those with an LDDT score of the query-target alignment of at least [...]

This version of the preprint was posted September 20, 2022.

The webserver code is available at github.com/soedinglab/mmseqs2-app.
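The residue-wise LDDT score and the TP/FP classification described above can be sketched as follows. This is a minimal illustration, not the benchmark code: the names `residue_lddt` and `classify` are hypothetical, the alignment is assumed to be a dictionary mapping query residue indices to target residue indices, and the 15 Å neighbors are assumed to be defined on the query Cα coordinates.

```python
import numpy as np

THRESHOLDS = (0.5, 1.0, 2.0, 4.0)  # distance tolerances in Angstrom (from the text)
RADIUS = 15.0                      # neighborhood radius in Angstrom (from the text)


def residue_lddt(query_ca, target_ca, alignment, i):
    """Residue-wise LDDT of aligned query residue i.

    query_ca, target_ca: (n, 3) and (m, 3) C-alpha coordinate arrays.
    alignment: dict mapping query index -> target index (assumed format).
    """
    query_ca = np.asarray(query_ca, dtype=float)
    target_ca = np.asarray(target_ca, dtype=float)
    d_query = np.linalg.norm(query_ca - query_ca[i], axis=1)
    neighbors = [k for k in range(len(query_ca))
                 if k != i and d_query[k] <= RADIUS]
    if not neighbors or i not in alignment:
        return 0.0
    preserved = 0
    for k in neighbors:
        if k not in alignment:
            continue  # unaligned neighbors still count in the denominator
        d_target = np.linalg.norm(target_ca[alignment[k]] - target_ca[alignment[i]])
        diff = abs(d_query[k] - d_target)
        preserved += sum(int(diff <= t) for t in THRESHOLDS)
    # Denominator: all neighbors within 15 Angstrom of i, times four thresholds.
    return preserved / (len(THRESHOLDS) * len(neighbors))


def classify(score):
    """TP at >= 0.6, FP below 0.25, otherwise ignored (thresholds from the text)."""
    if score >= 0.6:
        return "TP"
    if score < 0.25:
        return "FP"
    return "ignored"
```

Dividing by the total number of neighbors, rather than only the aligned ones, means that a residue whose neighbors are mostly unaligned receives a low score even if the few aligned distances are preserved.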