GTalign: Spatial index-driven protein structure alignment, superposition, and search

In response to the challenge posed by vast protein structure repositories, GTalign offers an innovative solution to protein structure alignment and search—an algorithm that achieves optimal superposition at high speeds. Through the design and implementation of spatial structure indexing, GTalign parallelizes all stages of superposition search across residues and protein structure pairs, yielding rapid identification of optimal superpositions. Rigorous evaluation across diverse datasets reveals GTalign as the most accurate among structure aligners while presenting orders of magnitude in speedup at state-of-the-art accuracy. GTalign’s high speed and accuracy make it useful for numerous applications, including functional inference, evolutionary analyses, protein design, and drug discovery, contributing to advancing understanding of protein structure and function.


Introduction
In contemporary structural bioinformatics, the advent of advanced artificial neural network architectures [1][2][3] has ushered in an era where protein structures are predicted with high accuracy for a myriad of protein sequences [4,5]. This surge in structural data has presented a challenge: the need for efficient and rapid protein structure comparison to distill meaningful insights from the burgeoning repositories of three-dimensional protein structures. Such tools enable the extraction of biologically relevant information, decipher evolutionary relationships [6][7][8], and contribute significantly to understanding functional mechanisms encoded within protein structures [9][10][11]. In this context, the development of computational tools for rapid large-scale protein structure alignments represents an important step forward.
The need for efficient tools gave rise to a variety of computational techniques, each with its own strengths and limitations, aiming to achieve accurate and rapid comparisons of protein structures. Two fundamental approaches have emerged: local pattern matching and rigid-body superposition optimization. Local pattern matching involves analyzing structures independently, making it suitable for handling flexible regions like linkers between protein domains. This approach encompasses various strategies such as optimizing the match between protein inter-residue distance matrices [12] or probability distributions [13], aligning secondary structure elements using double dynamic programming [14] or graph matching [15], finding and extending [16] or chaining [17] aligned fragment pairs with optimal inter-residue distance matching, and quantifying evolutionary similarity to infer initial alignments [13,18].
On the other hand, methods based on rigid-body superposition treat structures as rigid bodies and focus on optimizing their local or global spatial agreement. These methods may involve optimizing superposition for protein fragments [19], iterative superposition at different distance cutoffs [20], or multi-stage fragment-based superposition optimization [21]. To achieve accurate pairwise protein alignments, these methods [20,21], and those employing local pattern matching [14][15][16][17][18], apply iterative alignment refinement that includes rigid-body superposition.
While approaches like TM-align [21] and Dali [22] have been essential in advancing protein structure comparison, their computational complexity has made them less practical for large-scale applications to growing protein structure databases. To address this challenge, some strategies involve preprocessing
The results (Fig. 1b-d, Supplementary Section S1.1, Supplementary Fig. S1a, Supplementary Table S1) show that GTalign consistently outperforms all the aligners in terms of accuracy. GTalign (option --speed=0) produces up to 7% more alignments with a TM-score ≥ 0.5 than TM-align, the second most accurate tool (732,024 vs. 683,996, SCOPe40 2.08 dataset). This trend persists across the entire TM-score significance range from 0.5 to 1.0 (Supplementary Table S2). However, there are exceptions: GTalign reports 191 fewer alignments than TM-align for TM-scores ≥ 0.7 in the PDB20 dataset and 31 fewer alignments for TM-scores ≥ 0.8 in the Swiss-Prot dataset. Further investigation revealed that these differences stem from aligning queries with short proteins of fewer than 30 residues. For a pair of proteins, GTalign employs approximate partial sorting to select candidate alignments for detailed refinement. Alignments for very small proteins or peptides score similarly, and this approximation can occasionally lead to a suboptimal final alignment.
While the issue of short proteins is recognized and left for future resolution, analysis based on the TM-score normalized by the query length effectively diminishes the significance of alignments between queries and much smaller proteins. In this evaluation setting (Fig. 2, Supplementary Fig. S1b, Supplementary Table S3), GTalign demonstrates superior accuracy over the other aligners again. For example, in the SCOPe40 2.08 dataset, GTalign (option --speed=0) produces up to 7% more alignments with a TM-score ≥ 0.5 than TM-align (492,887 vs. 460,847). Disregarding insignificant differences of one alignment, GTalign exhibits superiority across the full TM-score significance range (Supplementary Table S4).
GTalign also demonstrates higher accuracy on the HOMSTRAD dataset (Fig. 3), which contains reference structure alignments grouped into evolutionarily and structurally related protein families. These families exhibit relatively high structural similarity, as evidenced by the TM-score distribution of the reference alignments (Fig. 3). Therefore, these results highlight GTalign's utility in improving structural alignments, even among highly similar proteins. This is further supported by the greater accuracy of GTalign-produced alignments compared to the reference alignments, suggesting potential applications in constructing reference datasets and classifying proteins. A similar trend was observed previously [34].
Additionally, we provide structural examples in Supplementary Figs. S2-S5, each corresponding to one dataset used in our benchmark study. These examples highlight nontrivial structural similarities identified by GTalign but overlooked or misaligned by all other benchmarked tools. Notably, all showcased examples demonstrate domains (even from different folds; see a subsection below) or larger significant structural segments, with insertions and deletions, sharing the same topology, as confirmed by the TM-scores and structural alignments. Therefore, Supplementary Figs. S2-S5 exemplify GTalign's primary objective of achieving optimal protein spatial superpositions and detecting subtle yet significant structural similarities.
Clearly, the sequence prefiltering strategy contributes to Foldseek's high speed. However, this comes at a high price in accuracy and sensitivity (only 13,371 alignments with a TM-score ≥ 0.5; SCOPe40 2.08 dataset, Fig. 1b; see also Fig. 2 and Supplementary Tables S2 and S4). When GTalign is configured to use sequence prefiltering (options --speed=13 --pre-similarity=15), a similar pattern emerges, with runtimes comparable to Foldseek (428 vs. 196 seconds, Swiss-Prot dataset) but decreased sensitivity. In contrast, no such effect is observed for prescreening in the structure space (option --pre-score; Figs. 1 and 2). This phenomenon can be attributed to at least two factors. First, low sequence similarity does not necessarily correlate with low structural similarity, as demonstrated by the results. Second, the generation of accurate structural alignments using a single scoring scheme per protein pair may lack consistency. To address this, we explored the use of scores derived from spectral analysis of rotation-invariant two-dimensional representations of geometric features, such as angles and distances between residues, in the frequency (Fourier) domain. However, this approach demonstrated inconsistent results and requires further investigation. Despite these observations, leveraging prescreening in the sequence space can prove valuable for high-similarity searches.
GTalign offers additional computational advantages by providing the option to utilize multiple GPUs for computation. This feature was used for processing the SCOPe40 2.08, PDB20, and Swiss-Prot datasets, where GTalign exploited the computational power of all three Tesla V100 GPUs available on the system. Supplementary Table S1 provides GTalign runtimes on one, two, and three GPUs, demonstrating scalability across all benchmarked parametrizations. Furthermore, the results presented in Supplementary Section S1.2 and Supplementary Table S5 reveal a noteworthy performance trend: a more recent desktop-grade GPU consistently outperforms the three server-grade V100 GPUs, demonstrating GTalign's high performance even when run on a single, relatively inexpensive GPU.

Alignment accuracy evaluation using RMSD and GDT_TS
TM-score, used to evaluate alignment accuracy in the previous subsection, is a global measure sensitive to alignment coverage due to normalization by protein length. Here, we turn to root-mean-squared deviation (RMSD) and global distance test (GDT) [20] for evaluation. RMSD, a measure of spatial proximity, is normalized by the number of aligned residue pairs and is effective at capturing accurately aligned local protein regions (see Supplementary Fig. S6 in Supplementary Section S1.3 for an example). However, optimizing alignments based solely on RMSD can yield short aligned fragments that provide limited insight into structural similarity at the domain or protein level. Previous approaches [14,15,20] sought to find a balance between RMSD and alignment coverage to generate alignments sufficiently long to assess structural similarity without being overly divergent.
The GDT score (GDT_TS) [20] is another measure of spatial proximity, calculated at four different distance thresholds (1, 2, 4, and 8 Å), which does not over-penalize spatially unmatched residue pairs. In this section, our RMSD and GDT_TS-based evaluation focuses on local alignments within alignment boundaries, providing insight into the extent to which alignments can be shortened to increase local precision by reducing alignment coverage. The results (Fig. 4) reveal that even among local alignment methods such as Foldseek (default parametrization), FATCAT, and DeepAlign, only DeepAlign produces alignments with lower RMSDs and higher GDT_TS scores than GTalign (Supplementary Table S6). However, DeepAlign achieves 15% and 18% lower RMSDs (2.48 and 2.03 vs. 2.91 and 2.47 for the SCOPe40 2.08 and PDB20 datasets) with 20% and 17% fewer aligned residues on average (56.0 and 67.4 vs. 47.6 and 57.1) compared to GTalign (--speed=0). The average TM-scores of GTalign alignments (Supplementary Table S1) exceeding those of DeepAlign alignments suggest that the difference in the number of aligned residues does not imply misaligned pairs. Indeed, Supplementary Table S1 and Supplementary Table S6 demonstrate that both GTalign and TM-align produce alignments that strike a good balance between coverage and precision. On average, GTalign (--speed=0) achieves lower RMSDs with higher alignment coverage compared to TM-align. In some cases, this is achieved by identifying different, more optimal spatial superpositions (see

Benchmarking against the SCOPe dataset reference
GTalign's main objective is to achieve an optimal superposition for a pair of structures in order to infer their structural similarity, which often indicates an evolutionary relationship. This section assesses GTalign and the other tools from an evolutionary standpoint by examining their ability to replicate SCOPe classification. Importantly, this benchmark does not measure alignment accuracy or the rate of accurate alignments (Figs. 1-3) but rather the consistency between a tool's ranking of structure pairs and SCOPe classification.
The SCOPe knowledgebase categorizes protein domains into families, superfamilies, folds, and classes. Families group domains based on sequence similarity, with those sharing a common ancestor organized into superfamilies. Folds comprise structurally similar superfamilies, while classes are arranged by secondary structure content and organization [29]. Fig. 5 shows the relationship between the precision and recall (PR) of matching domains of the same SCOPe 2.08 family, superfamily, and fold for each tool. As discussed in the next subsection, significant structural similarities extend even across folds and classes. Consequently, Fig. 5a  folds [35]. Additionally, Fig. 5b shows the PR curves when disregarding cross-fold relationships, with FPs corresponding to pairs from different SCOPe 2.08 folds.
The results in Fig. 5 and the areas under the PR curves (AUPRCs) reported in Supplementary Table S7 (Supplementary Section S1.4) demonstrate that GTalign (variants --speed=0 and --speed=13) generally outperforms the other tools, except for Dali in the evaluation that ignores cross-fold relationships at the fold level, where the difference in AUPRC is <1%. When evaluating sensitivity in identifying related domains before encountering the first FP (Supplementary Fig. S7), GTalign shows lower average sensitivity compared to both TM-align versions at the family and fold levels, and compared to Dali and DeepAlign at the superfamily level. However, these differences in the distributions of sensitivity values are statistically insignificant (Supplementary Table S8).
Second, disregarding cross-fold relationships considerably affects precision, recall, and AUPRCs, rendering evaluations unstable. Further discussion on this aspect follows in the subsequent subsection.

Limitations of SCOPe-based evaluation
In the SCOPe classification, the traditional emphasis on organizing proteins according to their sequence similarity to classified entries [36] reflects a primary focus on evolutionary relationships. However, structural similarities among proteins extend beyond evolutionary connections [37]. Classifying proteins into discrete folds presents challenges due to inherent ambiguity in defining folds [38,39]. For example, what level of insertions or deletions can be considered critical for classifying a domain into a different fold? A more nuanced perspective suggests that protein fold space may exhibit both discrete and continuous characteristics [37,39]. While high structural or evolutionary similarities may support discrete fold assignments, lower yet significant similarities imply a continuous nature of fold space.
Our analysis supports these assertions. Examination of the distribution of statistically significant TM-scores, obtained by aligning query and subject structures from the SCOPe40 2.08 dataset using TM-align, reveals that only highly similar protein domains (TM-score > 0.8) within the same folds outnumber those from different folds (Fig. 6, Supplementary Fig. S8). Domains from different folds span the entire TM-score significance range, with all of them deemed errors (false positives) despite evident structural and topological similarities (Fig. 6).
These insights underscore the limitations of the SCOPe-based evaluation. While the reference SCOPe-based evaluation provides a convenient approach for benchmarking structure alignment tools, its inability to capture the full complexity of protein structure space is evident.

Conclusion
GTalign provides an efficient solution for searching vast protein structure datasets at different levels of accuracy. Its high efficiency is exemplified by a speedup of 6 orders of magnitude over TM-align when aligning large protein complexes (Supplementary Fig. S9). GTalign's cross-platform implementation, user-friendly interface, and high configurability, including the option for clustering structures (Supplementary Section S1.6), underscore its accessibility and versatility. Providing orders of magnitude in speedup at state-of-the-art accuracy, GTalign positions itself as a valuable tool among existing structure aligners.

Structure representation
GTalign offers users the flexibility to configure and choose which protein structure atoms will serve as representatives. By default, protein structures are represented using alpha-carbon atoms. All experiments conducted with GTalign were performed using alpha-carbon atoms as representatives.

Algorithm outline
The GTalign software takes as inputs query and subject (referred to as "reference" in the software) structure databases of arbitrary size. GTalign processes this data in chunks, aligning batches of query structures with batches of subject structures, both sorted by length, iteratively until all possible batch pairs are completed.
A similar batch-oriented approach to processing large databases has been described previously [27]. Below, we outline the (sequential) algorithmic steps representing the actions performed on a pair of batch query and subject structures.
Certain steps in the algorithm involve the alignment refinement procedure described in Algorithm 10 (RefineBestAlignments) specified in Supplementary Section S2. Algorithm 10 optimizes TM-scores and refines alignments by considering differently positioned alignment fragments of different lengths. It takes three parameters: the numbers of query and subject structures, where all possible pairs are processed in parallel, and a gap opening penalty for the COMER2 DP algorithm [27] to generate alignments that optimize TM-scores given the superpositions. The complete outline of the GTalign algorithm is provided below.
1. Index query and subject structures and store the spatial indices in a k-d tree data structure.
2. Assign secondary structure states to the structures at each residue in parallel. This assignment is determined by the coordinates of five residues centered around the residue under consideration, with distance cutoffs between residues optimized in [21].
3. Calculate transformation matrices based on continuous fragment pairs in parallel for all queries and subjects, their matched fragment pairs, and fragment positions.
4. Apply Algorithm 10 with parameters (n_Q, n_S, 0) to refine alignments obtained from the superpositions found in the previous step to maximize the TM-score (always normalized by the length of the shorter protein). n_Q and n_S are the numbers of query and subject structures in batches. Here and in the following steps, keep track of the maximum TM-score and the corresponding transformation matrix for all query-subject pairs.
5. If the option --add-search-by-ss is specified, apply Algorithm 10 with parameters (n_Q, n_S, {−0.6, 0}) to refine alignments obtained from the application of the COMER2 DP algorithm [27] using a scoring function based on secondary structure matching and sequence similarity score [40].
6. Apply Algorithm 1 (DeepSuperpositionSearch) to find the most favorable superpositions through a deep search using spatial indices. The search depth is controlled with the --speed option. This step is central to the GTalign method because it enables rapid exploration of the superposition space, resulting in accurate alignments. We provide a detailed specification of Algorithm 1 in Supplementary Section S2.
7. Apply Algorithm 10 with parameters (n_Q, n_S, {−0.6, 0}) to refine alignments obtained from the application of the COMER2 DP algorithm using a scoring function based on secondary structure matching and TM-score, given the optimal transformation matrices obtained so far.
8. Apply Algorithm 10 with parameters (n_Q, n_S, {−0.6, 0}) to refine alignments obtained from the application of the COMER2 DP algorithm using TM-score as a scoring function. Here, the number of repetitions in Algorithm 10 is configurable (option --convergence).
9. Produce final alignments using the COMER2 DP algorithm based on the optimal transformation matrices in parallel for all queries and subjects.
10. Calculate TM-scores, root-mean-squared deviations (RMSDs), and other alignment-related statistics in parallel for all queries and subjects.
Steps 1 and 2 prepare data for processing. Steps 3 and 4 identify protein superpositions by matching continuous protein segments, similar to TM-align's initial gapless matching [21]. These steps are sufficient to capture optimal superpositions of proteins sharing high structural similarity over a significant fraction of the length of at least one protein of a pair.
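The batch-oriented processing that precedes these steps can be illustrated with a minimal sketch. It assumes a hypothetical per-batch residue budget (max_residues) and hypothetical helper names (make_batches, batch_pairs); GTalign's actual chunking criteria are not specified here and may differ.

```python
from itertools import product

def make_batches(lengths, max_residues):
    """Sort structure indices by length and pack them greedily into batches.

    max_residues is a hypothetical per-batch budget, not a GTalign parameter.
    A structure longer than the budget still gets a batch of its own.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, used = [], [], 0
    for i in order:
        if current and used + lengths[i] > max_residues:
            batches.append(current)
            current, used = [], 0
        current.append(i)
        used += lengths[i]
    if current:
        batches.append(current)
    return batches

def batch_pairs(query_lengths, subject_lengths, max_residues):
    """Yield every (query batch, subject batch) pair exactly once."""
    query_batches = make_batches(query_lengths, max_residues)
    subject_batches = make_batches(subject_lengths, max_residues)
    for qb, sb in product(query_batches, subject_batches):
        yield qb, sb

if __name__ == "__main__":
    for qb, sb in batch_pairs([120, 50, 300], [80, 400, 60, 200], 300):
        print("queries", qb, "vs subjects", sb)
```

Each yielded pair corresponds to one pass through the ten steps above. Sorting by length keeps structures of similar size together, which tends to balance the work done by parallel threads within a batch.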

Spatial index data structure
To accelerate the superposition search, protein structures are initially indexed (step 1 of the algorithm outline). Each structure's index is stored in a k-d tree data structure, which hierarchically organizes protein atom coordinates. This organization allows for the retrieval of the nearest neighbor in the tree for a query atom with specified coordinates in logarithmic, O(log n), time on average, where n is the number of indexed atoms.
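As an illustration of this kind of index, the following is a minimal pure-Python sketch of a three-dimensional k-d tree with nearest-neighbor lookup. The function names (build_kdtree, nearest, index_structure) and the recursive tuple layout are assumptions made for readability; GTalign's own index is laid out for GPU execution and is specified in Supplementary Section S2.

```python
def build_kdtree(points, depth=0):
    """Build a 3-d tree from (coords, index) items.

    Each node is a tuple (coords, index, left_subtree, right_subtree);
    the splitting axis cycles through x, y, z with depth.
    """
    if not points:
        return None
    axis = depth % 3
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid][0], points[mid][1],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, depth=0, best=None):
    """Return (squared_distance, index) of the stored point nearest to target."""
    if node is None:
        return best
    coords, index, left, right = node
    d2 = sum((a - b) ** 2 for a, b in zip(coords, target))
    if best is None or d2 < best[0]:
        best = (d2, index)
    axis = depth % 3
    diff = target[axis] - coords[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, target, depth + 1, best)
    return best

def index_structure(coords):
    """Index a list of (x, y, z) atom coordinates, remembering residue order."""
    return build_kdtree([(c, i) for i, c in enumerate(coords)])

if __name__ == "__main__":
    ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (3.8, 3.8, 1.0)]
    tree = index_structure(ca)
    print(nearest(tree, (7.0, 3.0, 0.2)))
```

The pruning test in nearest is what lets a query skip most of the tree: a subtree is searched only when the splitting plane lies closer to the query than the best match found so far.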

Accelerated superposition search using spatial indexing
The accelerated superposition search process (step 6 of the algorithm outline) leverages spatial indexing to find optimal superpositions for query and subject proteins within a data chunk. Conducted in parallel for all query-subject protein pairs in the chunk, this process explores numerous initial superposition configurations per protein pair simultaneously.
Initially, the process calculates initial superpositions, or transformation matrices, based on continuous query and subject protein fragments spanning the entire extent of both proteins. The search depth, determining the number of superpositions to explore, and the fragment length depend on the query and subject protein lengths, with the fragment length not exceeding 100 residues. Initial superpositions are neither calculated nor explored for query-subject fragment pairs with low local secondary structure similarity.
Upon completing initial superpositions, the shorter protein undergoes transformation to obtain spatial overlays. Then, alignments are generated between the shorter protein and the other using the longer protein's index in parallel over residues, achieved in logarithmic time complexity. This routine repeats twice: initially produced alignments refine spatial overlays, followed by repeated alignment production using the protein index while ensuring matching protein secondary structure this time.
The most favorable alignment for the query-subject pair is then selected based on the highest TM-score. However, alignments obtained using spatial indices are sequence order-independent. Therefore, approximate sequence order-dependent TM-scores are computed from these alignments, with one structure transformed, in sub-linear time considering a maximum of 512 aligned residues.
Next, a small subset of transformation matrices with the highest approximate scores is chosen for TM-score calculation using the COMER2 DP algorithm. Further refinement involves selecting an even smaller subset of transformation matrices corresponding to the highest TM-scores to optimize alignments, considering different-length and differently positioned alignment fragments. Finally, the best alignment for the query-subject pair is selected and refined similarly, employing full DP and TM-score optimization.
A detailed specification of this procedure is provided in Supplementary Section S2.

Calculation of rotation matrices
GTalign computes rotation matrices using the Kabsch algorithm [41,42]. Solving for the eigenvalues and eigenvectors of a cross-covariance matrix, K ∈ ℝ^{3×3} (R in the original notation), requires double-precision arithmetic. To render the problem solvable in single precision, thereby boosting instruction and memory throughput, K is normalized by the mean of the absolute values of its elements. It is easy to show that this operation corresponds to scaling the coordinates of protein atoms. Effectively, rotation matrices for large proteins can be considered as obtained using coordinates expressed in nanometers instead of angstroms, preventing single-precision arithmetic overflow and underflow. The resulting rotation matrices exhibit an insignificant error (on the order of 10^-5 on average) with no discernible impact on superposition and structural alignment while still ensuring high performance.
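The scaling argument can be spelled out in a few lines. The derivation below is a sketch that assumes centered coordinates x_i and y_i, a uniform scale factor a > 0, and the common SVD formulation of the Kabsch solution; this formulation is chosen for presentation and is not necessarily how GTalign solves the eigenproblem internally.

```latex
\begin{align*}
K &= \sum_i \mathbf{x}_i \mathbf{y}_i^{\top},
  \qquad \mathbf{x}_i \to a\,\mathbf{x}_i,\quad \mathbf{y}_i \to a\,\mathbf{y}_i
  \;\Longrightarrow\; K \to a^{2} K, \\
m &= \tfrac{1}{9} \sum_{j,k} \lvert K_{jk} \rvert,
  \qquad a = m^{-1/2}
  \;\Longrightarrow\; a^{2} K = K / m, \\
K &= U S V^{\top}
  \;\Longrightarrow\; K / m = U \,(S / m)\, V^{\top}, \\
R &= V \,\mathrm{diag}\bigl(1,\, 1,\, \det(V U^{\top})\bigr)\, U^{\top}
  \quad \text{(identical for } K \text{ and } K/m\text{)}.
\end{align*}
```

Dividing K by m rescales only the singular values, so the singular vectors, and hence the rotation R, are unchanged, while the entries of the normalized matrix have a mean absolute value of 1 and stay well within single-precision range.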

Dynamic programming implementation
A previously published algorithm [27] was employed to implement the dynamic programming (DP) algorithms. The time complexity to calculate DP matrices is O(max_q l_q + max_s l_s), with a constant factor dependent on the number of threads running in parallel. (The computation involves (max_q l_q + max_s l_s)/32 iterations of independent and parallelized calculations, executed in O(32) time by GPU threads.) Here, the maximums are taken over the lengths of all query (l_q) and subject (l_s) proteins in a data chunk. In instances where DP matrix values are only required to update backtracking information, the memory complexity for DP matrices is O(n_Q Σ_s l_s) [27], with n_Q representing the number of query proteins in the chunk. DP matrices are built using a gap opening penalty, while the gap extension cost is set to 0 unless otherwise specified. To optimize memory usage, match scores are modified before being written to memory, either by negating them for non-negative alignment scores or by subtracting a large constant for potentially negative scores. The reverse operation is subsequently applied upon reading from memory. This approach minimizes memory requirements, facilitating greater data accommodation and parallelization.
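The linear dependence of the number of DP iterations on the sum of the two lengths follows from anti-diagonal (wavefront) traversal, which the sketch below illustrates. It is a simplified stand-in rather than the COMER2 DP algorithm: it assumes a linear gap penalty instead of a gap opening penalty with zero extension, zero-valued matrix borders, and a toy score matrix; the function names are hypothetical.

```python
def dp_row_major(scores, gap):
    """Reference DP filled row by row; returns the bottom-right cell."""
    n, m = len(scores), len(scores[0])
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + scores[i - 1][j - 1],
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    return D[n][m]

def dp_wavefront(scores, gap):
    """Same recurrence filled one anti-diagonal (i + j = d) at a time.

    Returns (bottom-right cell, number of sequential steps).
    """
    n, m = len(scores), len(scores[0])
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    steps = 0
    for d in range(2, n + m + 1):
        lo, hi = max(1, d - m), min(n, d - 1)
        # Cells on one anti-diagonal depend only on diagonals d-1 and d-2,
        # so on a GPU this inner loop could be spread across parallel threads.
        for i in range(lo, hi + 1):
            j = d - i
            D[i][j] = max(D[i - 1][j - 1] + scores[i - 1][j - 1],
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
        steps += 1
    return D[n][m], steps

if __name__ == "__main__":
    toy = [[1.0, 0.2, 0.1, 0.0],
           [0.3, 0.9, 0.2, 0.1],
           [0.0, 0.4, 0.8, 0.7]]
    print(dp_row_major(toy, -0.6), dp_wavefront(toy, -0.6))
```

For an n × m matrix the wavefront version takes n + m − 1 sequential steps, compared with n · m cell updates in the row-by-row version; when the cells of each anti-diagonal are processed by parallel threads, the elapsed time therefore scales with the sum of the lengths rather than their product.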

Algorithm efficiency
The high efficiency of the developed algorithms, particularly the spatial indices, is best demonstrated by aligning large protein structures. For instance, GTalign only took seconds to align and provide a superposition for two virus nucleocapsid variants, 7a4i and 7a4j (37,860 residues each), featuring different chain orders on a single Tesla V100 GPU (Supplementary Fig. S9). In contrast, aligning these complexes using TM-align took more than three months. Although TM-align is not typically used for aligning complexes, GTalign's efficiency may open additional possibilities for exploring large complexes when chain order preservation is important.

Performance improvement potential
GTalign's efficiency can be further enhanced by considering three key aspects. First, GTalign currently uses 32-bit floating-point precision (FP32) operations. Exploring the adoption of 16-bit (FP16) or even 8-bit (FP8) floating-point precision before the final stages of alignment has the potential to increase the degree of parallelization by 2 to 4-fold.
Second, the COMER2 DP algorithm, a critical component for accuracy, is employed several times throughout the structural alignment search procedure. Substituting it with spatial matching, as outlined in Supplementary Section S2, at all intermediate stages and reserving it solely for the final alignment stage could result in a significant speedup.
Finally, the third aspect involves similarity selection on the coarse scale. By encoding structures with embeddings and utilizing indexed vector databases [43], GTalign could achieve nearly instantaneous selection of similar protein candidates and a constant-time database search and alignment, regardless of the database size.

Prescreening for similarities in sequence and structure space
GTalign allows for an initial screening in the sequence space (option --pre-similarity) to identify potential similarities before engaging in more detailed structural analysis. The implementation of this procedure is based on calculating local ungapped alignment scores between protein sequences using a sequence similarity score table [40] and does not involve dynamic programming. Protein pairs with alignment scores exceeding a specified threshold progress to the subsequent stages of structural analysis.
Additionally, an initial screening for similarities is available in the structure space using the --pre-score option. With this option, protein pairs with provisional TM-scores, obtained in step 4 of the algorithm outline, lower than a specified threshold are excluded from further processing.
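The sequence-space prescreen can be pictured with the following minimal sketch. It assumes a toy scoring function (toy_score: +2 for identity, −1 otherwise) in place of the similarity score table of [40], and hypothetical function names; it scans each diagonal of the sequence pair with a running score that is reset to zero whenever it turns negative, so no DP matrix is built.

```python
def toy_score(a, b):
    """Toy residue similarity; a stand-in for a real score table."""
    return 2 if a == b else -1

def ungapped_local_score(seq_a, seq_b, score=toy_score):
    """Best score of any ungapped local segment pair, scanning every diagonal."""
    best = 0
    for offset in range(-(len(seq_a) - 1), len(seq_b)):
        i = max(0, -offset)   # position in seq_a
        j = i + offset        # position in seq_b on this diagonal
        run = 0
        while i < len(seq_a) and j < len(seq_b):
            run = max(0, run + score(seq_a[i], seq_b[j]))
            best = max(best, run)
            i += 1
            j += 1
    return best

def prescreen(query, subjects, threshold, score=toy_score):
    """Keep the names of subjects whose score exceeds the threshold."""
    return [name for name, seq in subjects.items()
            if ungapped_local_score(query, seq, score) > threshold]

if __name__ == "__main__":
    subjects = {"s1": "XXACDEFGXX", "s2": "WWWWWW"}
    print(prescreen("ACDEFGHIK", subjects, threshold=10))
```

Only the pairs returned by prescreen would proceed to structural analysis. The structure-space prescreen works analogously, with the provisional TM-score from step 4 taking the place of the sequence score.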

GTalign software
GTalign incorporates several key features that contribute to its versatility and user-friendly nature. Developed using the OpenMP standard for CPUs and CUDA architecture for GPUs, GTalign is compatible with various computing architectures, including NVIDIA Pascal, Turing, Volta, Ampere, Ada Lovelace, and subsequent GPU architectures. (The GPU version exhibits a 10-20x increase in speed.) Its independence from external packages ensures seamless operation across different compilers (GCC, LLVM/Clang, MSVC) and their respective versions. GTalign is cross-platform software, with binary packages precompiled for Linux and Windows x64 operating systems. For other platforms, users can compile GTalign from its source code. GTalign usage is straightforward: no structure database preprocessing is required. Users run GTalign by directly providing files, compressed files (gzip), directories, and/or archives (tar) of protein structures as command-line arguments. This design enhances accessibility and facilitates integration into diverse computational environments.

Alignment accuracy evaluation
The evaluation of structural alignment accuracy is based on assessing how accurately the structural alignments of protein pairs translate to spatial agreement in their respective structures. This self-contained evaluation is unbiased, as it does not depend on external classifications, which may be constructed using specific sequence and structure alignment tools.
The superposition of two aligned proteins is evaluated by the TM-score and RMSD, calculated by the established method TM-align [21] using the -I option (Fig. 7). Notably, in this setting, TM-align does not perform a global superposition search but instead optimizes superposition constrained by a given alignment, leaving it unchanged.
GDT_TS scores were calculated using the TM-score tool [26], with minimal modifications to the source code to normalize GDT_TS by the number of aligned residue pairs. The adapted TM-score code is publicly available.
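To make the two distance-based measures concrete, the sketch below computes RMSD and a GDT_TS value normalized by the number of aligned residue pairs from a list of per-pair distances. It is a simplification under stated assumptions: the distances are taken from one fixed superposition (the TM-score tool searches for a superposition per cutoff), the comparison uses "≤", and the function names are hypothetical.

```python
import math

def rmsd(distances):
    """Root-mean-squared deviation over aligned residue pairs (in Å)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))

def gdt_ts_aligned(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average fraction of aligned pairs within each cutoff.

    Normalized by the number of aligned pairs, not by protein length.
    """
    n = len(distances)
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    return sum(fractions) / len(cutoffs)

if __name__ == "__main__":
    pair_distances = [0.5, 1.5, 3.0, 6.0, 10.0]  # toy values, in Å
    print(round(rmsd(pair_distances), 3), gdt_ts_aligned(pair_distances))
```

Because both measures are normalized by the number of aligned pairs, removing poorly matched pairs from an alignment improves them, which is why they are read together with alignment coverage.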
In the benchmarks, alignments are evaluated based on (i) the TM-score normalized by the length of the shorter protein and (ii) the TM-score normalized by the query length. The first scenario considers all structural similarities, including instances where smaller proteins match regions of larger proteins. The second scenario downgrades the importance of alignments between the query and a much shorter subject protein, providing a more favorable position for some methods (e.g., Dali [22]) as their measures (e.g., Z-score) reduce the significance of such alignments.
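The effect of the normalization length can be seen in a small sketch of the standard TM-score formula, TM = (1/L) Σ_i 1/(1 + (d_i/d_0)^2) with d_0 = 1.24 (L − 15)^(1/3) − 1.8. The 0.5 Å floor on d_0 for short chains and the function names are assumptions of this sketch; the distances are taken as given for a fixed superposition.

```python
def d0(norm_length):
    """Distance scale d0 for a given normalization length (in residues)."""
    if norm_length <= 21:
        return 0.5  # assumed floor for short chains
    return max(0.5, 1.24 * (norm_length - 15) ** (1.0 / 3.0) - 1.8)

def tm_score(distances, norm_length):
    """TM-score of aligned-pair distances (Å) for a chosen normalization length."""
    scale = d0(norm_length)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / norm_length

if __name__ == "__main__":
    # A 60-residue subject matching perfectly inside a 200-residue query.
    distances = [0.0] * 60
    print(tm_score(distances, 60))   # normalized by the shorter protein
    print(tm_score(distances, 200))  # normalized by the query length
```

In this toy case the same alignment scores 1.0 when normalized by the shorter protein and about 0.3 when normalized by the query length, which is how scenario (ii) downgrades matches to much shorter subjects.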

SCOPe-based evaluation
This evaluation aimed to assess the ability of the tools to match SCOPe 2.08 [29] domains to families, superfamilies, and folds. True positives (TPs) at the family, superfamily, and fold level were defined as pairs of structures from the same family, the same superfamily but different families, and the same fold but different superfamilies, respectively. Self-matches were excluded. The sizes of these groups are referred to as effective sizes. False positives (FPs) were identified as pairs from different folds.
Precision and recall were calculated as #TP/(#TP + #FP) and #TP/P, respectively, where P represents the total number of positive pairs. The number of TPs, #TP, and P for precision-recall (PR) curves were downweighted by the effective size of family, superfamily, and fold for respective-level calculations. The number of FPs, #FP, was downweighted by the effective fold size. The weighting for counts was consistent with the approach used in [25].
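A weighted PR curve of this kind can be sketched as follows. The sketch assumes that each ranked pair carries a precomputed weight (for example, the reciprocal of the relevant effective size, which is an assumption about how the downweighting is realized) and that pairs that are neither TPs nor FPs at the evaluated level have already been removed; pr_curve is a hypothetical name.

```python
def pr_curve(ranked, total_positive_weight):
    """Precision and recall after each ranked pair.

    ranked: list of (is_true_positive, weight), most significant first.
    total_positive_weight: weighted count P of all positive pairs.
    """
    tp = fp = 0.0
    curve = []
    for is_tp, weight in ranked:
        if is_tp:
            tp += weight
        else:
            fp += weight
        curve.append((tp / (tp + fp), tp / total_positive_weight))
    return curve

if __name__ == "__main__":
    ranked = [(True, 0.5), (True, 0.5), (False, 0.25), (True, 1.0)]
    for precision, recall in pr_curve(ranked, total_positive_weight=3.0):
        print(round(precision, 3), round(recall, 3))
```

The area under the resulting curve gives an AUPRC value of the kind compared across tools.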
Before conducting sensitivity and PR analyses, alignments generated by the tools were sorted by their significance measure. Foldseek (default parametrization) alignments were sorted by E-value, while FATCAT alignments were sorted by P-value, and Dali alignments by Z-score. DeepAlign alignments were sorted by DeepScore. TM-align and GTalign alignments were sorted by the harmonic mean of the TM-scores normalized by the query and subject lengths. The harmonic mean proved superior to the arithmetic mean for TM-align and GTalign alignments due to its ability to reduce significance for structure pairs with large length differences. However, the arithmetic mean was more suitable for Foldseek --tmalign-fast 1 and --tmalign-fast 0 alignments, as most of such pairs had already been filtered out.
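A small numeric example shows how the two means rank pairs differently. The TM-score values below are made up for illustration, and the function names are hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores (0 if both are 0)."""
    return 2.0 * a * b / (a + b) if a + b > 0 else 0.0

def arithmetic_mean(a, b):
    return (a + b) / 2.0

if __name__ == "__main__":
    # (TM-score by query length, TM-score by subject length), toy values
    unbalanced = (0.9, 0.3)   # large length difference between the two chains
    balanced = (0.55, 0.55)   # chains of similar length
    for name, pair in (("unbalanced", unbalanced), ("balanced", balanced)):
        print(name, round(harmonic_mean(*pair), 3), round(arithmetic_mean(*pair), 3))
```

The harmonic mean places the balanced pair (0.55) above the unbalanced one (0.45), whereas the arithmetic mean reverses the order (0.55 vs. 0.60), illustrating how the harmonic mean reduces the significance of pairs with large length differences.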
Secondary TM-scores, referred to as 2TM-scores, were introduced to rank GTalign alignments in the SCOPe-based evaluation.The 2TM-score is calculated over the alignment excluding unmatched helices and provided slightly improved results for fold-level evaluations.Options to calculate 2TM-scores (--2tm-score) and rank alignments by the harmonic mean of the TM-scores or 2TM-scores are available starting with version 0.15.0.

The SCOPe40 2.08 dataset
All protein domains from the SCOPe 2.08 [29] database filtered to 40% sequence identity (SCOPe40 2.08), totaling 15,177, were searched with query protein domains selected randomly, one per superfamily, from the same SCOPe40 2.08 dataset. Representatives that Dali [22] failed to reformat for its initial structural representation were omitted, resulting in a total of 2045 queries.
To ensure consistent structure interpretation between TM-align and the other tools, the structure files underwent the following changes: (i) the first model of multi-model files was retained; (ii) the chain identifier was set to 'A' to make a single-chain structure; (iii) residues were renumbered sequentially; (iv) residues lacking at least one of the N, CA, C, and O atoms were removed. HETATM records were disregarded when using Foldseek as its interpretation of these records differed from that of TM-align.
The structure files were preprocessed to ensure consistent structure interpretation across the tools: HETATM records and residues lacking at least one of the N, CA, C, and O atoms were removed. Also, the first model of multi-model files was retained.
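A minimal sketch of such preprocessing for PDB-format text is given below. It is not the script used in the benchmarks: the function name is hypothetical, it relies on the fixed-column PDB layout, and it drops HETATM records in all cases, which is an assumption that matches only some of the settings described above.

```python
BACKBONE = {"N", "CA", "C", "O"}


def preprocess_pdb(lines):
    """Return cleaned ATOM lines: first model, chain 'A', sequential numbering."""
    residues = []  # list of (residue key, atom lines) in file order
    for line in lines:
        record = line[:6].strip()
        if record == "ENDMDL":
            break  # keep only the first model
        if record != "ATOM":
            continue  # drops HETATM and all non-coordinate records
        key = (line[21], line[22:27])  # chain ID, residue number + insertion code
        if not residues or residues[-1][0] != key:
            residues.append((key, []))
        residues[-1][1].append(line.rstrip("\n"))

    cleaned = []
    number = 0
    for _, atoms in residues:
        names = {atom[12:16].strip() for atom in atoms}
        if not BACKBONE <= names:
            continue  # residue lacks at least one of N, CA, C, O
        number += 1
        for atom in atoms:
            # Column 22: chain ID; columns 23-26: residue number; column 27: insertion code.
            cleaned.append(atom[:21] + "A" + f"{number:4d}" + " " + atom[27:])
    return cleaned
```

Residues are grouped by consecutive identical chain and residue identifiers, so the sketch assumes that the atoms of one residue are contiguous in the file.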

The Swiss-Prot dataset
All UniProtKB/Swiss-Prot [31] protein structures (542,378) from the AlphaFold Database [4] were searched with 40 proteins representative of structurally diverse CRISPR-Cas systems [46]. The selection of the 40 query proteins followed a specific process: First, the 5831 PDB protein chains associated with CRISPR-Cas systems (downloaded on 10/19/2023) were clustered at a TM-score threshold of 0.4 with a length coverage threshold of 40% using GTalign with options --speed=13 --add-search-by-ss --cls-coverage=0.4 --cls-threshold=0.4 --ter=0 --split=2. Subsequently, the top 40 members from every third singleton cluster, sorted by length, were chosen as queries, with an average length of 382 residues. The query structures underwent preprocessing, involving the removal of HETATM records and residues lacking at least one of the N, CA, C, and O atoms.

The HOMSTRAD dataset
The HOMSTRAD dataset, comprising reference structural alignments of protein families and accompanying structure files, was obtained from Ref. [34], containing 398 multiple protein structure alignments from the HOMSTRAD database [32]. (The original data were inaccessible.) For benchmarking purposes, each family's first protein from the reference alignments was aligned with every other protein of the same family, resulting in a total of 1722 pairwise alignments.

Computer system configuration
Unless otherwise specified, all benchmark tests were conducted on a server equipped with two Intel Xeon Gold 5115 CPUs @ 2.4 GHz (20 hardware threads per CPU), 128GB DDR4 RAM, and three NVIDIA Tesla V100-PCIE-16GB GPU accelerators, running the CentOS 7 operating system.

Runtime evaluation
The runtimes of all tools were measured by the Linux time command.
When employing the initial screening in the sequence space (option --pre-similarity), options -s 0.3 and -c cachedir were specified. The latter was used unconditionally for the Swiss-Prot dataset. In the SCOPe-based evaluation, the --2tm-score option was specified to calculate 2TM-scores (version 0.15.0).
GTalign calculates and outputs TM-scores normalized by the length of both proteins in a pair. Consequently, the corresponding TM-scores were utilized as GTalign's measures to sort alignments (left panels of Fig. 1b-d and Fig. 2a-c).

TM-align settings
Parallel processing of all queries for each dataset was achieved by iteratively running 40 instances of TM-align [21] version 20220412 simultaneously. For the HOMSTRAD and three other datasets, each process instance was executed with the following options, respectively: <query> <subject_file> -het 1 and <query> -dir2 <db_dir> <lst_file> -het 1. Here, <query> represents a query file, <db_dir> is a directory of subject structure files, and <lst_file> is a list file of all subjects. For the Swiss-Prot dataset, -outfmt 2 was included to reduce disk space usage. The fast version (TM-align -fast) utilized an additional option, -fast.
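One way to run up to 40 instances at a time is sketched below. The options are those listed above; the executable name TMalign, the helper names, and the use of a thread pool to launch processes are assumptions, and the original driver script may have been organized differently.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tmalign_command(query, db_dir, lst_file, fast=False, compact=False, exe="TMalign"):
    """Build a database-search command line (exe is an assumed binary name)."""
    cmd = [exe, query, "-dir2", db_dir, lst_file, "-het", "1"]
    if compact:
        cmd += ["-outfmt", "2"]  # compact output, used for the Swiss-Prot dataset
    if fast:
        cmd.append("-fast")
    return cmd


def run_all(queries, db_dir, lst_file, workers=40, **options):
    """Run one search per query, keeping at most `workers` processes active."""
    def run_one(query):
        cmd = tmalign_command(query, db_dir, lst_file, **options)
        return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, queries))
```

For example, `run_all(query_files, "db", "db/list.txt", compact=True)` would collect the standard output of each search in query order; the paths here are placeholders.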

Dali settings
The standalone version DaliLite.v5 [22] was employed in the benchmark tests. Prior to initiating searches, structure files underwent reformatting to an initial representation using the command import.pl --pdbfile <struct_file> --dat <dir> --pdbid <id>. In this command, <struct_file> refers to a structure file, <dir> is a directory for reformatted structures, and <id> is an assigned structure identifier. Reformatting failed for 104 and 525 subject structures from the SCOPe40 2.08 and PDB20 datasets, respectively. The time taken for reformatting was excluded from runtime evaluations.
DeepAlign outputs the TM-score normalized by the length of the shorter protein, which was used as its measure for sorting alignments in the corresponding evaluations (left panel of Fig. 1b and c). When evaluating alignments based on the TM-score normalized by the query length, DeepAlign's DeepScore was utilized as its measure to sort the alignments (left panel of Fig. 2a and b).

FATCAT settings
FATCAT 2.0 [28] searches were conducted iteratively for all queries in three datasets using the rigid structural alignment setting. The command FATCATSearch <query_file> <sbjct_list> -i2 <db_dir> -r -o <output_file> -m was utilized, with <sbjct_list> and <db_dir> representing the list file and the directory of subject structures, respectively. For each query, FATCAT automatically initiated parallel processes corresponding to the number of processors in the system, in this case, 40. FATCAT's P-value served as the sorting measure, i.e., the alignments were sorted in ascending order based on it. It is noteworthy that FATCAT disregards HETATM records, and thus, these records were also omitted during alignment accuracy evaluation with TM-align.
As Foldseek generates TM-scores, the TM-score normalized by the length of the shorter protein was employed as the sorting measure for alignments in the left panel of Fig. 1b-d. When evaluating alignments based on the TM-score normalized by the query length (Fig. 2a-c), E-value and the average TM-score (Foldseek --tmalign-fast 0/1), a recommended metric [25], were utilized as measures to sort the alignments in ascending and descending order, respectively.
For the HOMSTRAD dataset, the command was modified to specify individual query and subject structure files instead of their directories. Also, the option --threads was set to 1, and the additional options --prefilter-mode 2 and -e 1e6 were included.

Figure preparation
Molecular graphics images were generated using UCSF Chimera [47] version 1.14. Plots were created using the ggplot2 package [48] in R [49], versions 3.6.0 and 4.3.2.

S1.2 GTalign runtimes on different machines
In assessing the performance of GTalign parameterized variants across diverse hardware configurations, we conducted runtime evaluations on three distinct machines: a server-grade system equipped with three Tesla V100 GPU accelerators, a desktop-grade machine featuring a GeForce RTX 4090 GPU, and a laptop with a GeForce RTX 4090 Laptop GPU. GTalign was configured to use 16GB of GPU RAM (option --dev-mem) for all tests. Notably, the desktop-grade machine, housing a more recent and affordable GeForce RTX 4090 GPU, outpaced the server with three Tesla V100 GPU cards when running GTalign. The detailed runtimes for each GTalign parameterized variant on these diverse machines are presented in Table S5.

S1.6 Clustering protein structures with GTalign
To showcase GTalign's clustering capabilities, we performed a clustering analysis on the entire PDB [2] database (accessed 08/18/2023) of protein structures. Using GTalign with parameters --dev-queries-total-length-per-chunk=1500 --dev-min-length=3 --dev-max-length=1500 --speed=13 --add-search-by-ss --cls-coverage=0.7 --cls-threshold=0.5 -c cachedir and the complete linkage clustering algorithm [3] (single linkage clustering is available too), we clustered protein structures at a TM-score threshold of 0.5 with a length coverage threshold of 70%. The clustering focused solely on the first protein chains, totaling 199,839 chains, and was accomplished in five days, leveraging three Tesla V100 GPUs. The distribution of cluster sizes resulting from this analysis is visualized in Fig. S10a. The significance of clustering in the structural space becomes evident when compared to traditional sequence-based clustering approaches. As demonstrated in Fig. S10b, the superposition of cluster member proteins showcases known structural relationships also detected by GTalign. In one example, a cluster encompassing proteins with the TIM beta/alpha-barrel fold reveals diversity in sequence yet high structural similarity. More than half of the proteins within this cluster share, at most, 7% sequence identity with the cluster's first member. Despite this low sequence identity, their structural alignment yields significant TM-scores ranging from 0.553 to 0.708. In contrast, conventional sequence-based clustering would consider such protein pairs as unrelated due to the lack of sequence similarity. This example underscores the effectiveness and utility of fast clustering in the structural space, a feature offered by GTalign.

[5]. Requiring no backtracking information in line 16 reduces memory demands and increases the parallelization degree. A further refinement is made by selecting (line 17) an even smaller number (n_brn < n_tfm) of transformation matrices corresponding to the highest TM-scores to optimize structural alignments derived from them (line 18). Finally, the alignments, each representing one query-subject pair, with the highest TM-score are refined (lines 19-20).

Algorithm 2 Calculate Local Similarity
for all (q, s, f_ext) ∈ [0, n_Q) × [0, n_S) × [0, n_ext) do in parallel
3: Calculate query and subject protein positions p_q and p_s from index f_ext
4: if p_q + l_f > l_q or p_s + l_f > l_s then
for all i ∈ [0, min{96, l_q − p_q}) do in parallel
[ S_{i mod 32, j mod 32} ← max{S_{i mod 32, j mod 32}, L_{q, s, p_q+i, p_s+j}} ]_mut

for all (q, s, f_ext) ∈ [0, n_Q) × [0, n_S) × [0, n_ext) do in parallel
Calculate query and subject protein positions p_q and p_s from index f_ext
5: (K_{q,s,f_ext}, c_{q,f_ext}, c_{s,f_ext}) ← Σ_{i=0}^{l_f−1} (C_{q,p_q+i} C^T_{s,p_s+i}, C_{q,p_q+i}, C_{s,p_s+i}) ⊲ parallel sum reduction

for all (q, s, f_ext) ∈ [0, n_Q) × [0, n_S) × [0, n_ext) do in parallel
if b_dyn = 1 and l_q ≥ l_s then
Calculate R_{q,s,f_ext} by the Kabsch algorithm [6,7] based on (K_{q,s,f_ext}, c_{q,f_ext}, c_{s,f_ext})

for all (q, s, f_ext) ∈ [0, n_Q) × [0, n_S) × [0, n_ext) do in parallel
Calculate query and subject protein positions p_q and p_s from index f_ext
5: l_f ← min{512, l_q, l_s}
6: if l_q ≥ l_s then ⊲ always search in the larger structure
r ← max{0, min{l_q − l_f, p_q − l_f/2}}
11: D ← C_{q,r:r+l_f−1}; D ← C_s; θ ← T_{q,r:r+l_f−1}
5: Load (D_i, D_i, •)_{i=0}^{l_f−1} from memory at (q, s, f_ext, i)

for all (q, s, f_ext) ∈ [0, n_Q) × [0, n_S) × [0, n_ext) do in parallel
l_f ← min{512, l_q, l_s}
5: x ← −1_512; m ← 0_512
6: Load (R_{q,s,f_ext}, t_{q,s,f_ext}) from memory
for all i ∈ [0, l_f) do in parallel
8: Load (D_i, D_i, j) from memory at (q, s, f_ext, i)
for i = 0, ..., l_f − 1 do
13: ω ← max_{j: x_j < z_i} m_j ⊲ parallel max reduction
if x_c < 0 or (x_c = z_i and m_c < ω + a_i) or ((i > l_f/2) ? x_c < z_i : x_c > z_i) then ⊲ heuristics upon hash collision

for all (q, s) ∈ [0, n_Q) × [0, n_S) do in parallel
x′ ← Sort((w_{q,s,f_ext})_{f_ext}) ⊲ Batcher's sort [8] in O(log_2^2 n_ext) time
for all (f_ext) ∈ [0, n_ext) do in parallel ⊲ approx. to partial sorting

displays the PR curves obtained with false positives (FPs) as pairs of domains from different folds, excluding well-known cross-fold relationships across Rossmann-like (c.2-c.5, c.27, c.28, c.30, and c.31) and beta-propeller (b.66-b.70)

Step 5 occasionally improves superpositions found in the previous steps. Step 6 conducts an extensive superposition search by matching different protein spatial regions. Steps 7 and 8 represent the refinement of transformation matrices and related alignments obtained earlier, meaning that alignment regions typically do not change or change slightly. Steps 9 and 10 prepare results for output. All the steps are based on algorithms and data structures designed to maximize instruction and memory throughput.

Fig. 1
Fig. 1 Results. a, Illustration of matching protein structures with O(1) time complexity. GTalign explores numerous superpositions in parallel. Upon obtaining a superposition, the alignment between the query protein (red) and the subject protein (blue) is generated using the subject protein's spatial index. This index allows for the independent retrieval of the nearest residue in the subject protein for each residue in the query protein, enabling parallel processing. b, Benchmarking results on the SCOPe40 2.08 dataset with 2045 queries and 15,177 database entries. Parameterized runs of GTalign, TM-align, and Foldseek are included. The left panel plots the cumulative TM-score (normalized by the shorter protein length) against the number of top alignments ranked by a tool's measure (TM-score, Z-score, or P-value). In the middle panel, the alignments are sorted by their (TM-align-obtained) TM-score. Vertical lines indicate the number of alignments with a TM-score ≥ 0.5. The arrow denotes the largest difference in that number between GTalign (732,024) and Foldseek (13,371). The right panel shows the cumulative TM-score plotted against runtime in seconds. c, Benchmarking results on the PDB20 dataset with 186 queries and 18,801 database entries. d, Benchmarking results on the Swiss-Prot dataset with 40 queries and 542,378 database entries. DeepAlign, Dali, and FATCAT are excluded due to their long projected execution times. The Foldseek curves appear truncated due to the total number of hits it produced. The axes scales in Panels b-d are chosen to accommodate the maximum values of the cumulative TM-score and the number of top hits with a TM-score ≥ 0.5. Source data are provided as a Source Data file.

Fig. 2
Fig. 2 Alignment evaluation results based on the TM-score normalized by the query length. a, Benchmarking results on the SCOPe40 2.08 dataset with 2045 queries and 15,177 database entries. The left panel plots the cumulative TM-score (normalized by the query length) against the number of top alignments ranked by a tool's measure (TM-score, Z-score, or P-value). In the middle panel, the alignments are sorted by the TM-align-obtained TM-score. Vertical lines denote the number of alignments with a TM-score ≥ 0.5. The arrow highlights the largest difference in that number between GTalign (492,887) and Foldseek (10,375). The right panel shows the cumulative TM-score plotted against runtime in seconds. b, Results on the PDB20 dataset with 186 queries and 18,801 database entries. c, Results on the Swiss-Prot dataset with 40 queries and 542,378 database entries. DeepAlign, Dali, and FATCAT are excluded due to their long projected execution times. Source data are provided as a Source Data file.
Fig. 5 Weighted precision-recall (PR) curves at the family, superfamily, and fold levels. The areas under these PR curves are reported in Supplementary Table S7. a, False positives for calculating precision and recall are pairs of structures from different SCOPe 2.08 folds, with the exception that those pairs belonging to Rossmann-like (c.2-c.5, c.27, c.28, c.30, and c.31) or beta-propeller (b.66-b.70) folds are ignored [35]. b, False positives correspond to pairs from different SCOPe 2.08 folds without exceptions. Source data are provided as a Source Data file.

Fig. 6
Fig. 6 Distributions of TM-scores (> 0.5) for domain pairs within the same and different SCOPe 2.08 folds, presented in bins of width 0.1. TM-scores, normalized by the length of the shorter protein, were calculated by aligning query and subject structures from the SCOPe40 2.08 dataset using TM-align. Representative examples of structure pairs from different folds and classes are provided for each TM-score bin. These examples, along with numerous other significantly structurally similar pairs from different folds and classes, are considered errors (false positives) in the reference SCOPe-based evaluation. Source data are provided as a Source Data file.

Fig. 7
Fig. 7 Schematic for benchmarking structure alignment tools. The entire procedure can be described as follows: (i) Run a structure alignment tool; (ii) Use TM-align to calculate the TM-score for each produced alignment; (iii) Sort the alignments by the tool's measure (e.g., P-value, Z-score, etc.); (iv) Additionally, sort the alignments by the TM-score calculated by TM-align; (v) Finally, calculate the cumulative TM-score for the results, considering both the sorting by the tool's measure and the sorting by the TM-align-obtained TM-score. This provides a comprehensive measure of how accurately the tool produces alignments and their rate. It's worth noting that seemingly subtle differences in cumulative TM-score can be significant, especially considering the narrow gap between successive TM-scores; a mere 0.2 difference can distinguish between an accurate and inaccurate alignment.
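Steps (iii) to (v) of this procedure can be sketched as follows, assuming that steps (i) and (ii) have already produced, for every alignment, the tool's measure and the TM-align-obtained TM-score. The record layout and function names are hypothetical.

```python
from itertools import accumulate


def cumulative_tm(alignments, key, reverse):
    """Cumulative TM-score over alignments sorted by `key`."""
    ordered = sorted(alignments, key=key, reverse=reverse)
    return list(accumulate(a["tm"] for a in ordered))


def benchmark(alignments, measure_descending=True):
    """Return both cumulative curves and the number of hits with TM-score >= 0.5.

    Use measure_descending=False for measures where smaller means more
    significant (e.g., P-value or E-value).
    """
    by_measure = cumulative_tm(alignments, lambda a: a["measure"], measure_descending)
    by_tm = cumulative_tm(alignments, lambda a: a["tm"], True)
    n_significant = sum(1 for a in alignments if a["tm"] >= 0.5)
    return by_measure, by_tm, n_significant


# Toy input: "measure" is the tool's own score, "tm" the TM-align TM-score.
hits = [
    {"measure": 5.0, "tm": 0.3},
    {"measure": 9.0, "tm": 0.5},
    {"measure": 7.0, "tm": 0.75},
]
by_measure, by_tm, n_significant = benchmark(hits)
```

The curve sorted by TM-score is an upper bound for the curve sorted by the tool's measure at every rank, so the gap between the two indicates how well the measure ranks the tool's own alignments.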

Fig. S1
Fig. S1 Benchmarking results on the SCOPe40 2.08, PDB20, and Swiss-Prot datasets. The figure plots the cumulative TM-score against the number of top alignments sorted by TM-align-obtained TM-score and represents zoomed-in sections of Fig. 1 and 2 from the main text for detailed analysis. Vertical lines indicate the number of alignments with a TM-score ≥ 0.5. Panels (a) and (b) show TM-scores normalized by the length of the shorter protein and by the query length, respectively. Source data are provided as a Source Data file.

Fig. S2
Fig. S2 Structural alignment examples from benchmarking on the SCOPe40 2.08 dataset. Query structures are in grey, and subject structures are colored. The subject structures share the same orientation across the tools for visual inspection. Numbers represent TM-scores normalized by the length of the shorter protein. It's noteworthy that some superpositions appear similar despite having different TM-scores. In such instances, TM-align identifies a favorable superposition even when provided with an inferior alignment. Foldseek (all parameterized versions) did not produce alignments for the given protein pairs.

Fig. S3
Fig. S3 Illustrative examples from benchmarking on the PDB20 dataset. Query structures are in grey, and subject structures are colored. The subject structures share the same orientation across the tools. Numbers represent TM-scores normalized by the length of the shorter protein. Foldseek (all parameterized versions) did not produce alignments for the given protein pairs.

Fig. S4
Fig. S4 Illustrative examples from benchmarking on the Swiss-Prot dataset. Query structures are in grey. Subject structures are colored. The subject structures share the same orientation across the tools. Numbers represent TM-scores normalized by the length of the shorter protein. Foldseek (all parameterized versions) did not produce alignments for the given protein pairs.

S1.3
Fig. S6 Illustrative example of alignment accuracy evaluation for the SCOPe 2.08 domains d12asa (d.104.1.1; shown in grey) and d1seta2 (d.104.1.1; shown in color) using TM-score normalized by the length of the shorter protein and RMSD. TM-score is sensitive to alignment coverage due to normalization by protein length. RMSD, on the other hand, is normalized by the number of aligned residue pairs and measures their spatial proximity after superposition, similar to TM-score. Therefore, a short accurately aligned fragment (right panel; unaligned d1seta2 parts not shown) implies a lower RMSD compared to the full alignment (left panel), which includes more divergent structural segments. Alignments, TM-scores, and RMSDs were calculated using TM-align.

Fig. S8
Fig. S8 Distributions of TM-scores (> 0.5) for domain pairs within the same and different SCOPe 2.08 folds, presented in bins of width 0.1. TM-scores, normalized by the query length, were calculated by aligning query and subject structures from the SCOPe40 2.08 dataset using TM-align. Source data are provided as a Source Data file.
Fig. S10 Results of the GTalign clustering analysis on the entire PDB database. a, Distribution of cluster sizes. b, Superimpositions of proteins within selected clusters, representing the Carbonic anhydrase fold (left panel) and the TIM beta/alpha-barrel fold (right panel). Despite variations in sequence identity (21-35% with the first member of the first cluster and at most 7% for the second cluster, excluding trivial matches), GTalign reveals significant structural similarity within clusters. Superpositions were found by aligning the first member of each cluster with the rest using GTalign and the option --referenced, producing transformation matrices for subject structures. Source data are provided as a Source Data file.

5: Set skip flag for configuration {q, s, f_ext}

m ← max_{i,j} S_ij ⊲ two-dimensional parallel reduction
15: if m < l_f × c_thr then ⊲ fragment length fraction as a similarity threshold
16: Set skip flag for configuration {q, s, f_ext}

Algorithm 3 Calculate cross-covariance matrices
1: procedure CalculateXCovariances(n_Q, n_S, n_ext, l_f, C, C)

3: continue if skip flag is set for configuration {q, s, f_ext}


14: for all i ∈ [0, l_f) do in parallel
15: D′_i ← R_{q,s,f_ext} D_i + t_{q,s,f_ext} ⊲ transformation
16: j ← nearest neighbour in D for D′_i using index with (b_SSM = 1) or without (b_SSM = 0) information θ_i ⊲ O(1) time complexity
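The transformation and nearest-neighbour steps of this fragment can be illustrated with a minimal sequential sketch. A uniform grid hash is used here as a stand-in for the spatial index; it is an assumption made for illustration and does not reproduce GTalign's index, its secondary-structure-aware lookup (b_SSM = 1), or its GPU parallelization. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def transform(points, rotation, translation):
    """Apply x' = R x + t to each 3D point (rotation is a 3x3 nested list)."""
    return [
        tuple(sum(rotation[r][c] * p[c] for c in range(3)) + translation[r] for r in range(3))
        for p in points
    ]


def build_grid(points, cell):
    """Hash each point index into a cubic cell of edge length `cell`."""
    grid = defaultdict(list)
    for j, p in enumerate(points):
        grid[tuple(floor(x / cell) for x in p)].append(j)
    return grid


def nearest(grid, points, cell, query):
    """Index of the closest point within the 27 cells around `query`, or None."""
    base = tuple(floor(x / cell) for x in query)
    best, best_d = None, float("inf")
    for offset in product((-1, 0, 1), repeat=3):
        for j in grid.get(tuple(b + o for b, o in zip(base, offset)), ()):
            d = dist(points[j], query)
            if d < best_d:
                best, best_d = j, d
    return best


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
subject = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
query = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
moved = transform(query, identity, (3.8, 0.0, 0.0))
grid = build_grid(subject, cell=4.0)
matches = [nearest(grid, subject, 4.0, p) for p in moved]
```

Each query residue is looked up independently of the others, which is the property that allows the lookups to run in parallel. The grid search inspects only neighbouring cells, so a subject residue farther away than one cell edge may be missed.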


Table S1
Runtimes and cumulative TM-score, normalized by the length of the shorter protein, for each dataset. Additionally, runtimes are provided for GTalign evaluated on one, two, and three Tesla V100 GPUs. Numbers in parentheses indicate the maximum number of top hits evaluated.

Table S2
Number of top hits at different levels of TM-score, normalized by the shorter protein length, for each dataset.

Table S3
Cumulative TM-score, normalized by the query length, for each dataset. Numbers in parentheses indicate the maximum number of top hits evaluated.

Table S7
Area under the weighted precision-recall curve (AUPRC) at the family, superfamily, and fold level. The results are presented in two parts. In one part, false positives for calculating precision and recall correspond to pairs of structures from different SCOPe 2.08 folds (No exceptions). In the second part, false positives are pairs from different SCOPe 2.08 folds; however, these pairs are ignored if the structures belong to Rossmann-like (c.2-c.5, c.27, c.28, c.30, and c.31) or beta-propeller (b.66-b.70) folds [1] (Cross-fold relationships ignored).

Table S8
Average sensitivity up to the first false positive at the family, superfamily, and fold levels. Sensitivity represents the fraction of true positives identified before the first false positive is encountered. In one analysis, false positives are defined as pairs from different SCOPe 2.08 folds, ignoring those pairs from Rossmann-like (c.2-c.5, c.27, c.28, c.30, and c.31) or beta-propeller (b.66-b.70) folds (Cross-fold relationships ignored). In the second analysis, false positives are defined as pairs from different SCOPe 2.08 folds (No exceptions). P-values in parentheses are obtained from the nonparametric unpaired two-sided Mann-Whitney test on the sensitivity values (Fig. S7) between GTalign --speed=0 and all other parameterized tools at the family, superfamily, and fold levels.