Protein language model powers accurate and fast sequence search for remote homology

Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here we propose PLMSearch (Protein Language Model), a homologous protein search method with only sequences as input. With deep representations from a pre-trained protein language model to predict similarity, PLMSearch can capture the remote homology information hidden behind the sequences. Extensive experimental results show that PLMSearch can search millions of query-target protein pairs in seconds like MMseqs2 while increasing the sensitivity by more than threefold, and is comparable to state-of-the-art structure search methods. In particular, unlike traditional sequence search methods, PLMSearch can recall most remote homology pairs with low sequence similarity but sharing similar structures. PLMSearch is freely available at https://dmiip.sjtu.edu.cn/PLMSearch.

) and SCOPe40 to Swiss-Prot (Extended Data Table A3). PLMSearch achieves the best performance in both sets of search tests against Swiss-Prot, especially at P@100, which is more intuitive to users (P@100 means how many correct 2). This is due to the fact

PLMSearch accurately detects remote homology pairs
We conducted a specific analysis of recalled pairs and missed pairs of different search methods in the Swiss-Prot to Swiss-Prot search test (Fig. 3). We calculated the TM-score and the sequence identity of protein pairs (see "Sequence identity calculation" Supplementary Section). In this paper, pairs with similar sequences and similar structures are defined as sequence identity > 0.3 [38] and TM-score > 0.5 [39, 40] and are called "easy pairs"; pairs with dissimilar sequences but similar structures are defined as sequence identity < 0.3 but TM-score > 0.5 and are called "remote homology pairs" (Fig. 4).
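The two definitions above can be expressed as a small labeling function. This is a minimal sketch: the function name `classify_pair` and the label strings are illustrative and not taken from the PLMSearch code, and because the text does not say how values exactly at a threshold are handled, the sketch assigns them to "other".

```python
def classify_pair(seq_identity: float, tm_score: float) -> str:
    """Label a protein pair using the thresholds stated in the text.

    seq_identity and tm_score are both expected in the range [0, 1].
    """
    similar_structure = tm_score > 0.5
    if similar_structure and seq_identity > 0.3:
        return "easy"
    if similar_structure and seq_identity < 0.3:
        return "remote_homology"
    # Dissimilar structures, or a value exactly on a threshold (assumption).
    return "other"
```

For example, a pair with sequence identity 0.12 and TM-score 0.71 would be labeled a remote homology pair.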
Compared with easy pairs, remote homology pairs in the "twilight zone" of protein sequence homology are more difficult to find [38] 5). We compared the method based on 7). The baselines contain PfamClan, which retains pairs containing the same Pfam clan domain. As a comparison, Euclidean, COS, SS-predictor (w/o COS), SS-predictor, and PLMSearch take the first 10,000 pairs with the highest similarity as the results.
(2) Based on the same embeddings, SS-predictor performs better than other similarity-based search methods (boosting F1-score by 160% and 156% compared with Euclidean and COS, respectively).
We further compared the correlation between the predicted similarity and TM-score (Extended Data Fig. A1). The correlation between the similarity predicted by Euclidean (COS) and TM-score is not high, so a large number of protein pairs that are actually dissimilar are ranked near the top.
The similarity predicted by SS-predictor is more correlated with TM-score (with a higher Pearson correlation coefficient and a higher Spearman correlation coefficient, see Supplementary Table 9).
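For readers who want to reproduce this kind of comparison, the two coefficients can be computed as in the following sketch. It is a plain-Python illustration, not the authors' evaluation script; the function names are our own, and any input lists are assumed to hold one predicted similarity and one TM-score per protein pair, in the same order.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))
```

Spearman only depends on the ordering of the scores, which is why it is a natural companion to Pearson when the predicted similarity is used for ranking search results.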

It is worth noting that the recall rate is

is a key component of bioinformatics methods used in protein function prediction [1-5], protein-protein interaction prediction [6], and protein-phenotype association prediction [7]. The goal of homologous protein search is to find the homologous protein from the target dataset (generally a large-scale standard dataset) for each query protein. And the target protein with a higher degree of homology should be ranked higher. According to the type of input

the TM-score benchmark (see "Evaluation based on the TM-score benchmark"), both with Swiss-Prot as the target dataset. The two sets of search tests are Swiss-Prot to Swiss-Prot (Extended Data Table

query results the user can get on average among the top 100 results).
running times on the three search tests (Extended Data Table A4). By using PfamClan to pre-filter the protein pairs and SS-predictor to predict the similarity, instead of calculating the structural similarity (TM-score) of all protein pairs from scratch, PLMSearch is more than four orders of magnitude faster than TM-align in large-scale searches. It is worth noting that PLMSearch can achieve similar efficiency even with CPU only (Supplementary Table

that PLMSearch calculates and preloads the deep embedding of all proteins in the target dataset in advance. This strategy helps to save much time by avoiding repeated forward propagations of a pre-trained model with a large number of parameters. The efficiency of PLMSearch also benefits from the fact that it only needs to calculate the protein pairs pre-filtered by PfamClan instead of calculating all pairs from scratch like SS-predictor, which makes PLMSearch more than 8 times faster than SS-predictor in large-scale scenarios (Supplementary

many pairs need to be searched. PLMSearch excels in this common scenario.
In addition, when we evaluate structure search methods, we assume that the protein structure has been prepared in advance. In fact, although the predicted structures of most of the proteins in the UniProt dataset can be downloaded directly from the AlphaFold protein structure database (AFDB) [21, 23], there are still exceptions. In scenarios containing a large number of new sequences, the time to prepare structural data needs to be considered when using structure search methods. Sequence search methods such as PLMSearch do not incur this cost.

2.4 PfamClan and SS-predictor make PLMSearch more robust
We first performed an ablation experiment based on the SCOPe benchmark on the SCOPe40-test search test (Supplementary Fig. 1, Supplementary Table

Fig. 2, Supplementary Table 7). The baselines 8). Moreover, SS-predictor increases the robustness and reliability in extreme cases by multiplying the predicted TM-score by the COS similarity between protein embeddings as the final similarity (Supplementary Table figure), making PLMSearch more sensitive and

2.5 PLMSearch, like TM-align, pays more attention to the global similarity
We analyzed the wrong pairs of different search methods in the Swiss-Prot to Swiss-Prot search test (Supplementary 11). As reported in Foldseek's paper, TM-align does not consider these pairs to share similar structures, as TM-align looks for global structural superpositions, so the TM-scores of these pairs are low (TM-score(Default) < 0.5, TM-score(Avg. length) < 0.15). Nevertheless, Foldseek still returns them because it pays more attention to local similarity. Like TM-align, PLMSearch assigns all of these pairs a similarity below the reference similarity of 0.3, so PLMSearch also considers their structures not similar (see "Reference similarity of PLMSearch" Supplementary Section and Supplementary

In the future, we will explore removing PfamScan [41] from the pre-filtering process, transforming the entire PLMSearch into a pure deep-learning model. Specifically, the current PLMSearch must find protein pairs that belong to the same Pfam clan based on PfamScan results. We hope to use a machine learning model to predict the fold to which each protein belongs and form protein pairs between proteins belonging to the same fold, thereby saving the time required for PfamScan. Furthermore, rather than using a general-purpose large-parameter protein language model like ESM-1b, we will try to use the pre-training method to fine-tune a protein model that has fewer parameters but is better suited to protein search, making PLMSearch lighter and faster.

removed the low sensitivity limitations of sequence search methods. Since sequences are more widely applicable and easier to obtain than structures, PLMSearch is expected to become a more convenient homologous protein search method.

PLMSearch consists of three steps (Fig. 1).
(1) PfamClan. We use PfamScan [41] to identify Pfam clan domains in query protein sequences and search a target dataset for proteins sharing the same Pfam clan domain. In addition, a limited number of query proteins lack any Pfam clan domain, or their Pfam clans differ from those of every target protein. To prevent such queries from yielding no results, all pairs between such a query protein and the target proteins are retained.
(2) Similarity prediction. We generate embeddings containing remote homology information using a protein language model. Subsequently, we use the SS-predictor to predict the similarity of each pair.
(3) Search result. Finally, we sort the protein pairs based on their similarity, and output the search results for each query protein accordingly.
For the filtered protein pairs, if TM-align structure alignment is required, users can use the parallel architecture provided by us (Supplementary Fig. 3) to quickly calculate TM-scores.
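The three steps can be sketched in a few lines of Python. This is an illustrative outline under stated assumptions, not the PLMSearch implementation: the function names, the dictionary-based inputs, the clan identifiers, and the toy scores are all hypothetical, and the `similarity` callable stands in for the protein language model plus SS-predictor.

```python
from typing import Callable, Dict, List, Set, Tuple


def prefilter_by_clan(
    query_clans: Dict[str, Set[str]],
    target_clans: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Step (1): keep query-target pairs that share at least one Pfam clan.

    A query that shares no clan with any target (including a query with no
    clan at all) keeps every target, so it never returns an empty result.
    """
    pairs = []
    for query, q_clans in query_clans.items():
        hits = [t for t, t_clans in target_clans.items() if q_clans & t_clans]
        if not hits:
            hits = list(target_clans)
        pairs.extend((query, target) for target in hits)
    return pairs


def search(
    query_clans: Dict[str, Set[str]],
    target_clans: Dict[str, Set[str]],
    similarity: Callable[[str, str], float],
) -> Dict[str, List[Tuple[str, float]]]:
    """Steps (2) and (3): score the pre-filtered pairs, then sort per query."""
    results: Dict[str, List[Tuple[str, float]]] = {q: [] for q in query_clans}
    for query, target in prefilter_by_clan(query_clans, target_clans):
        results[query].append((target, similarity(query, target)))
    for hits in results.values():
        hits.sort(key=lambda item: item[1], reverse=True)
    return results


# Toy data (made up for illustration): q2 has no clan, so it keeps all targets.
toy_queries = {"q1": {"CL0001"}, "q2": set()}
toy_targets = {"t1": {"CL0001"}, "t2": {"CL0002"}}
toy_scores = {("q1", "t1"): 0.9, ("q2", "t1"): 0.2, ("q2", "t2"): 0.7}
results = search(toy_queries, toy_targets, lambda q, t: toy_scores[(q, t)])
```

Note that the pair (q1, t2) is never scored, which is where the pre-filter saves time relative to scoring all pairs.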

As mentioned above, PfamClan selects protein pairs that share the same Pfam clan domain (Fig. 1, step (1)). Compared to calculating the similarity of all protein pairs from scratch (as SS-predictor does), using PfamClan to pre-filter greatly reduces the calculation time (Supplementary Table

more important in the initial pre-filter. PfamClan is based on a more relaxed standard of sharing the same Pfam clan domain, instead of sharing the same Pfam family domain. This feature helps PfamClan achieve a higher recall than PfamFamily (see "Pfam based pre-filter method" Supplementary Section).

Based on the protein language model and SS-predictor, PLMSearch performs further similarity prediction based on the pre-filter results of PfamClan (Fig. 1, step (2)). The motivation is that the clustering results based on PfamClan show a significant long-tailed distribution. The one or two largest "Big clusters" account for a large share of proteins and the majority of protein pairs (>13% of proteins and >59% of protein pairs, see Extended Data Fig. A3). As the size of the dataset increases, the number of proteins contained in the "Big clusters" will greatly expand, further leading to a rapid increase in the number of pre-filtered protein pairs (Supplementary Table 15). Running TM-align on all the filtered pairs would require excessive computing resources. PLMSearch uses the predicted similarity instead of the TM-score calculated by TM-align, which helps to greatly increase speed and avoids dependence on structures.

Protein language models [27-29] learn only from sequence data and generate deep representations (embeddings) that contain multiple biological properties of proteins. Such deep representations have been widely used in various downstream tasks [30], especially the secondary structure prediction and contact prediction tasks that are related to structure [31]. Good performance on these tasks demonstrates that the deep representation (embedding) already contains the protein language model's understanding of protein structure.

As shown in Extended Data Fig. A4, the input protein sequences are first sent to the protein language model (ESM-1b here) to obtain the embedding of each amino acid, and the protein embedding is then obtained through the average pooling layer. Subsequently, SS-predictor predicts the structural similarity (TM-score) between proteins through a bilinear projection network. SS-predictor multiplies the predicted TM-score by the COS similarity between protein embeddings as the final similarity.
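The scoring path described above (average pooling, a bilinear prediction of TM-score, then multiplication by the COS term) can be illustrated with a small plain-Python sketch. Several parts are assumptions rather than the authors' network: the bilinear head is written here as sigmoid(u^T W v + b), the weight matrix and bias would come from training, the COS term is computed as cosine similarity, and all function names are our own.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mean_pool(residue_embeddings: Matrix) -> Vector:
    """Average per-residue embeddings into one protein-level embedding."""
    n = len(residue_embeddings)
    return [sum(column) / n for column in zip(*residue_embeddings)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two protein embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bilinear_tm(u: Vector, v: Vector, weight: Matrix, bias: float = 0.0) -> float:
    """Hypothetical bilinear head: sigmoid(u^T W v + b), a value in (0, 1)."""
    w_times_v = [sum(w * b for w, b in zip(row, v)) for row in weight]
    logit = sum(a * x for a, x in zip(u, w_times_v)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def final_similarity(u: Vector, v: Vector, weight: Matrix, bias: float = 0.0) -> float:
    """Final score: predicted TM-score multiplied by the cosine term."""
    return bilinear_tm(u, v, weight, bias) * cosine(u, v)
```

With this form, a pair whose embeddings point in unrelated directions gets a final score near zero even if the bilinear head alone predicts a high TM-score, which is one way to read the robustness argument in the text.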
consists of single domains with real structures. Clustering of SCOPe 2.01 [42, 43] at 40% sequence identity yielded 11,211 non-redundant protein domain structures ("SCOPe40"). As done in Foldseek, domains from SCOPe40 were split 8:2 by fold into SCOPe40-train and SCOPe40-test sets, and then domains with a single chain were retained. We trained SS-predictor on SCOPe40-train and performed tests on SCOPe40-test as a benchmark. This means that the sequences and folds in the test set are both unseen during training. In addition, to make a more objective comparison, the settings and metrics used in the all-versus-all SCOPe40-test search test are exactly the same as those used in Foldseek.
4.4.2 Swiss-Prot
Unlike SCOPe, the Swiss-Prot dataset consists of full-length, multi-domain proteins with predicted structures, which is closer to real-world scenarios. Because the throughput of experimentally determining protein structures is very low and requires a lot of human and financial resources, the number of real structures in datasets like PDB [44-46] tends to be low. The AlphaFold protein structure database (AFDB) obtains protein structures through deep learning prediction, so it can

9 Figures and Tables

Fig. 1 Overview of the PLMSearch pipeline. a, PfamClan. Initially, PfamScan [41] identifies the Pfam clan domains of the query protein sequences, which are depicted in different color blocks. Subsequently, PfamClan searches the target dataset for proteins sharing the same Pfam clan domain with the query proteins. Notably, the last query protein lacks any Pfam clan domain, and therefore all of its pairs with target proteins are retained. b, Similarity prediction. The protein language model generates deep sequence embeddings for each protein pair. Subsequently, SS-predictor predicts the similarity of each protein pair. c, Search result. Finally, PLMSearch sorts the protein pairs based on their similarity and outputs the search results for each query protein separately.

Fig. 4 Case study of remote homology pairs. Protein pairs with dissimilar sequences but similar structures are defined as sequence identity < 0.3 but TM-score > 0.5 and are called remote homology pairs here. The sequence identity between Q08558 (the first sequence, blue structure) and I6Y3U6 (the second sequence, green structure) is low. Thus, it is difficult to find this remote homology pair through sequence alignment alone (for convenience of presentation, only the sequence alignment results of the first 41 amino acids are shown). Like structure search methods (Foldseek and TM-align), PLMSearch, powered by the protein language model, captures the deep protein homology pair that is missed by MMseqs2.

Fig. 6 Definition diagram. An example composed of three query proteins and five target proteins to explain various cases in the evaluation based on the TM-score benchmark. The a-f cases are represented by the corresponding legend a-f. Among the 15 protein pairs, three "High TM-score protein pairs" are marked with a; six "Medium TM-score protein pairs" are marked with b; six "Low TM-score protein pairs" are marked with c; six "Filtered protein pairs" are marked with d. The protein pair at (3,3) has a TM-score higher than 0.5 but is not among the filtered pairs, which makes it a "Missed pair" marked as e. The protein pair at (3,5) has a TM-score lower than 0.3 but is among the filtered pairs, which makes it a "Wrong pair" marked as f.
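The "Missed pair" and "Wrong pair" definitions in this caption can be checked mechanically. The sketch below is illustrative only: the function name, the use of (query, target) index tuples as keys, and the example TM-scores are hypothetical, while the 0.5 and 0.3 thresholds are the ones stated in the caption.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # (query index, target index)


def missed_and_wrong(
    tm_scores: Dict[Pair, float],
    filtered: Set[Pair],
    high: float = 0.5,
    low: float = 0.3,
) -> Tuple[Set[Pair], Set[Pair]]:
    """Return (missed, wrong) pairs for one search method.

    missed: TM-score above `high` but not among the filtered pairs.
    wrong:  TM-score below `low` but among the filtered pairs.
    """
    missed = {p for p, s in tm_scores.items() if s > high and p not in filtered}
    wrong = {p for p, s in tm_scores.items() if s < low and p in filtered}
    return missed, wrong


# Toy values (invented): only the positions (3,3) and (3,5) mirror the caption.
toy_tm = {(1, 1): 0.80, (2, 4): 0.40, (3, 3): 0.62, (3, 5): 0.21}
toy_filtered = {(1, 1), (2, 4), (3, 5)}
missed, wrong = missed_and_wrong(toy_tm, toy_filtered)
```

Pairs with a TM-score between the two thresholds count as neither missed nor wrong, regardless of whether they were filtered.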
mentary Table 3). It is worth noting that the time required for preprocessing grows linearly (time complexity O(n)). It takes only 0.03 s to generate the embedding and 0.26 s to generate the Pfam result of each protein, if tested in Swiss-Prot. However, the number of protein pairs to be searched grows quadratically (time complexity O(n * m)). Therefore, in large-scale protein searches, reducing search time is more crit-