SENSE-PPI reconstructs protein-protein interactions of various complexities, within, across, and between species, with sequence-based evolutionary scale modeling and deep learning

Ab initio computational reconstructions of protein-protein interaction (PPI) networks will provide invaluable insights into cellular systems, enabling the discovery of novel molecular interactions and elucidating biological mechanisms within and between organisms. Leveraging latest-generation protein language models and recurrent neural networks, we present SENSE-PPI, a sequence-based deep learning model that efficiently reconstructs ab initio PPIs, distinguishing partners among tens of thousands of proteins and identifying specific interactions within functionally similar proteins. SENSE-PPI demonstrates high accuracy, limited training requirements, and versatility in cross-species predictions, even with non-model organisms and human-virus interactions. Its performance decreases for phylogenetically more distant model and non-model organisms, but the signal degrades only slowly. SENSE-PPI is state-of-the-art, outperforming all existing methods; in this regard, it demonstrates the important role of the number of parameters in protein language models. SENSE-PPI is very fast and can test 10,000 proteins against themselves in a matter of hours, enabling the reconstruction of genome-wide proteomes.
Graphical abstract: SENSE-PPI is a general deep learning architecture predicting protein-protein interactions of different complexities: between stable proteins, between stable and intrinsically disordered proteins, within a species, and between species. Trained on one species, it accurately predicts interactions and reconstructs complete specialized subnetworks for model and non-model organisms; trained on human-virus interactions, it predicts human-virus interactions for new viruses.


Introduction
Protein-protein interaction (PPI) networks play a key role in biology and medicine in the interpretation of protein functions in cellular processes. In the past two decades, working with networks has significantly advanced our understanding of the relationships between molecules [31,59,43,34,19,20]. This was possible thanks to many computational attempts [37,16,29,67,49,44,12,30,64,10,27,9,13,21], among which recent deep learning architectures [4,61,53,52], and high-throughput experimental methods such as yeast two-hybrid [62,17] or tandem purification [2,66] that have been extensively developed. A particular concern is their level of noise and incompleteness [22]. In addition, various studies give insights into the precise spatial organization and dynamic temporal remodeling of local protein interaction networks within the cell [48,26], highlighting that PPI networks are only "projections" of particular spatio-temporal PPI realizations.
For instance, within each individual, genomic alterations contribute to different PPI realizations. Accordingly, understanding any biological process demands defining three parameters: the composition of the "underlying protein network", its organization in space, and its evolution over time. Future technological developments will bring an overwhelming amount of precise information on these genomic, spatial, and temporal dimensions, and sophisticated computational tools for extracting information from them are mandatory. Ultimately, PPI networks should be understood within biological frameworks including transcriptomic and epigenomic data. Here, we address the first step of this development, the construction of the "underlying protein network", which will serve as the basis for more sophisticated and realistic reconstructions. Because of the difficulties explained above, which are intrinsic to experimental data, ab initio computational reconstruction of PPIs is expected to provide invaluable information leading to the identification of protein partners.
Today, the problem of PPI network reconstruction is becoming particularly important due to the wealth of sequence data available from different species and the need to understand proteomes within and between species, which most of the time are non-model organisms, that is, species that cannot be grown in the laboratory, or that have long life cycles, low fecundity, or poorly tractable genetics. At the same time, the identification of protein partners in PPIs puts new deep learning approaches to the test, both for discovering sophisticated correlations within pairs of interacting protein sequences and for estimating the absence of such correlations when the interaction does not exist in the cell. We take up this challenge and propose SENSE-PPI, an ab initio deep learning approach for protein partner identification, coupling layers of gated recurrent units (GRUs) [5] with the ESM2 protein language model (PLM) encoding sequences [25], to achieve optimal identification of protein partners spanning a large spectrum of biological functions and leading to the reconstruction of PPI networks. Indeed, the complexity of the problem may vary depending on the features and the origins of proteins. SENSE-PPI is a generalizable architecture that can be applied to many types of protein datasets for PPI reconstruction.
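The GRU layers at the core of the architecture process per-residue embeddings sequentially. As an illustration of the underlying mechanism only (not the paper's actual implementation; dimensions and weights below are toy values), a single GRU update step can be sketched in NumPy:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU update: the gates decide how much of the previous
    hidden state to keep versus overwrite with new information."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_cand

rng = np.random.default_rng(0)
d_in, d_h = 8, 4  # toy dimensions, far smaller than real PLM embeddings
params = [rng.standard_normal((d_in if i % 2 == 0 else d_h, d_h)) * 0.1
          for i in range(6)]
h = np.zeros(d_h)
for _ in range(5):  # run over a toy "embedded sequence" of length 5
    h = gru_step(rng.standard_normal(d_in), h, params)
print(h.shape)  # (4,)
```

The final hidden state summarizes the sequence; in a PPI setting, two such summaries (one per protein) would then be combined by a classifier head.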
It has been trained and tested on: 1. the reference yeast dataset, where the same proteins appear both in training and testing; 2. the human D-SCRIPT dataset [53], where close homologs might be shared between training and testing; 3. a second human dataset from the STRING database, where sequences appearing in training and testing have been filtered based on homology; 4. the fly, worm, yeast, mouse, chicken, and bacteria datasets, used for validation after training on a different species, in our case the human proteome; 5. the horse, cow, snake, and aphid datasets, used for testing non-model organisms; 6. the human-virus dataset, for protein interactions between species; 7. the IDPpi dataset [39], for interactions with human intrinsically disordered proteins. Based on this ample spectrum of datasets, we could evaluate the generalizability of SENSE-PPI and its limitations.

Results
We leverage the power of deep learning to identify correlations in interacting sequences, and to estimate their absence when there is no interaction. Our architecture, SENSE-PPI, exploits the latest generation of PLMs, ESM2, and recurrent neural networks (GRUs) to predict the interaction of protein pairs with different features on a large scale.
Validation is carried out using a wide range of interaction datasets of different complexities: from model and non-model organisms, inter-species interactions, human-virus interactions, and interactions with intrinsically disordered human proteins. A large new dataset of interactions in Homo sapiens, comprising over 1 million interactions, is designed on the basis of a restrictive "neighboring-exclusion" condition that guarantees more accurate predictions, a property that is particularly required when laboratory tests are planned. SENSE-PPI is compared with state-of-the-art deep learning approaches, PIPR, D-SCRIPT, Topsy-Turvy, and STEP, in various scenarios.

A yeast dataset for a reference comparison
SENSE-PPI was first challenged on Guo's yeast dataset [12], used as a reference for method comparison.
We compare SENSE-PPI with Guo's performance, PIPR, and STEP. All approaches provide significantly high scores, and these scores are very close to each other. The average Matthews' correlation coefficient (MCC; see Methods) of SENSE-PPI is 95.46, which represents an improvement over the other deep learning models (94.77 and 94.17 for STEP and PIPR, respectively), although this difference is not substantial (see Table SI 1). The exceptional success of all methods is explained by the characteristics of the computational experiment. In fact, all models were tested by performing 5-fold cross-validation on a dataset of protein pairs where the same protein appears in several pairs in the full dataset. Since the testing data represent a randomly chosen subset of sequence pairs at each cross-validation fold, in the validation step, each model mostly evaluates proteins already "seen" during training. Such pairs are known to show better performance than those excluded from training [35,13].
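For reference, the Matthews' correlation coefficient used throughout these comparisons follows the standard definition from confusion-matrix counts; a minimal sketch (the counts below are illustrative, not the paper's):

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews' correlation coefficient from confusion-matrix counts;
    ranges from -1 (total disagreement) to +1 (perfect prediction)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A balanced test set with 95% accuracy on each class gives MCC = 0.9.
print(mcc(tp=95, tn=95, fp=5, fn=5))  # 0.9
```

Unlike accuracy, MCC stays informative on the heavily imbalanced 1:10 positive-to-negative test sets used later in the paper.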

Training on the human proteome and testing on model and non-model organisms
Recent global efforts to sequence the biodiversity of species [60,3,23,24,41] make PPI predictions in yet unexplored organisms a major challenge. Here, we test SENSE-PPI on non-model species and assess its generalization capabilities for predicting interactions in model organisms that are phylogenetically distant, to varying degrees, from the species used for training. The SENSE-PPI model was trained on H. sapiens and tested on PPI data from several different species: H. sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Escherichia coli, Gallus gallus, Equus caballus, Bos taurus, Notechis scutatus, and Acyrthosiphon pisum. The first seven organisms are model organisms, while the last four are not.

Benchmark testing for model organisms
To compare the performance of SENSE-PPI with state-of-the-art approaches on model organisms, we repeated the training and testing procedures described in [53]. SENSE-PPI was trained on the large D-SCRIPT human dataset, specially designed to avoid protein redundancies in training and validation sets. Table 1 presents the comparison of SENSE-PPI performance with those of PIPR, D-SCRIPT, and Topsy-Turvy trained on the same human data. The overall performance remains high for all four DL models when testing is carried out within H. sapiens. However, scores decline progressively when tests are carried out on the other four model species.
As the species' evolutionary distance from H. sapiens increases, performance decreases progressively: the most accurate predictions are obtained for M. musculus and the worst for S. cerevisiae. SENSE-PPI outperforms all other DL methods on all model species tested and, more importantly, offers greater generalization ability in several respects. First, the ability of SENSE-PPI to distinguish between positive and negative interactions is measured with the AUROC score (see Methods), which remains above 0.9 for all test sets, while D-SCRIPT goes from an AUROC score of 0.833 for H. sapiens down to 0.790 for S. cerevisiae. Second, SENSE-PPI's performance results in an F1-score (see Methods) of 0.848 within H. sapiens that remains above 0.737 for M. musculus, D. melanogaster, and C. elegans. This result shows that PPI networks of model species whose common ancestor with H. sapiens is estimated at 708 million years ago [18] remain well predicted by SENSE-PPI trained on H. sapiens. The SENSE-PPI F1-score decreases to 0.555 for S. cerevisiae, which shares a common ancestor with H. sapiens at 1,300 million years. It should be noted, however, that the performance of SENSE-PPI on this very distant species is superior, in terms of F1-score, to that of PIPR on M. musculus (0.555 versus 0.456) and that of D-SCRIPT on H. sapiens (0.402). Overall, SENSE-PPI significantly outperforms previously developed methods on both single-species and cross-species tests.

New human training dataset improves accuracy
Motivated by the design of a fair training dataset on which to evaluate SENSE-PPI, we constructed the SENSE-PPI human dataset from interactions in the STRING database v11.5. The set of negative protein pairs was defined by a stringent "neighboring-exclusion" condition ensuring that A-B is a negative interaction in the training set if: 1. A and B are not known to interact in STRING; 2. no homolog of A, at more than 40% sequence identity, is known to interact with a homolog of B in STRING; and 3. no homolog, at more than 40% sequence identity, of a known interactor of B is known to interact with a homolog of A. Note that the first two conditions are also used in the construction of the D-SCRIPT human dataset. This dataset is double the size of the D-SCRIPT human dataset, containing more than 86,000 positive and 860,000 negative pairs. We trained SENSE-PPI on this new dataset and compared its performance on the four D-SCRIPT test sets from four different species [53]. Table 2 shows the F1-score, precision, and recall.
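The three conditions above can be sketched as a filter over candidate negative pairs. The data structures below (`interacts`, `homologs`) are hypothetical stand-ins for STRING interaction data and a >40% sequence-identity homolog map, not the paper's code:

```python
def is_valid_negative(a, b, interacts, homologs):
    """Apply the 'neighboring-exclusion' conditions to a candidate
    negative pair A-B (toy data structures, see lead-in)."""
    # 1. A and B are not known to interact.
    if frozenset((a, b)) in interacts:
        return False
    # 2. No homolog of A is known to interact with a homolog of B.
    if any(frozenset((ha, hb)) in interacts
           for ha in homologs[a] for hb in homologs[b]):
        return False
    # 3. No homolog of a known interactor of B interacts with a homolog of A.
    partners_b = {q for pair in interacts if b in pair for q in pair if q != b}
    if any(frozenset((hp, ha)) in interacts
           for p in partners_b for hp in homologs[p] for ha in homologs[a]):
        return False
    return True

interacts = {frozenset(("A", "B")), frozenset(("B", "C"))}
homologs = {p: {p} for p in "ABCD"}  # trivial homolog map for the toy example
print(is_valid_negative("A", "B", interacts, homologs))  # False: known PPI
print(is_valid_negative("A", "C", interacts, homologs))  # False: A binds C's partner B
print(is_valid_negative("A", "D", interacts, homologs))  # True: a valid negative
```

The A-C rejection shows why the third condition matters: it excludes negatives drawn from the immediate network neighborhood of a true interaction.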
Even though the SENSE-PPI human dataset leads to better performance (F1-score) in 3 cases out of 4, the improvement remains modest. For practical usage, it is important to note that the trained models provide different precision/recall ratios. In particular, the model trained on the SENSE-PPI human dataset obtains higher precision (trading off some recall instead) and might be more valuable when lab experiments are to be conducted and target partners identified: it will produce fewer false positives, thus increasing the proportion of relevant interactions among all positive predictions.

Testing across the evolutionary tree on model and non-model species
Furthermore, we asked whether SENSE-PPI behavior is preserved across the phylogenetic tree and, in particular, for non-model species. For this, we constructed SENSE-PPI datasets for the four non-model and six model species (see Methods). A coherent behavior of SENSE-PPI across species and evolutionary time would support the use of SENSE-PPI on species at a fixed evolutionary distance from the training one for which little is known about protein interactions. Consistent with our expectations, the behavior of SENSE-PPI is illustrated in Figure 1 (see also Table SI 4).
The evaluation of SENSE-PPI on 10 testing datasets across species provides valuable insights, although it does not offer a comprehensive assessment of performance. Hence, to tackle the redundancy problem inherent in input data seen in training [35,13], we systematically filtered the proteins in the test sets according to different levels of sequence identity with proteins in the training set and evaluated SENSE-PPI precision accordingly. Previous attempts to analyze the effects of the proximity of training and testing sets mostly used the concept of C1/C2/C3 classes [35,13]. These classes divide the testing set into three subgroups based on how many proteins in a pair were already present in training (two, one, or none, respectively). In this work, however, we focus on how performance varies with the increase in the evolutionary distance between the tested species and the species from the training data, without relying on this discrete division into three classes.
For this, different extents of sequence identity were measured with MMseqs2 [54] by searching for the closest matches (i.e., consecutive k-mer matches, based on sequence identity without gaps) between testing and training sequences. For each protein pair in the testing set, we computed a so-called "mean pair sequence identity": first, we computed the maximum sequence identities of both proteins in a pair with respect to all proteins in the training set, and then, we took the mean of these two values (see Methods). Figure 1A shows the distribution of MCC scores with respect to the mean pair sequence identity for ten different species. We observe a clear positive correlation: the larger the mean pair sequence identity, the better the performance. This agrees well with the evolutionary distance between species (Figure 1B): species that are closer to the training data (H. sapiens) tend to have higher scores. However, one can observe a tendency of non-model organisms to have slightly lower scores: Figures 1C and 1D show the MCC versus mean pair sequence identity plots for model and non-model organisms, respectively. Here, instead of simply taking the mean pair sequence identity for all test entries, the data were divided into bins of size 0.1. The two plots show that the main difference in the evaluation of model versus non-model organisms concerns low values of mean pair sequence identity. With the sole exception of E. coli (which is the farthest organism from the training data in terms of evolutionary distance), the MCC scores for model organisms are not lower than 0.3. At the same time, non-model organisms start to have scores greater than 0.3 only when the mean pair sequence identity rises beyond 40%.
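A minimal sketch of the "mean pair sequence identity" computation and the 0.1-wide binning, assuming the per-protein maximum identities against the training set have already been obtained (here replaced by toy values rather than a real MMseqs2 search):

```python
import numpy as np

# Toy stand-in for an MMseqs2 search: maximum sequence identity of each
# test protein against the entire training set (hypothetical values).
max_ident = {"P1": 0.82, "P2": 0.35, "P3": 0.10, "P4": 0.55}
test_pairs = [("P1", "P2"), ("P3", "P4"), ("P1", "P4")]

# Mean pair sequence identity: the mean of the two per-protein maxima.
pair_ident = np.array([(max_ident[a] + max_ident[b]) / 2 for a, b in test_pairs])
print(pair_ident)  # [0.585 0.325 0.685]

# Bin into width-0.1 intervals, as done for the binned curves.
bins = np.floor(pair_ident / 0.1).astype(int)
print(bins)  # [5 3 6]
```

Performance metrics such as MCC can then be computed per bin to trace how accuracy degrades as test pairs become less similar to the training set.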
As a proof of concept, we tested the non-model organism A. pisum after training SENSE-PPI on a dataset made of species that are evolutionarily closer to aphids than humans are. This dataset is a merge of two separate datasets for D. melanogaster and C. elegans (see Methods), sharing a common ancestor with the aphid at 362 My and 572 My respectively, compared to the common ancestor shared with H. sapiens at 708 My. This merge was necessary due to the absence of sufficiently many high-confidence interactions in D. melanogaster alone. Hence, a relatively close model species was chosen, C. elegans, so that the size of the resulting dataset could be comparable to the human D-SCRIPT dataset used in Table 1. In line with our expectations, PPI predictions on non-model organisms are improved when closer model organisms are used to train SENSE-PPI.
Test scores for A. pisum improved from 0.660 to 0.701 in F1-score, and from 0.626 to 0.689 in MCC. Moreover, by combining the training set from the two species with the SENSE-PPI human dataset, we can further improve predictions, achieving an F1-score of 0.742 and an MCC of 0.724. This observation should be taken as a general recommendation, even though the improvement is not exceptionally high, owing to SENSE-PPI's generally high performance over a large interval of mean pair sequence identity values, as illustrated by the plateau of the curve in Figure 1A.

Larger protein language models improve performance
PLMs significantly contribute to enhanced performance. To see this, we consider two sizes of ESM2 embedding modules and analyze how MCC scores change at different values of the mean pair sequence identity for the four D-SCRIPT datasets of M. musculus, D. melanogaster, C. elegans, and S. cerevisiae (Figures 1E,F). Figure 1E shows the SENSE-PPI performance obtained with the smaller version of ESM2, based on 35M parameters (with 1M additional parameters in the rest of the SENSE-PPI architecture), and Figure 1F demonstrates the performance of the regular SENSE-PPI version, based on 3B parameters (with 2M additional parameters in the rest of the SENSE-PPI architecture). The ESM2 model with 3B parameters markedly improves the predictions on pairs of proteins with low values of mean pair sequence identity compared to the smaller version of ESM2. For instance, for a mean pair sequence identity < 0.2, predictions on the fly display an MCC of 0.2 (Figure 1E) versus 0.4 (Figure 1F). Moreover, for entries with high values of mean pair sequence identity, the performance based on both embeddings remains comparable. This remains true for all 4 datasets and shows that the ESM2 version with more parameters has better generalization capability.

Evaluation on the human intrinsically disordered protein dataset
Approximately 33% of eukaryotic proteins have significant disordered regions, with an increasing occurrence of disorder in higher organisms [38], and predicting interactions between proteins that are possibly disordered is becoming an urgent demand [32,63,50,51]. Deep learning approaches based on sequences are expected to open the way to this challenge. SENSE-PPI was additionally compared to the IDPpi model [39], designed to predict the interaction between proteins with stable structure and intrinsically disordered proteins (IDPs). The models were tested on the same five test sets used in [39]; Table SI 2 shows the results for both models.

Human-virus PPIs
Predicting and understanding virus-host PPIs is important for developing new therapeutic interventions, but the knowledge of virus-host interactions covered by current databases, such as VirHostNet [11], is limited. We therefore tested the ability of SENSE-PPI to predict interactions between a known human protein and an unknown viral protein. We trained SENSE-PPI on a human-virus PPI dataset [61] that contains interactions between human and viral proteins, i.e., only cross-species interactions are included. Data were split between training and testing so that test data only included interactions for Epstein-Barr and influenza viruses. The model was trained on the set with a large variety of viral species interacting with human proteins (for dataset details see Methods). We required that all human proteins involved in the test have been seen during training (this dataset corresponds to class "C2" in [13]), a condition that is reasonable in terms of potential applications.
The results of this test are presented in Table SI 3. The overall high scores may be explained by the fact that the model is already familiar with all the human proteins present in the data [13]. However, it should be noted that the viral proteins from training and testing are drastically different: by searching for the closest matches for every test viral protein in the training set (using MMseqs2, see Methods), we computed a mean sequence identity of 0.21 for the Epstein-Barr virus and of 0 for the influenza viruses (no matches were found even with sensitivity parameter values in the range of 7-10 [54]). Nonetheless, the protocol can be potentially useful for further exploration of interactions between different species, where the model is familiar with all the target proteins of one of the organisms of interest. The problem of predicting cross-species interactions between two unknown proteins remains wide open.

PPI network reconstruction
In order to verify the performance of SENSE-PPI on the ab initio reconstruction of PPI networks, we carried out several tests to reconstruct PPI networks for sets of a few dozen proteins from the STRING database (version 11.5) [58]. For each test, we set one or two proteins as "seeds" and collected known partners on the basis of high confidence for physical interaction according to STRING. We then calculated SENSE-PPI predictions for a comparative analysis with STRING data. Figures 2 and 3 show four examples of these tests, chosen to illustrate the general characteristics of SENSE-PPI behavior: 1. false positives are frequently related to indirect interactions, that is, non-physical interactions with proteins sharing a common partner in the network (Figure 2A); 2. SENSE-PPI often misses interactions for weakly connected proteins (Figure 2B); 3. it successfully identifies interactions involved in functionally distinguished subnetworks (Figure 2A and Figure 3); 4. it successfully predicts PPIs in other species (Figure 2C). These four sets group interacting human proteins (Figures 2A,B), C. elegans proteins (Figure 2C), and M. musculus proteins (Figure 3). The tests were carried out on the model trained on the D-SCRIPT human dataset.
Three sets of proteins are analyzed in Figure 2, where heatmaps display confidence scores for STRING (lower triangle) and prediction scores for SENSE-PPI (upper triangle). Although we used the same color spectrum for both sources, there is no direct correlation between the SENSE-PPI scale and the STRING scale, since our model was trained on binary labels, and STRING provides confidence estimates that were not taken into account during SENSE-PPI training (except for the fact that we only used STRING high-scoring interactions in the training set). Interactions used for training or validation were not used for testing (see black cells in heatmaps). An intuitive representation of the heatmap data is given by three associated graphs.
The first protein set (Figure 2A) is seeded on the C1R and RFC5 proteins from H. sapiens, and encompasses 12 other human proteins meant to interact with at least one seed protein. Proteins for which the majority of positive interactions in a given subset were already present during training have been omitted. C1R and RFC5 were chosen based on their subcellular localization: C1R is secreted in extracellular regions, while RFC5 is expressed in the nucleus. Moreover, these proteins possess different functional characteristics: while C1R is the first component of the classical pathway of the complement system within the immune system, RFC5 takes part in DNA replication in the cell cycle. SENSE-PPI clearly distinguishes the two functionally independent networks without any uncertainty, assigning very low scores to all protein pairs across the subnetworks. Some false positive interactions are present and strictly concentrated in the RFC5 complex. They concern the PCNA protein, which interacts, at different strengths in STRING, with RFC5 and many, but not all, of its partners.
For this set, SENSE-PPI inference of false positives is often observed for STRING highly connected proteins.In such cases, reconstructions are not completely false, as a considerable proportion of errors are made on protein pairs that may not interact physically, but are either functionally linked or homologous to actual physical partners.
The second set is seeded on the human TUFM protein and 12 other proteins (Figure 2B) known to belong to the TUFM complex as described in STRING with the highest scores. TUFM is an elongation factor localized in the mitochondrial outer membrane. SENSE-PPI shows an excellent reconstruction ability for this complex, with very high scores provided for the interactions even though multiple proteins were not seen in training.
However, a weakness is observed for proteins with a low degree of interaction in the network, such as TSFM, ATG5, and ENSP00000449544, for which SENSE-PPI fails to predict interactions. Indeed, SENSE-PPI tends to be conservative and does not assign positive labels to potential interactions, resulting in a considerable number of false negatives, especially those associated with low-degree nodes.
The third set is seeded on the T25G3.4 protein from C. elegans (Figure 2C), a protein localized in the mitochondrial inner membrane. SENSE-PPI is able to address the problem of an ab initio network reconstruction in other species relatively well. As before, we observe that high-degree nodes give rise to fewer errors, while other parts of the network are more difficult to predict. The heatmap clearly shows that while SENSE-PPI reliably predicts a complete subnetwork, the remaining interactions were predicted with lower scores.
It is important to point out that even though we use STRING both for training and final evaluation, the testing examples may contain low-scoring interactions, indicating that some parts of this information cannot be taken as ground truth and should be treated with caution.
To further check the ability of SENSE-PPI to correctly infer interactions in other organisms, we reconsidered as seeds the C1 complex (C1r, C1s, C1q) with its plasma protease C1 inhibitor Serping1, playing a role in the immune response, and the replication factor RFC complex with Rad17, forming a DNA damage checkpoint complex in the cell cycle, both in M. musculus. In the mouse, as in human, these two subnetworks are not supposed to share common partners, since the two complexes have drastically different functions and subcellular localizations. We compared SENSE-PPI to the D-SCRIPT and Topsy-Turvy models [52] (see also Table 1), where training was performed for all models on the human D-SCRIPT dataset. For each seed protein, we chose four partners with the highest confidence score in STRING. Figure 3 shows two heatmaps where the network reconstruction exhibits two separate clusters in SENSE-PPI predictions (lower triangle). D-SCRIPT (upper triangle, Figure 3A) correctly predicts the RFC-Rad17 complex network but fails on the C1-Serping1 complex.
Topsy-Turvy improves upon D-SCRIPT on the scoring values of interaction pairs in the RFC-Rad17 complex by being more confident in its predictions. On the C1-Serping1 complex, it identifies some interactions with variable confidence compared to D-SCRIPT, but much less sharply than SENSE-PPI. With an increased number of true positives, however, the number of false positives increases as well, and the model incorrectly finds nonexistent interactions between proteins belonging to different complexes. In contrast, with a clear separation between the C1-Serping1 and RFC-Rad17 complexes, SENSE-PPI shows the quality of its predictions.

Discussion
Current PPI experimental datasets are still far from complete, even for well-studied model organisms such as yeast or human [55,14], and their curation is an active area of research [46]. Today, physical interaction networks obtained by high-throughput techniques are found to include numerous non-functional PPIs [22] and, at the same time, to miss many true interactions. This is the main reason why an ab initio computational reconstruction of PPIs would provide invaluable information.
We introduced SENSE-PPI, a deep learning model for predicting protein-protein interactions based entirely on sequence information. Trained on a single species, SENSE-PPI predicts the proteome within the same species and generalizes to other species. Trained on physical interactions between proteins with stable structures, it generalizes to proteins interacting with IDPs. Trained on interactions between human and viral proteins, it generalizes to interactions between human proteins and the proteins of new viruses. As a general PPI reconstruction strategy for a non-model organism, the user must first identify the model organism(s) closest to the non-model species, train on them, and infer PPI interactions in the non-model species. SENSE-PPI is fast, beating the running times of other DL approaches and outperforming the particularly time-consuming docking approaches [27]. The major advantage of ESM2, used in SENSE-PPI, over other PLMs is that it is much faster (several minutes versus several hours compared to ProtT5-XL-UniRef50 [8], for example) while providing highly informative embedding matrices. Indeed, ESMFold, which works with a similar embedding, was able to compete with AlphaFold in terms of performance, being 60 times faster and enabling the reconstruction of over 600 million protein structural models [25]. The advantage of deep learning architectures over other computational attempts at PPI reconstruction is that training concentrates all the computational weight, while the combinatorics of the problem (i.e., the quadratic number of potential pairwise interactions among the proteins) is handled efficiently by the trained architecture. As expected, SENSE-PPI shows that the computational issue can be overcome, and that PPIs of tens of thousands of proteins can now be handled. Indeed, the identification of protein partners in a 100-protein dataset takes SENSE-PPI around 2 minutes with a single NVIDIA GeForce RTX 3090. Given that the computational time for ESM2 embeddings scales linearly with the number of proteins (the tensors are computed only once for each protein and then reused) and the prediction time scales quadratically, we can estimate that it takes around 2 hours to predict the partners of a set of 1,000 proteins and around 3 to 4 hours for 10,000.
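A back-of-envelope sketch of this scaling argument, with illustrative per-protein and per-pair timing constants (guesses, not measured values) chosen to land near the ~2 minutes reported for 100 proteins:

```python
def estimated_minutes(n, embed_s_per_protein=1.0, predict_s_per_pair=1e-4):
    """Rough runtime model: embedding cost grows linearly with the number
    of proteins, all-vs-all prediction cost grows quadratically. The two
    constants are illustrative guesses, not measured values."""
    embed = n * embed_s_per_protein                  # seconds for embeddings
    predict = n * (n + 1) / 2 * predict_s_per_pair   # seconds for all pairs
    return (embed + predict) / 60.0

for n in (100, 1_000, 10_000):
    print(n, round(estimated_minutes(n), 1))
```

Under these toy constants, 100 proteins take about 2 minutes and 10,000 proteins about 4 hours; the crossover where the quadratic prediction term dominates the linear embedding term depends entirely on the true per-unit costs of the hardware used.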
Another major advantage of SENSE-PPI over structure-based approaches to protein interactions, including docking, is that SENSE-PPI achieves far better results based solely on sequences, opening up new directions for development. The difficulties associated with intrinsic disorder, protein instability, or transient interactions, which are inherent in protein structures, are avoided by sequence-based methods, offering a truly revolutionary opportunity to accelerate and create innovative applications in the field of proteomics.
To conclude, even though the ab initio reconstruction of PPI networks for model species has impressively improved over the years, increasing the accuracy of the inference within a species remains a problem, especially if we wish to tackle questions on the specialized interactions of a protein in different tissues, or to differentiate the interactions of paralogous proteins within the organism. One expects progress in this direction through the development of deep learning architectures that will help us to interpret and disentangle interaction signals better than is possible now. Also, searching for protein interactions in a species without sufficient training information, and training on a model species that is phylogenetically very far from the species used in testing, are still difficult tasks. Testing on multiple species remains a wide-open problem. This makes clear that the PPI network reconstruction problem, even when we consider proteins that do not undergo post-transcriptional changes, remains far from solved, but current achievements can, nonetheless, be applied to the screening of large sets of proteins in search of their partners for many biological questions.
In conclusion, SENSE-PPI is a deep learning model that provides results that were not conceivable a few years ago, with an accuracy that could not be reached either by state-of-the-art molecular docking approaches applied to proteins with known structures or by the deep learning methods that preceded it.

Guo's yeast dataset: a dataset with proteins shared in training and testing
The Guo dataset [12] comprises 2,497 proteins and 11,188 interactions, with an equal number of positive and negative samples. The 5,943 positive interactions were extracted from the DIP 20070219 database [45].
Sequences are more than 50 amino acids long and the majority of them exhibit pairwise sequence identities < 40% [12]. The negative samples were generated through a random pairing of proteins that appeared in the positive data but lacked any evidence of interaction. Pairs were subsequently filtered based on the subcellular localization of the proteins, thereby excluding non-interacting pairs residing in the same location. No further filtering was applied to this dataset, to allow a fair comparison with models already evaluated on it.

The D-SCRIPT datasets from human and other species
A large human dataset was constructed to test the D-SCRIPT deep learning model for the prediction of PPIs [53].
The positive interactions were extracted from STRING (version 11) [56], a database encompassing diverse PPI networks and consolidating comprehensive information from various primary sources. They are high-confidence interactions supported by a positive experimental evidence score in STRING. Protein sequences are between 50 and 800 amino acids in length, and protein pairs exhibit pairwise sequence identities < 40%. By construction, any two protein pairs A-B and C-D in the dataset are non-redundant, that is, the sequence pairs A, C and B, D each share < 40% sequence identity. Eliminating redundancy is crucial, as it prevents the model from relying solely on sequence similarity when predicting interactions. To establish an appropriate balance between positive and negative PPIs, negative interactions were generated by randomly picking two proteins from distinct pairs in the non-redundant set of positive interactions. These negative samples were created in a 1:10 positive-to-negative ratio with the aim of better approximating the frequency of positive interactions observed in nature. These construction criteria lead to a primary dataset consisting of 47,932 positive and 479,320 negative protein interactions for H. sapiens. These interactions were divided into training and testing sets using an 80%-20% ratio. Additionally, we used 4 smaller datasets of M. musculus, D. melanogaster, C. elegans and S. cerevisiae from [53], made using the procedure described above and containing 5,000 positive and 50,000 negative interactions each.

The SENSE-PPI human dataset
We constructed a second human PPI dataset directly from the STRING database (version 11.5) [58] by focusing exclusively on physical interactions from H. sapiens. Similar to the D-SCRIPT dataset, we only considered proteins in a length range from 50 to 800 amino acids, in order to optimize computational resources and to avoid potential biases introduced by very short or very long sequences. A STRING confidence score threshold of 0.5 was applied, and interactions obtained by homology transfer were excluded. We also removed redundant interactions through a clustering strategy that first uses MMseqs2 to cluster proteins at 40% sequence identity, and then eliminates redundant pairs as done for the D-SCRIPT dataset. The most important difference between the human D-SCRIPT dataset and the human SENSE-PPI dataset lies in a more stringent condition for constructing negative pairs for training, named here the "neighboring-exclusion" condition. Indeed, a second round of clustering is performed on the proteins in the non-redundant dataset using a 40% sequence identity threshold. Then, a random pair of proteins, A-B, is added to the negative set if no interaction between A and B is known in STRING, and B does not share its cluster with any known interactor of A. This condition ensures that negative pairs are chosen both on the basis of the absence of known interactions and by minimizing the potential biases introduced by homology in shared clusters. As above, in order to reflect the natural frequency of interactions observed in biological systems, the positive-to-negative ratio in the number of protein pairs was set to 1:10. This ratio aims to approximate the relative scarcity of true positive interactions, thus ensuring a more representative distribution of positive and negative examples in the dataset. Overall, the dataset contains 86,304 positive and 863,040 negative interactions.
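The neighboring-exclusion rule above can be sketched in a few lines. This is an illustrative reimplementation, not the SENSE-PPI code: all names (`is_valid_negative`, `sample_negatives`, the toy data structures) are ours, the cluster assignments stand in for MMseqs2 output, and we assume the exclusion is applied symmetrically to both proteins of the pair.

```python
import random

def is_valid_negative(a, b, positives, clusters, partners):
    """A-B is a valid negative if no interaction is known and neither protein
    falls in the same 40%-identity cluster as a known interactor of the other
    (symmetric check, assumed here)."""
    if frozenset((a, b)) in positives:
        return False
    if any(clusters[b] == clusters[p] for p in partners.get(a, ())):
        return False
    if any(clusters[a] == clusters[p] for p in partners.get(b, ())):
        return False
    return True

def sample_negatives(proteins, positives, clusters, ratio=10, seed=0):
    """Draw random pairs until the 1:ratio positive-to-negative target is met."""
    partners = {}
    for pair in positives:
        x, y = tuple(pair)
        partners.setdefault(x, set()).add(y)
        partners.setdefault(y, set()).add(x)
    rng = random.Random(seed)
    negatives, target, attempts = set(), ratio * len(positives), 0
    while len(negatives) < target and attempts < 100 * target:
        attempts += 1
        a, b = rng.sample(proteins, 2)
        if is_valid_negative(a, b, positives, clusters, partners):
            negatives.add(frozenset((a, b)))
    return negatives
```

The cluster check is what distinguishes this scheme from plain random sampling: a close homolog of a known partner can never be labeled as a negative.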

Ten SENSE-PPI datasets from model and non-model organisms: training on a species and testing on others
We used the same methodology employed for the construction of the SENSE-PPI human dataset to build test sets on four non-model organisms, E. caballus, B. taurus, N. scutatus and A. pisum, and six model organisms, M. musculus, D. melanogaster, C. elegans, G. gallus, S. cerevisiae and E. coli. All these test sets contain 15,000 positive and 150,000 negative entries that were extracted from STRING v12.0 [57]. In order to conduct additional testing on the non-model organism A. pisum, we also constructed a separate training dataset by merging data from D. melanogaster and C. elegans. This dataset contains 23,324 positive and 233,240 negative interactions coming from D. melanogaster and 17,835 positive and 178,350 negative interactions from C. elegans, extracted from STRING v12.0.

The IDPpi dataset describing the interactions between structured proteins and IDPs
The IDPpi dataset [39] contains interactions with intrinsically disordered proteins (IDPs). It is organized in five distinct subsets corresponding to the testing sets in [39]. Each subset comprises 3,500-4,000 pairs of proteins from H. sapiens where one of the partners is intrinsically disordered. Pairs were extracted from the Human Integrated Protein-Protein Interaction rEference (HIPPIE) database [47]. Negative interactions were defined through random sampling: two proteins (one an IDP and one with a stable structure) were selected randomly and assumed to be non-interacting unless they were already known to interact. The positive-to-negative ratio of each testing set is 1:1.

Human-Virus PPI dataset
This dataset [61] was designed in order to predict interactions between H. sapiens and different viruses.
Interacting protein pairs were extracted from the Human-Virus Protein-Protein Interactions database (HVPPI) [65], and negative pairs were constructed using the dissimilarity-based negative sampling protocol, which uses a sequence similarity-based method to select protein pairs that are unlikely to interact [61].
The training set includes 7,371 positive and 117,326 negative interactions between human and viral proteins and contains 16,933 human and 676 viral proteins in total. The viruses span many families, and the full list of species included in training can be found in the SENSE-PPI repository (data/human virus folder).
We constructed two test sets. The first consists of interactions involving proteins from the Epstein-Barr virus. This set encompasses 99 viral proteins and 9,426 human proteins, of which only 596 human proteins interact with Epstein-Barr proteins. The set contains a total of 1,308 positive and 14,428 negative interactions. The second dataset exclusively comprises interactions featuring proteins associated with Influenza viruses of types A, B, and C. This dataset includes 4,080 positive and 15,724 negative pairs, incorporating a total of 95 viral proteins and 10,594 human proteins. Among these, 1,533 human proteins are involved in interactions with Influenza virus proteins. Notably, all proteins from both the Influenza and Epstein-Barr viruses were excluded from the training process. However, it is important to mention that all human proteins present in the test data were included in the training set at least once.
Sequence identity scores for proteins in testing sets. Given a pair of proteins in a testing set, we define the 'mean pair sequence identity' as the mean of the sequence identities of the two proteins relative to the training set. Note that for each protein, we consider the maximum of all sequence identity values between the protein and all proteins in the training set. The computations were performed using the mmseqs search command of the MMseqs2 suite [54]. In this particular case, in order to calculate the sequence identity, MMseqs2 computes the ratio of identical aligned residues to the total number of aligned columns, which includes columns containing a gap in either sequence. This is achieved by using --alignment-mode 3. If the search fails to find any similarity with a protein in the training set, the sequence identity value is set to 0 by default.
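The aggregation step of this definition (maximum per protein, then mean per pair) is small enough to spell out. In this sketch, `hits` is a hypothetical mapping from a test protein to the identity scores of its `mmseqs search` hits against the training set; the parsing of the MMseqs2 output itself is omitted.

```python
def max_identity(protein, hits):
    """Maximum sequence identity of a test protein to any training protein;
    0.0 when the search returned no hit (the paper's default)."""
    return max(hits.get(protein, [0.0]), default=0.0)

def mean_pair_identity(protein_a, protein_b, hits):
    """Mean pair sequence identity: average of the two per-protein maxima."""
    return 0.5 * (max_identity(protein_a, hits) + max_identity(protein_b, hits))
```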
Evaluation measures. To properly evaluate SENSE-PPI performance, we rely on a ground truth set by reference databases such as the STRING database and on the following quantities: known interactions that are identified by SENSE-PPI (true positives, TP), interactions identified by SENSE-PPI which are not known (false positives, FP), interactions that are not found by SENSE-PPI but are known (false negatives, FN), and interactions that are not known and are not detected by SENSE-PPI (true negatives, TN). We use four standard measures of performance: recall (sensitivity) = TP/(TP + FN), precision (positive predictive value) = TP/(TP + FP), F1-score = 2TP/(2TP + FP + FN), and the Matthews correlation coefficient MCC = (TP · TN − FP · FN)/K, where K = √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
We also use the precision-recall metric, showing the tradeoff between precision and recall for different thresholds. A high area under the curve (AUPRC) represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall). Another measure is the AUROC, calculated as the area under the receiver operating characteristic (ROC) curve, showing the trade-off between true positive rate (TPR) and false positive rate (FPR) across different decision thresholds.
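The four threshold-dependent measures above translate directly from the confusion counts; this is a plain transcription of the formulas, with a guard against a zero denominator in the MCC.

```python
import math

def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

def f1_score(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    # K = sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); MCC is 0 when K vanishes
    k = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / k if k else 0.0
```

MCC is the natural headline metric for the 1:10 imbalanced test sets used here, since unlike accuracy it penalizes both false positives and false negatives relative to the class sizes.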

Design of the SENSE-PPI architecture
Recent advancements in deep learning for protein science have led to the introduction of pre-trained language models (PLMs) [33], which were developed for natural language processing to learn compressed, informed, and abstract data representations, and were later transferred to proteins by adapting them to amino acid sequences.
PLMs have demonstrated tremendous potential for a broad range of protein-related problems, such as predicting 3D shapes, interactions, mutational outcomes, and subcellular localizations [42,1,15,40]. They have been trained on large amounts of sequence data, such as the UniProt database [6]. Here, we use ESM2 [25], a state-of-the-art general-purpose PLM that has been benchmarked on predicting structure, contact sites, and other protein properties. We used the ESM2 version with 3B parameters, trained on the UniRef50 dataset.
SENSE-PPI is designed as a Siamese architecture (Figure 4) consisting of two identical modules with shared weights that are merged together to provide a single output value p ∈ [0, 1] describing the confidence in the interaction, where 0 and 1 stand for low and high confidence, respectively. The Siamese design is needed to ensure the commutativity of the model: the output score has to be the same for pairs A-B and B-A.
The model takes two amino acid sequences of variable size (L, M) as input. Sequences longer than 800 amino acids are trimmed to this maximum size. The 36 layers of the PLM ESM2 are used to define two embeddings, of size L × 2560 and M × 2560, for the two amino acid sequences, respectively. Each sequence is processed by one of the Siamese modules, which shares its weights with the second module. The Siamese module is composed of 3 layers of gated recurrent units (GRUs) which process the input sequence from one end to the other. Each GRU layer is bidirectional: half of the units are used for one direction of processing, and the other half for the opposite direction. The final layer produces a vector of length 256; it takes only the last output of the GRU to perform a many-to-one type of processing, where a sequence is "projected" to a single number for each unit. The gated recurrent units process sequences that are masked beforehand: even though all the sequences are padded to the same length to fit the model, the GRU processes only the part that contains the actual sequence and skips the padding. After the GRU layers, the two output vectors are combined via the Hadamard product. The resulting vector is passed through two linear layers in order to compute the final score. A dropout of 0.5 is applied at every layer (GRU and linear) except for the ESM2 modules.
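Why this design is commutative can be seen in a toy mock-up. Here the ESM2 embeddings are random data and the GRU stack is replaced by mean pooling plus a linear map (purely illustrative stand-ins); only the weight sharing between the two branches and the Hadamard merge mirror the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, HID = 2560, 256
W_enc = rng.normal(size=(EMB, HID)) / np.sqrt(EMB)  # shared "encoder" weights
w_out = rng.normal(size=HID)                        # shared output weights

def encode(seq_embedding):
    # (L, EMB) -> (HID,): a stand-in for the bidirectional GRU stack
    return np.tanh(seq_embedding.mean(axis=0) @ W_enc)

def score(emb_a, emb_b):
    h = encode(emb_a) * encode(emb_b)       # Hadamard product is commutative
    return 1.0 / (1.0 + np.exp(-h @ w_out)) # confidence in [0, 1]

a = rng.normal(size=(120, EMB))  # mock embeddings for two proteins
b = rng.normal(size=(300, EMB))
assert np.isclose(score(a, b), score(b, a))
```

Because both branches apply the same weights and the merge is element-wise multiplication, swapping the inputs cannot change the output, so no A-B/B-A averaging is needed at inference time.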

Other deep learning architectures used for comparison
We conducted a comparative analysis between SENSE-PPI and several existing sequence-based deep learning models, namely PIPR [4], D-SCRIPT [53], Topsy-Turvy [52] and STEP [28].
PIPR is an end-to-end Siamese architecture that incorporates a deep residual recurrent convolutional neural network. This architecture integrates multiple occurrences of convolution layers and residual gated recurrent units. Each amino acid is represented by an embedding that captures its contextual and physico-chemical relationships within the sequence. PIPR uses a multigranular feature aggregation process, effectively leveraging the sequential and robust local information present in protein sequences.
D-SCRIPT is a deep learning method specifically designed to predict physical PPIs. It features an LSTM-based PLM for sequence embedding. This is followed by the computation of an outer product between two vector representations, resulting in a three-dimensional tensor. Linear and convolutional layers are then applied to preprocess the tensor, enabling a contact map to be predicted and the probability of interaction to be deduced. D-SCRIPT addresses the challenges associated with the weak cross-species generalization exhibited by previous state-of-the-art models. In addition, it establishes a significant overlap with contact maps for protein complexes with known 3D structures.
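The outer-product step can be illustrated with a generic broadcast: two per-residue representations of shapes (L, d) and (M, d) combine into an (L, M, d) tensor with one feature vector per residue pair, from which a contact map can then be derived. Note that this is a generic sketch, not D-SCRIPT's exact featurization.

```python
import numpy as np

L, M, d = 5, 8, 16            # toy lengths and feature dimension
h1 = np.random.rand(L, d)     # per-residue representation of protein 1
h2 = np.random.rand(M, d)     # per-residue representation of protein 2

# broadcasted outer product: one d-dimensional feature vector per residue pair
pair_tensor = h1[:, None, :] * h2[None, :, :]
assert pair_tensor.shape == (L, M, d)

# collapsing the feature axis yields an L x M map of pairwise scores,
# the raw material for a predicted contact map
contact_logits = pair_tensor.sum(axis=-1)
assert contact_logits.shape == (L, M)
```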
The Topsy-Turvy model combines a traditional sequence-based approach (D-SCRIPT) with a network-based approach (GLIDE [7]) for PPI inference. While relying exclusively on sequence data for predictions, Topsy-Turvy employs a transfer-learning strategy during training, assimilating insights from global and molecular views of protein interactions. The model achieves state-of-the-art results, enabling genome-scale PPI predictions, and has been applied to non-model organisms.

STEP has been developed primarily for the prediction of virus-host PPIs. The core component of the STEP architecture is a BERT-based PLM, which generates sequence embeddings. These embeddings are combined through element-wise multiplication and further processed by linear layers.

Implementation details
The implementation of SENSE-PPI was carried out using Python 3 and the PyTorch deep learning framework [36]. We trained two versions of the model: a light version, used only for intermediate testing (see Fig. 1E), and a regular version that produced all other results presented here. The regular version is presented in two final forms: one trained on the D-SCRIPT human dataset and one trained on the SENSE-PPI human dataset.

The "predict string" command can be used to calculate the predictions for a protein and its known partners according to STRING. The script takes a given number of proteins that are known to interact with a protein of interest, then performs all-versus-all predictions and returns the results along with a visualization where one can easily compare the model's score with real data in STRING. Multiple proteins of interest can be used as input at the same time.

The "create dataset" command creates a dataset for a given species from STRING, based on the algorithm described in Section "The SENSE-PPI human dataset". The taxon ID is used here as the main input.

The package also contains an internal script to compute ESM2 embeddings. This script is a modification of the original code found in the ESM repository [25]; it was slightly edited in order to automate processes during training and testing with SENSE-PPI. All datasets used in this study are also provided, including both those newly created and those taken from other publications. The two trained versions of SENSE-PPI, on the D-SCRIPT and SENSE-PPI human datasets, are available in the SENSE-PPI repository in the pretrained models folder, under the names dscript.ckpt and senseppi.ckpt, respectively. In addition, this folder also contains pretrained versions used for other tests presented in this work, such as the human-virus model, and models based on a combination of model species (fly/worm as well as fly/worm/human).

SENSE-PPI and IDPpi reach comparable performances but were trained differently. Whereas the IDPpi training set included the IDPs in the test data, SENSE-PPI was trained on the D-SCRIPT human dataset, which contained no information on IDPs. The same performances obtained by the two approaches suggest that, by providing more extensive training (note that the D-SCRIPT dataset is 20 times larger than the IDPpi training dataset), SENSE-PPI can obtain relatively accurate predictions on IDP interactions. This test once again highlights the generalization capabilities of SENSE-PPI, enabling us to extend its use to process IDPs with fairly good accuracy.

The high performance of SENSE-PPI in single model species predictions has been demonstrated here to open up new possibilities in PPI network reconstruction for non-model organisms, which are often characterized by very sparse or absent biological information. Our analysis shows that training on a model species such as H. sapiens and testing on a phylogenetically distant species such as C. elegans still provides excellent inference. Moreover, we show that if we construct a training dataset out of model organisms that are evolutionarily close to a non-model species, we are able to improve the performance: we succeeded in improving the test scores on A. pisum by switching the training data from H. sapiens to a combination of C. elegans and D. melanogaster.

Figure 1: Performance of SENSE-PPI on six model and four non-model organisms. SENSE-PPI is trained on the SENSE-PPI human dataset. For each experiment, the testing dataset comprises 5,000 positive and 50,000 negative interactions. A: Matthews correlation coefficient (MCC) versus mean pair sequence identity (see Methods) for the ten experiments. Dotted line: least squares polynomial fit with degree 4. B: Phylogenetic tree of species used for SENSE-PPI testing and performance evaluation. The color scale, labeling the species in the tree, corresponds to the MCC value obtained in testing (as reported in A); colors go from blue (high MCC) to yellow (low MCC). Non-model species are shown with a grey disk behind them. The tree was reconstructed using phyloT v2 (http://phylot.biobyte.de/). Evolutionary information on species divergence times was obtained with TimeTree [18]. F1 and MCC scores are reported in Table SI 4. The MCC score for H. sapiens trained on H. sapiens is 0.836. C and D: Plots of MCC scores versus mean pair sequence identity for model (C) and non-model (D) organisms. E and F: Comparison of two versions of SENSE-PPI differing by the size of the ESM2 embedding module. The trend of SENSE-PPI MCC scores versus the mean pair sequence identity is shown for 4 different testing datasets coming from [53], for the "light" version (E) and the regular version (F). G: Distribution of pairs in the four test sets for panels E-F based on the mean pair sequence identity. The data on panels C-G are presented in bins of 10% across the values of mean pair sequence identity. The MCC on panels C-F is computed for a given bin only if it contains at least 40 positive and 400 negative samples.

Figure 2: SENSE-PPI ab initio reconstruction of PPI networks and comparison with data from the STRING database. The three sets of potential protein partners are seeded on the C1R and RFC5 proteins in H. sapiens (A), the TUFM protein in H. sapiens (B), and the T25G3.4 protein in C. elegans (C). Each network is described by a heatmap (left) and a graph of interactions (right), where nodes are proteins and edges are interactions. The bottom triangle of the heatmap reports the STRING confidence scores (v11.5), while the upper triangle describes SENSE-PPI prediction scores. Positive and negative interactions present in training/validation are omitted from testing (black masked cells). The graphs represent predicted connections and trace (with distinct colors) information coming from both the STRING database and the training set. Nodes represent proteins present at least once in training (orange), and proteins present only in testing (grey). Four kinds of interactions are essential for comparison: interactions found in STRING and identified by SENSE-PPI (TP), interactions identified by SENSE-PPI that are not present in STRING (FP), interactions that are not found by SENSE-PPI but are present in STRING (FN), and interactions not found in STRING and not detected by SENSE-PPI (TN). Hence, edges show TP present in training (grey) and absent in training (red), FP (blue), and FN (dashed black). The absence of an edge indicates a TN. The SENSE-PPI prediction threshold used for graph reconstruction is 0.5 for the three networks.

Figure 3: PPI network reconstruction seeded on the protein complexes C1-Serping1 and RFC-Rad17 in M. musculus. A. SENSE-PPI trained on the D-SCRIPT human dataset was used to reconstruct two networks with independent functionality in the cell. None of the proteins in the sets were observed during training, hence the full set of predicted interactions was considered in the evaluation. The heatmaps show SENSE-PPI (lower triangle), D-SCRIPT (upper triangle, left), and Topsy-Turvy (upper triangle, right) predicted scores, for all possible protein pairs in the set. According to STRING v11.5, all RFC-Rad17 partners and all C1-Serping1 partners physically interact with each other, while no interaction between these two groups of proteins is known (see B). B. Two graphs representing STRING data on the two complexes. No mouse protein (gray) has been seen in training, which is done on H. sapiens. Edge colors represent the STRING confidence score for the interactions. Color scales as in Figure 2.

Figure 4: Schematic representation of SENSE-PPI. A. SENSE-PPI takes two input sequences of lengths L and M through the ESM2 Embedding Module (orange) and transforms them into two large tensors of size L × 2560 and M × 2560, respectively. The two matrices are then independently processed by the Siamese GRU Module (green). The final output vectors are combined together using the Hadamard product and further processed with two linear layers in order to calculate the final interaction score in the interval [0, 1]. B. Details of the ESM2 Embedding Module mapping each amino acid in the input sequence into a vector of length 2560. The module is essentially the ESM2 model, which comprises 36 layers (3B parameters version). The output of the module is a per-token representation of the input. C. Details of the Siamese GRU Module composed of three bidirectional GRU layers. The last GRU layer follows a "many-to-one" output scheme that produces a vector of a unified length of 256 (128 for one direction and 128 for the opposite one) for proteins of any length. Since all the sequences are padded to the same fixed size during training, every GRU layer was implemented to be dynamical. The layers were configured to accept a sequence mask as a secondary input: this was done in order to process positions in a sequence that belong exclusively to the input data and skip all the padding.
The light version of SENSE-PPI has a smaller model size, with an ESM2 embedding block containing 35M parameters and a classification head of 1M parameters. In contrast, the regular version of SENSE-PPI has an ESM2 embedding block comprising 3B parameters and a classification head of 2M parameters. The ESM2 module is not retrained; the only trainable component is the classification head of the model. Each GRU layer contains 256 units followed by a dropout of 0.5. The three final linear layers have 256, 32, and 1 neurons, respectively. The training was performed using the AdamW optimizer, a batch size of 32, and a learning rate of 10^-4, with the exception of Guo's yeast dataset, for which we used a batch size of 64 and a learning rate of 10^-3. The model was trained on a single NVIDIA GeForce RTX 3090 GPU; on this setup, the training process on the SENSE-PPI human dataset of approximately 1M interactions took approximately five days to complete for the regular version.