Deep contrastive learning enables genome-wide virtual screening

Numerous protein-coding genes are associated with human diseases, yet approximately 90% of them lack targeted therapeutic intervention. While conventional computational methods such as molecular docking have facilitated the discovery of potential hit compounds, genome-wide virtual screening against the expansive chemical space remains a formidable challenge. Here we introduce DrugCLIP, a novel framework that combines contrastive learning and dense retrieval to achieve rapid and accurate virtual screening. Compared to traditional docking methods, DrugCLIP improves the speed of virtual screening by several orders of magnitude. In terms of performance, DrugCLIP not only surpasses docking and other deep learning-based methods across two standard benchmark datasets but also demonstrates high efficacy in wet-lab experiments. Specifically, DrugCLIP successfully identified agonists with < 100 nM affinities for 5HT2AR, a key target in psychiatric diseases. For another target, NET, whose structure was only recently solved and is not included in the training set, our method achieved a hit rate of 15%, with 12 diverse molecules exhibiting affinities better than bupropion. Additionally, two chemically novel inhibitors were validated by structure determination with Cryo-EM. Building on this foundation, we present the results of a pioneering trillion-scale genome-wide virtual screening, encompassing approximately 10,000 AlphaFold2-predicted proteins within the human genome and 500 million molecules from the ZINC and Enamine REAL databases. This work provides an innovative perspective on drug discovery in the post-AlphaFold era, where comprehensive targeting of all disease-related proteins is within reach.


Introduction
The human genome comprises approximately 20,000 protein-coding genes (1), many of which are related to a variety of diseases. Despite this, only about 10% of these genes have been successfully targeted by FDA-approved drugs or have documented small-molecule binders in the literature (2). This leaves a substantial portion of the druggable genome largely unexplored, representing a promising opportunity for therapeutic innovation. The scientific community is eager to translate biologically relevant targets into pharmaceutical breakthroughs. However, most researchers lack access to advanced high-throughput screening equipment or sufficient computational power to perform comprehensive virtual screenings. Additionally, proteins often function as parts of families or pathways, indicating that targeting single proteins may not always be the most effective strategy (3,4). These limitations can significantly reduce the success rate of drug discovery, especially for new targets. Therefore, developing a comprehensive chemical database containing genome-wide virtual screening results would be an invaluable asset for the biomedical research community, with the potential to significantly accelerate the discovery of new drugs.
Given the impracticality of experimentally screening all human proteins, virtual screening has emerged as the only viable approach to tackle the vast number of potential targets. In classical computer-aided drug discovery (CADD), molecular docking serves as a foundational technique for target-based virtual screening. Despite advancements in simplified scoring functions, optimized algorithms, and hardware acceleration (5-9), molecular docking remains time-intensive, often requiring several seconds to minutes to evaluate each protein-ligand pair. For example, a recent large-scale docking campaign took two weeks to screen 1 billion molecules against a single target, even with the use of 10,000 CPU cores (10). As a result, the computational demands for genome-wide virtual screening are prohibitively high, rendering such efforts impractical with existing technologies.
Artificial intelligence holds great promise for drug discovery. Various deep learning methods have been developed for virtual screening, focusing on predicting ligand-receptor affinities (11-13). Yet, applying these methods to large-scale virtual screening still faces significant challenges. A primary issue is the inconsistency of affinity values due to heterogeneous experimental conditions (14,15), which may negatively impact the performance of the trained model. Moreover, a notable distribution shift between training datasets and real-world testing scenarios hinders the generalizability of AI models, as real-world virtual screenings often involve a larger proportion of inactive molecules than those represented in the curated training sets (16). Additionally, the computational demands of deep learning models, with millions of parameters, pose a crucial bottleneck in inference speed, especially as chemical libraries and target numbers grow. Consequently, there is an urgent need for more efficient and robust AI methodologies to address these challenges.
In this work, we introduce DrugCLIP, a novel contrastive learning approach for virtual screening. Contrastive learning has demonstrated significant success in various applications such as image-text retrieval (17), enzyme function annotation (18), and protein homology detection (19). The core innovation of DrugCLIP lies in its ability to distinguish potent binders from non-binding molecules for a given protein pocket by aligning their representations. This approach effectively mitigates the impact of noisy affinity labels and chemical library imbalances that have traditionally challenged virtual screening efforts. Moreover, the inference of DrugCLIP is highly efficient, approximately 100,000 times faster than traditional docking methods and 100 times faster than current machine learning-based approaches.
Comprehensive in silico and wet-lab evaluations were conducted to assess the accuracy of the DrugCLIP model. Our model achieved state-of-the-art performance on two widely recognized virtual screening benchmarks, DUD-E (20) and LIT-PCBA (21), outperforming traditional docking-based screening methods and other deep neural networks. To further validate its performance, DrugCLIP was applied to screen molecules for two real-world targets: 5HT2AR (5-hydroxytryptamine receptor 2A) and NET (norepinephrine transporter). Remarkably, our model identified chemically diverse binders with adequate affinities, which were further validated through functional assays and structure determination. These results provide compelling evidence of the efficacy of our virtual screening method.
Finally, a genome-wide virtual screening was conducted using DrugCLIP on all human proteins predicted by AlphaFold2 (22,23). In this process, we first defined pockets for the AlphaFold predictions using structure alignment (24), pocket detection software (25), and generative AI models. Next, we screened over 500 million drug-like molecules from the ZINC (26,27) and Enamine REAL (28) databases against the identified pockets. Notably, this unprecedented large-scale virtual screening was completed in just 24 hours on a single computing node equipped with 8 A100 GPUs. Lastly, we applied a CADD cluster-docking pipeline to select chemically diverse and physically plausible molecules for each pocket. This resulted in a dataset containing over 2 million potential hits targeting more than 20,000 pockets from around 10,000 human proteins. To the best of our knowledge, this is the first virtual screening campaign to perform more than 10 trillion scoring operations on protein-ligand pairs, covering nearly half of the human genome. All molecules, scores, and poses have been made freely accessible at https://drug-the-whole-genome.yanyanlan.com, facilitating further research in drug discovery on a genome-wide scale.
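As a rough consistency check on the scale quoted above, the figures can be multiplied out. The calculation below is only a back-of-the-envelope sketch: it assumes that every pocket is scored against every molecule and uses the rounded lower bounds given in the text (20,000 pockets, 500 million molecules, 24 hours, 8 GPUs).

```latex
\begin{align*}
N_{\text{pairs}} &\approx (2\times10^{4}\ \text{pockets}) \times (5\times10^{8}\ \text{molecules}) = 10^{13}\ \text{pairs}\\
\text{throughput} &\approx \frac{10^{13}\ \text{pairs}}{24 \times 3600\ \text{s}} \approx 1.2\times10^{8}\ \text{pairs/s per node} \approx 1.4\times10^{7}\ \text{pairs/s per GPU}
\end{align*}
```

Under these assumptions the product matches the "more than 10 trillion" scoring operations reported for the campaign.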

Results
The design and in-silico evaluation of the DrugCLIP model
Unlike previous machine learning models that relied on regression to directly predict protein-ligand affinity values, DrugCLIP (Fig. 1) redefines virtual screening as a dense retrieval task. The key innovation lies in its training objective, which aims to learn an aligned embedding space for protein pockets and molecules, encoded by separate neural networks. Vector similarity metrics can then be employed to reflect their binding probability. Using a contrastive loss during training, the similarity between protein pockets and their binders (positive protein-ligand pairs) is maximized, whereas the similarity between protein pockets and molecules binding to other targets (negative protein-ligand pairs) is minimized.
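The training objective described above can be illustrated with a small in-batch contrastive loss. This is a minimal sketch, not the authors' implementation: the symmetric CLIP-style form, the `temperature` value of 0.07, the function names, and the random toy embeddings are all assumptions made for illustration. Row i of each array is treated as a positive pocket-ligand pair, and every other row in the batch acts as a negative.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(pocket_emb, mol_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss (hypothetical sketch).

    pocket_emb, mol_emb: (B, D) arrays; row i of each forms a positive pair.
    """
    p = pocket_emb / np.linalg.norm(pocket_emb, axis=1, keepdims=True)
    m = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    logits = p @ m.T / temperature          # (B, B) scaled cosine similarities
    idx = np.arange(len(p))
    # Pocket-to-molecule: each pocket should pick out its own ligand.
    loss_p2m = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Molecule-to-pocket: each ligand should pick out its own pocket.
    loss_m2p = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2m + loss_m2p)

rng = np.random.default_rng(0)
pockets = rng.normal(size=(8, 16))
# Toy check: nearly identical embeddings (aligned pairs) versus unrelated ones.
aligned = contrastive_loss(pockets, pockets + 0.01 * rng.normal(size=(8, 16)))
shuffled = contrastive_loss(pockets, rng.normal(size=(8, 16)))
```

In this toy setting the loss is lower when paired embeddings are aligned than when they are unrelated, which is the behaviour the objective is meant to reward.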
The training process of DrugCLIP includes two stages: pretraining and fine-tuning.
The molecule and pocket encoders are pretrained with large-scale synthetic data and are further refined using experimentally determined protein-ligand complex structures during fine-tuning.
In the pretraining stage, the molecule encoder is initialized with Uni-Mol, a well-established molecule encoder. With the molecule encoder frozen, the pocket encoder is randomly initialized and trained to align with the molecule encoder using contrastive learning (Fig. 1B). We developed a Protein Fragment-Surrounding Alignment (ProFSA) framework (Fig. 1A) to generate large-scale synthetic data specifically tailored for contrastive pretraining. In this approach, short peptide fragments are extracted from protein-only structures to serve as pseudo-ligands, while their surrounding regions are designated as pseudo-pockets. Applying the ProFSA framework to PDB (29) data yielded 5.5 million pseudo-pocket and ligand pairs to facilitate the pretraining. The trained pocket encoder has been evaluated across various downstream tasks such as pocket matching (Supplementary Table 1), pocket property prediction (Supplementary Table 2) and protein-ligand affinity prediction (Supplementary Table 3). Experimental results demonstrate that our pretrained pocket encoder exhibits strong performance, even in a zero-shot setting, outperforming many supervised learning-based models as well as physical and knowledge-based models. These results underscore the success of the pretraining stage in obtaining meaningful molecule and pocket representations.
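The fragment-and-surroundings idea behind ProFSA can be sketched on toy coordinates. The code below is an illustrative simplification under stated assumptions, not the published pipeline: it uses one representative atom per residue, and the distance `cutoff`, the `exclude_flank` parameter, and the function name `pseudo_pair` are hypothetical. Steps such as terminal correction are omitted.

```python
import numpy as np

def pseudo_pair(coords, start, length, cutoff=8.0, exclude_flank=2):
    """Return (fragment_idx, pocket_idx) for one pseudo-ligand/pseudo-pocket pair.

    coords: (N, 3) array with one representative atom per residue (assumption).
    cutoff, exclude_flank: hypothetical parameters chosen for illustration.
    """
    n = len(coords)
    frag = np.arange(start, start + length)
    # Drop residues flanking the fragment in sequence so the pocket is not
    # dominated by trivial chain-adjacent contacts.
    lo = max(0, start - exclude_flank)
    hi = min(n, start + length + exclude_flank)
    candidates = np.concatenate([np.arange(0, lo), np.arange(hi, n)])
    # Distance from every candidate residue to every fragment residue.
    d = np.linalg.norm(coords[candidates, None, :] - coords[None, frag, :], axis=-1)
    pocket = candidates[d.min(axis=1) <= cutoff]
    return frag, pocket

# Toy hairpin: residues 0-9 run along x, residues 10-19 run back 5 Å away.
i = np.arange(10)
strand1 = np.stack([3.8 * i, np.zeros(10), np.zeros(10)], axis=1)
strand2 = np.stack([3.8 * (9 - i), np.full(10, 5.0), np.zeros(10)], axis=1)
coords = np.concatenate([strand1, strand2])
frag, pocket = pseudo_pair(coords, start=2, length=3)
```

For this toy hairpin, the fragment is residues 2 to 4 and the pseudo-pocket consists of residues 14 to 18 on the opposite strand, which are spatially close but distant in sequence.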
After pretraining, the molecule and pocket encoders are further fine-tuned (Fig. 1D) using 40,000 experimentally determined protein-ligand complex structures collected from the BioLip2 database (30). Given that the binding conformations of molecules are unknown and only their topologies are provided in virtual screening, we implemented a random conformation sampling strategy for data augmentation, using RDKit (31) for conformation generation. This augmentation allows DrugCLIP to train on data that more accurately reflect the variability of real-world screenings, thereby enhancing the model's performance and generalization ability.
In the screening process (Fig. 1E), we first use our trained encoders to represent molecules and pockets as vectors. Cosine similarities between the pocket and molecule embeddings are then computed, and candidate molecules are ranked according to these similarity scores. Since the molecule representations can be computed offline, DrugCLIP screening is highly efficient, requiring only the calculation of a simple cosine similarity and subsequent ranking. Compared with traditional computational methods like docking, DrugCLIP is more than 100,000 times faster.
Fig. 1 (caption, continued) pocket and molecule encoders were updated using a contrastive loss, which maximizes the similarity between positive pairs and minimizes it between negative pairs. (E) The pipeline for virtual screening with DrugCLIP. The candidate molecules from the library were pre-encoded with the trained molecular encoder. For a given pocket, the trained pocket encoder converts it to a vector, and the cosine similarity is then utilized to select top ligands with the highest scores.
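The retrieval step of the screening process (cosine similarity followed by ranking) can be sketched as follows. This is a minimal illustration with random stand-in embeddings; the function names, the embedding dimension, and the library size are assumptions, and the real encoders are replaced by random vectors.

```python
import numpy as np

def normalize(x):
    # Scale vectors to unit length along the last axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def screen(pocket_vec, library_emb, top_k=5):
    """Rank pre-encoded library molecules by cosine similarity to one pocket."""
    scores = normalize(library_emb) @ normalize(pocket_vec)  # (N,) cosine similarities
    order = np.argsort(-scores)[:top_k]                      # indices of the best scores
    return order, scores[order]

rng = np.random.default_rng(0)
# Stand-in for molecule embeddings computed offline, once for the whole library.
library = rng.normal(size=(1000, 32))
# Stand-in pocket embedding placed close to molecule 42 for demonstration.
pocket = library[42] + 0.05 * rng.normal(size=32)
top_idx, top_scores = screen(pocket, library)
```

Because the library embeddings are computed once and reused for every pocket, adding a new target costs one pocket encoding plus one matrix-vector product, which is consistent with the efficiency argument made above.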
We also investigated the influence of homology information and protein structure accuracy on DrugCLIP's performance. Remarkably, even when test protein families were entirely excluded from the training set, DrugCLIP still outperformed AutoDock Vina, one of the most popular virtual screening methods (Fig. 2C), highlighting its strong generalization capability to new targets. Moreover, DrugCLIP outperformed AutoDock Vina even with a 3 Å RMSD error in the side-chain conformations of protein pockets (Fig. 2D), indicating its robustness to structural inaccuracies. Furthermore, DrugCLIP is exceptionally efficient (Fig. 2E), making it highly suitable for large-scale screening tasks. For instance, DrugCLIP completed the screening for LIT-PCBA in merely 38 seconds with the aid of pre-encoding acceleration, significantly faster than Glide docking (3 days), Uni-Dock (22 hours) (8), and another machine learning method, PLANET (3 hours) (11). Moreover, the time consumption of DrugCLIP screening scales linearly as the numbers of targets and molecules increase simultaneously (Fig. 2F), which can facilitate multi-target virtual screening.
These in silico results confirm that DrugCLIP possesses superior virtual screening capabilities, combining high performance, generalizability, robustness, and efficiency.
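The benchmark comparisons in Fig. 2 are reported as EF1%, the enrichment factor at the top 1% of the ranked library. The sketch below uses one common definition (the hit rate among the top-ranked fraction divided by the hit rate in the whole library); the function name and the toy data are assumptions, and the exact convention used in the cited benchmarks may differ in details such as rounding of the cutoff.

```python
import numpy as np

def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor at a given top fraction of a ranked library."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n_top = max(1, int(round(fraction * len(scores))))
    top = np.argsort(-scores)[:n_top]        # highest-scoring molecules
    hit_rate_top = is_active[top].mean()     # fraction of actives in the top set
    hit_rate_all = is_active.mean()          # fraction of actives overall
    return hit_rate_top / hit_rate_all

# Toy library: 1000 molecules, 10 actives, 5 of them ranked in the top 10.
scores = np.arange(1000, 0, -1)              # index 0 has the highest score
is_active = np.zeros(1000, dtype=bool)
is_active[[0, 1, 2, 3, 4, 500, 501, 502, 503, 504]] = True
ef1 = enrichment_factor(scores, is_active)
```

In this toy case the top 1% contains 5 actives out of 10 molecules against a background rate of 1%, giving an enrichment factor of 50; a random ranking would give a value near 1.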

Validating the performance of DrugCLIP with wet-lab experiments
In addition to in silico evaluation, we tested the DrugCLIP model on real-world targets using wet-lab experiments. We focused on two well-established targets for psychiatric diseases: the serotonin receptor 2A (5HT2AR) and the norepinephrine transporter (NET).
5HT2AR is an emerging target for antidepressant development. Psychedelics such as lysergic acid diethylamide (LSD) and psilocybin act as agonists of 5HT2AR. Despite their potent hallucinogenic effects, these psychedelics have demonstrated strong and long-lasting antidepressant effects in both rodent models and humans (39,40), making 5HT2AR a crucial target for the development of next-generation antidepressants.
Previous research suggests that the recruitment of β-arrestin2 following 5HT2AR activation is a key biochemical mechanism underlying these antidepressant effects (41,42).
Since the structure of 5HT2AR was included in our training data, we trained a customized DrugCLIP model specifically for 5HT2AR virtual screening. We excluded similar experimental structures (90% homology) from the training dataset to prevent potential data leakage. A library of 1,648,137 in-stock molecules from ChemDiv was screened, and 78 compounds from the top results were ordered, selected using criteria including drug-likeness, docking score, and chemical diversity.
First, a calcium flux assay was used for a primary screening of these compounds in an agonist model at a concentration of 10 μM. Eight of the 78 molecules were identified as positive agonists, exhibiting a minimum activity of 10% relative to serotonin. The affinities of these molecules to 5HT2AR were further assessed using [³H]-labeled ketanserin competitive binding assays, with six showing a Ki of less than 10 μM. We then evaluated the cellular function of these hit molecules using NanoBit assays for β-arrestin2 recruitment, and all six molecules achieved an EC50 of less than 1 μM.
Among the six validated agonists, two chemically novel molecules are particularly interesting. The molecule with the highest affinity, V008-4481, binds to the orthosteric (43), extended (44) and side-extended (41) pockets of 5HT2AR with its indole ring, phenylpiperazine part and linear tail, respectively. In previous studies, molecules that bind to all three regions have rarely been seen. This molecule achieves an affinity of 21.0 nM and exhibits an EC50 of 60.3 nM with an Emax of 35.8% in the NanoBit assay. Another molecule, V006-3328, features a unique branched topology uncommon among 5HT2AR binders. It has an affinity of 3.51 μM, with an EC50 of 599 nM and an Emax of 23.4% in the NanoBit assay. These two molecules present strong potential as starting points for the discovery of new series of 5HT2AR agonists.
Unlike 5HT2AR, the norepinephrine transporter (NET) is a well-established drug target for depression and attention deficit hyperactivity disorder (ADHD), with multiple FDA-approved inhibitors (45). However, the structures of NET, alone or in complex with its inhibitors, were not solved until 2024 (46-48). The closest protein structure in our dataset is the dopamine transporter from Drosophila (49), which shares less than 60% similarity with NET. Therefore, screening against NET provides a more challenging test of our model's ability to generalize to relatively new targets.
For this target, we directly utilized the DrugCLIP model trained with all collected complex structures to screen the ChemDiv in-stock library. We ultimately selected 104 compounds considering chemical novelty and diversity. We tested their inhibition of the NET protein by measuring the transport of [³H]-labeled norepinephrine in NET-containing liposomes. Among these compounds, 15% exhibited more than 60% inhibition of NET, with 12 compounds demonstrating greater potency than the widely used antidepressant bupropion.
Unlike previous NET inhibitors that typically feature aliphatic nitrogen atoms capable of forming a salt bridge interaction with ASP75 of NET (46-48), our screening identified several hits with positively charged aromatic nitrogen atoms. Notably, two such molecules, 0086-0043 and Y510-9709, demonstrated better IC50 values (1.14 μM and 0.31 μM, respectively) than bupropion (1.5 μM). Structural analysis of the complexes between these compounds and the NET protein revealed that the aromatic rings indeed form more favorable interactions with NET: the isoquinoline ring of 0086-0043 engages in a T-shaped π-π interaction with PHE72, and the thiazole ring of Y510-9709 likely interacts with surrounding aromatic side chains like PHE323. These findings highlight the potential of the DrugCLIP model to provide new chemical insights for drug discovery.
Finally, we introduced a genome-wide virtual screening pipeline to facilitate future drug discovery. We began by splitting AlphaFold predictions into high-confidence regions based on pLDDT and PAE scores. For each region, we used homology alignment and Fpocket (25) along with GenPack to detect potential pockets. The DrugCLIP model was then employed to screen over 500 million drug-like molecules from the ZINC (26,27) and Enamine REAL (28) databases. The screening process, which involved more than 10 trillion scoring operations on protein-ligand pairs, was completed in about 24 hours on a single computing node equipped with 8 A100 GPUs. The top-ranked molecules were then clustered and further evaluated using molecular docking, and poor poses were filtered out using a Glide score cutoff of -6 kcal/mol. The final database contains over 2 million potential hit molecules for more than 20,000 pockets from 10,000 human targets. All molecules, docking scores and poses have been made freely accessible at https://drug-the-whole-genome.yanyanlan.com, facilitating further research and drug discovery processes.
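The first step of the pipeline, splitting a predicted structure into high-confidence regions, can be illustrated with a per-residue pLDDT filter. This is a simplified sketch under stated assumptions: the `threshold` of 70, the `min_len` parameter, and the function name are hypothetical, and the PAE-based criterion mentioned in the text is not modelled here.

```python
def high_confidence_segments(plddt, threshold=70.0, min_len=3):
    """Return [start, end) index ranges of consecutive residues with pLDDT >= threshold.

    threshold and min_len are hypothetical values chosen for illustration.
    """
    segments, start = [], None
    for i, value in enumerate(plddt):
        if value >= threshold and start is None:
            start = i                        # a confident stretch begins
        elif value < threshold and start is not None:
            if i - start >= min_len:         # keep only sufficiently long stretches
                segments.append((start, i))
            start = None
    # Close a stretch that runs to the end of the chain.
    if start is not None and len(plddt) - start >= min_len:
        segments.append((start, len(plddt)))
    return segments

# Toy per-residue confidence values for a 12-residue chain.
plddt = [95, 92, 88, 40, 35, 91, 90, 60, 85, 86, 87, 89]
segments = high_confidence_segments(plddt)
```

For the toy values, residues 0 to 2 and 8 to 11 are kept as two regions, while the two-residue stretch at positions 5 and 6 is discarded as too short.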
Our genome-wide screening results cover a more extensive range of targets than ChEMBL (54), one of the most comprehensive databases for bioactive molecules.
While UniProt (1) contains 20,436 reviewed human proteins, the latest ChEMBL release (ChEMBL 34) covers 4,810 of them. Moreover, not all targets in the ChEMBL database have high-affinity small-molecule binders; some targets only have peptide or antibody binders, or merely vague results from low-quality assays. In contrast, our database spans 9,908 targets, more than twice the number in ChEMBL, and covers nearly half of the human genome. To visualize the difference between the two protein spaces, we encoded all protein sequences using the ESM1b model (55). The t-SNE plot shows that our space encompasses a broader range of proteins, including many that are not closely related to those in ChEMBL.
Our database includes a diverse range of targets, from well-studied proteins to less-explored members of well-known families, as well as proteins with limited pharmacological understanding. For example, the c-Jun N-terminal kinase 3 (JNK3) is a classical kinase target with many ligand-bound crystal structures (56,57). DrugCLIP identified molecules that bind to the ATP-binding pocket, forming H-bonds with backbone atoms of MET149 in the hinge region. SLC45A2 belongs to the solute carrier (SLC) superfamily, many of whose members are important drug targets. Nevertheless, SLC45A2 has been the subject of few pharmacological studies. This gene plays a crucial role in pigmentation (58) and is widely expressed in cutaneous melanomas (59), with evidence suggesting its oncogenic potential (60). All molecules in the database could bind near L374, which is an important site for protein stability (58), and thus may have modulatory effects. Another interesting example, OR6A2, belongs to the olfactory receptor family, whose members are mainly expressed in olfactory receptor neurons, yet many of them are also expressed in various other tissues with unexplored pharmaceutical potential (61). OR6A2 is expressed in macrophages, sensing blood octanal and promoting the formation of atherosclerotic plaques (62). Our predicted molecules fit the orthosteric pocket of OR6A2 and can serve as potential inhibitors for treating atherosclerosis. The final example, Sestrin-2, can sense leucine (63) and promote drug resistance of cancer cells (64); it belongs to a unique, highly conserved stress-inducible protein family (PF04636 or IPR006730) with only three members in the human genome. Our database contains predicted molecules that bind to the same pocket as leucine (65) and may serve as good starting points for anti-cancer therapies. These examples highlight the potential of our database as a valuable resource for exploring the undrugged genome and facilitating future drug discovery.

Conclusions and Discussions
With the rapid advancement of protein structure prediction methods and the availability of a comprehensive atlas of predicted protein structures for human and disease-related species (23,66), we have entered a new era where effective drug discovery for all disease-related targets is within reach. In this paper, we introduce DrugCLIP, a contrastive learning-based virtual screening approach that aims to achieve genome-wide drug discovery. The efficacy of DrugCLIP has been validated through both in silico benchmarks and wet-lab experiments. In well-established benchmarks, DrugCLIP consistently outperformed traditional docking software and contemporary machine learning models. Notably, for the 5HT2AR and NET targets, DrugCLIP identified diverse high-affinity binders and novel chemical entities. These findings underscore the potential of the DrugCLIP model as a reliable tool for virtual screening in real-world drug development. We demonstrate its application through a genome-wide virtual screening campaign, encompassing more than 20,000 pockets across approximately 10,000 human proteins, using a chemical library of 500 million molecules from ZINC and Enamine REAL. Remarkably, DrugCLIP completes this trillion-level virtual screening campaign in just 24 hours using a single computational node with 8 GPU accelerators. Beyond the screening results, we have generated over 2 million high-confidence protein-ligand complex structures accompanied by their docking scores. By making this extensive database freely accessible, we aim to make a substantial contribution to the research community, accelerating drug discovery and fostering innovation in therapeutic development.

Fig. 1
Fig. 1 The framework of DrugCLIP. (A) In the pretraining stage, a large-scale synthetic dataset was created using the ProFSA strategy. Specifically, pseudo pocket-ligand pairs were constructed through a series of operations, including fragment segmentation, terminal correction, neighbor removal and pocket detection, on protein data. (B) The pocket encoder is pretrained with pseudo pocket-ligand pairs in a contrastive distillation manner to transfer knowledge from a well-established molecular encoder to the pocket encoder. (C) During the fine-tuning process, experimentally determined protein-ligand pairs were used as training data, with multiple ligand conformations generated by RDKit. (D) In the fine-tuning stage, both the

Fig. 2
Fig. 2 In silico benchmarking results of DrugCLIP. (A) The evaluation of DrugCLIP on the DUD-E dataset using the EF1% to assess model performance. The results of baseline models are taken from previous studies (11, 37). (B) The evaluation of DrugCLIP on the LIT-PCBA dataset, also using the EF1% for performance measurement. The results of baseline models are taken from previous studies (11, 21, 36, 38). (C) The assessment of DrugCLIP's generalization ability was conducted by varying the identity cutoffs between testing targets and training data in DUD-E, with Glide-SP and Vina represented as dashed lines. (D) The evaluation of DrugCLIP's robustness to errors in pocket side-chain conformations was conducted using RMSD values ranging from 0 Å to 3 Å, with Vina shown as a dashed line for reference. (E) The screening speed on the LIT-PCBA dataset, compared with docking methods like Glide-SP and Uni-Dock, and the machine learning model PLANET. Speeds of baseline methods are taken from previous studies (8, 11). (F) An illustration of time consumption as the screening scale increases, with the x-axis representing the size of the molecule library and the size of the circles representing the numbers of targets. DrugCLIP (the orange line) has a computational complexity of O(M+N), where M is the number of targets and N is the number of compounds, whereas most existing methods (the green line) have a complexity of O(MN).

Fig. 3
Fig. 3 Wet-lab validations of DrugCLIP. (A) The screening results of 78 DrugCLIP-identified molecules using calcium flux assays for 5HT2AR agonists at a concentration of 10 μM. Eight molecules showed signals larger than 10%. (B) The 7WC8 structure was used as the receptor for docking analysis. (C) The docking pose of V006-3328 to 5HT2AR. (D) The dose-response curve of V006-3328 in the NanoBit assay measuring the β-arrestin2 recruitment. (E) The docking pose of V008-4481 to 5HT2AR. (F) The dose-response curve of V008-4481 in the NanoBit assay measuring the β-arrestin2 recruitment. (G) The evaluation of 104 DrugCLIP-identified molecules with radioligand transportation assays for NET inhibitors at a concentration of 10 μM; 16 molecules showed inhibition larger than 60%. (H) The complex structure of 0086-0043 and NET determined with Cryo-EM. (I) The dose-response curve of 0086-0043 in the radioligand transportation assay. (J) The complex structure of Y510-9709 and NET determined with Cryo-EM. (K) The dose-response curve of Y510-9709 in the radioligand transportation assay.

Fig. 4
Fig. 4 DrugCLIP enables genome-wide virtual screening. (A) The GenPack (Generation-Packing) process for extracting pockets from AlphaFold2-predicted structures involves using Fpocket to detect initial pockets, removing sidechains, applying an AI-generative model to create molecules based on the backbone structure, and then performing sidechain packing with the generated molecules. (B) The redocking RMSD comparisons for different pocket definitions: holo-pocket, pockets detected by Fpocket on AF2-predicted structures, and pockets generated by GenPack. The red dashed line indicates the RMSD threshold of 2 Å, and corresponding docking success rates are labeled above each column. (C) The EF1% comparisons for virtual screening on the DUD-E dataset using different pocket definitions: