## Abstract

In computer-aided drug discovery (CADD), virtual screening (VS) is used to identify, within a large library of compounds, the drug candidates that are most likely to bind to a molecular target. Most VS methods to date have focused on using canonical compound representations (e.g., SMILES strings, Morgan fingerprints) or generating alternative fingerprints of the compounds by training progressively more complex variational autoencoders (VAEs) and graph neural networks (GNNs). Although VAEs and GNNs led to significant improvements in VS performance, these methods suffer from reduced performance when scaling to large virtual compound datasets, and their performance has shown only incremental improvements in the past few years. To address this problem, we developed a novel method using multiparameter persistence (MP) homology that produces topological fingerprints of the compounds as multidimensional vectors. Our primary contribution is framing the VS process as a new topology-based graph ranking problem by partitioning a compound into chemical substructures informed by the periodic properties of its atoms and extracting their persistent homology features at multiple resolution levels. We show that the margin loss fine-tuning of pretrained Triplet networks attains highly competitive results in differentiating between compounds in the embedding space and ranking their likelihood of becoming effective drug candidates. We further establish theoretical guarantees for the stability properties of our proposed MP signatures, and demonstrate that our models, enhanced by the MP signatures, outperform state-of-the-art methods on benchmark datasets by a wide and highly statistically significant margin (e.g., 93% gain on the Cleves-Jain and 54% gain on the DUD-E Diverse dataset).

## 1 Introduction

Drug discovery is the early phase of the pharmaceutical R&D pipeline where machine learning (ML) is making a paradigm-shifting impact [31, 91]. Traditionally, early phases of biomedical research involve the identification of targets for a disease of interest, followed by high-throughput screening (HTS) experiments to determine hits within the synthesized compound library, i.e., compounds with high potential. Then, these compounds are optimized to increase potency and other desired target properties. In the final phases of the R&D pipeline, drug candidates have to pass a series of rigorous controlled tests in clinical trials to be considered for regulatory approval. On average, this process takes 10-15 years end-to-end and costs in excess of 2 billion US dollars [10]. HTS is highly time and cost-intensive. Therefore, it is critical to effectively identify promising compounds for the HTS step, not only for novel compound discovery but also to speed up the pipeline and make it more cost-effective. To address this need, ML-augmented virtual screening (VS) has emerged as a powerful computational approach to screen ultra-large libraries of compounds to find the ones with desired properties and prioritize them for experimentation [65, 40].

In this paper, we develop novel approaches for virtual screening by successfully integrating topological data analysis (TDA) methods with ML and deep learning (DL) tools. We first produce topological fingerprints of compounds as 2*D* or 3*D* vectors by using TDA tools, i.e., multidimensional persistent homology. Then, we show that Triplet networks (where state-of-the-art pretrained transformer-based models and modernized convolutional neural network architectures serve as the backbone, and distinct topological features represent support and query compounds) successfully identify the compounds with the desired properties. We also demonstrate that the applicability of topological feature maps can be successfully generalized to traditional ML algorithms such as random forests.

The distinct advantage of TDA tools, in particular persistent homology (PH), is that they enable effective integration of domain information, such as atomic mass, partial charge, bond type (single, double, triple, aromatic ring), ionization energy or electron affinity, which carries vital information regarding the chemical properties of a compound at multiple resolution levels during the graph filtration step. While common PH theory allows only one such domain function to be used in this process, with our novel multipersistence approach, we show it is possible to use more than one domain function. Topological fingerprints can thus carry much finer chemical information of the compound structure, informed by the multiple domain functions embedded in the process. Specifically, multiparameter persistence homology decomposes a 2*D* graph structure of a compound into a series of subgraphs using domain functions and generates hierarchical topological representations at multiple resolution levels. At each resolution stage, our algorithm sequentially generates finer topological fingerprints of the chemical substructures. We feed these topological fingerprints to suitable ML/DL methods, and our ToDD models achieve state-of-the-art performance on all benchmark datasets across all targets (see Tables 1 and 2).

**The key contributions of this paper are**:

We develop a transformative approach to generate molecular fingerprints. Using multipersistence, we produce highly expressive and unique topological fingerprints for compounds independent of scale and complexity. This offers a new way to describe and search chemical space relevant to both drug discovery and development.

We bring a new perspective to multiparameter persistence in TDA and produce a computationally efficient multidimensional fingerprint of chemical data that can successfully incorporate more than one domain function to the PH process. These MP fingerprints harness the computational strength of linear representations and are suitable to be integrated into a broad range of ML, DL, and statistical methods; and open a path for computationally efficient extraction of latent topological information.

We prove that our multidimensional persistence fingerprints enjoy the same important stability guarantees as those exhibited by most currently existing single persistence summaries.

We perform extensive numerical experiments in VS, showing that our ToDD models outperform all state-of-the-art methods by a wide margin (See Figure 1).

## 2 Related Work

### 2.1 Virtual Screening

A key step in the early stages of the drug discovery process is to find active compounds that will be further optimized into potential drug candidates. One prevalent computational method that is widely used for compound prioritization with desired properties is *virtual screening* (VS). There are two major categories, i.e., structure-based virtual screening (SBVS) and ligand-based virtual screening (LBVS) [20]. SBVS uses the 3*D* structural information of both ligand (compound) and target protein as a complex [12, 52, 86]. SBVS methods generally require a good understanding of 3*D*-structure of the target protein to explore the different poses of a compound in a binding pocket of the target. This makes the process computationally expensive. On the other hand, LBVS methods compare structural similarities of a library of compounds with a known active ligand [79, 69] with an underlying assumption that similar compounds are prone to exhibit similar biological activity. Unlike SBVS, LBVS only uses ligand information. The main idea is to produce effective fingerprints of the compounds and use ML tools to find similarities. Therefore, computationally less expensive LBVS methods can be more efficient with larger chemical datasets especially when the structure of the target receptor is not known [56].

In the last 3 decades, various LBVS methods have been developed with different approaches, and these can be categorized into 3 classes depending on the fingerprint they produce. SMILES [81] and SMARTS [29] are examples of 1*D*-methods, which produce 1*D*-fingerprints compressing compound information into a vector. RASCAL [78], MOLPRINT2*D* [9], ECFP [80], CDK-graph [97], CDK-hybridization [85], SWISS [103], Klekota-Roth [53], MACSS [29], E-state [36] and SIMCOMP [37] are among the 2*D*-methods, which use 2*D*-structure fingerprints and graph matching. Finally, examples of 3*D*-methods are ROCS [38], USR [8] and PatchSurfer [42], which use the 3*D*-structure of compounds and their conformations (3*D*-positions of the compound) [84]. While ML methods have been actively used in the field for the last two decades, new deep learning methods have made a huge impact on the drug discovery process in the last 5 years [88, 52, 82]. Further discussion of state-of-the-art ML/DL methods is given in Section 6, where we compare our models and benchmark against them.

### 2.2 Topological Data Analysis

TDA and tools of persistent homology (PH) have recently emerged as powerful approaches for ML, allowing us to extract complementary information on the observed objects, especially from graph-structured data. In particular, PH has become popular for various ML tasks such as clustering, classification, and anomaly detection, with a wide range of applications including material science [68, 43], insurance [99, 46], finance [55], and cryptocurrency analytics [33, 4, 73]. (For more details, see the surveys [6, 22] and the TDA applications library [34].) Furthermore, integrating PH methods into geometric deep learning (GDL) has become a highly active research area in recent years [41, 100, 19, 23]. Most recently, the emerging concepts of *multipersistence* (MP) have been proposed to advance the success of single parameter persistence (SP) by allowing the use of more than one domain function in the process to produce more granular topological descriptors of the data. However, MP theory is not sufficiently mature, as it suffers from the nonexistence of a barcode decomposition due to the partially ordered structure of the index set {(*α*_{i}, *β*_{j})} [57, 89]. Existing approaches remedy this issue via a slicing technique that studies one-dimensional fibers of the multiparameter domain [18], but choosing these directions suitably and computing the restricted SP vectorizations are computationally costly, which makes the approach inefficient in real-life applications. There are several promising recent studies in this direction [11, 93, 24], but these approaches fail to provide a practical topological summary, such as a “multipersistence diagram”, or an effective MP vectorization to be used in real-life applications.

### 2.3 TDA in Virtual Screening

In [16, 15, 14], the authors obtained successful results by integrating single persistent homology outputs with various ML models. Furthermore, in [50], the authors used multipersistence homology with a fibered barcode approach in the 3*D* setting and obtained promising results. In the past few years, TDA tools were also successfully combined with various deep learning models for SBVS and property prediction [71, 72]. In [66, 45, 61, 95, 62], the authors successfully used TDA methods to generate powerful molecular descriptors; using these descriptors, they substantially boosted the performance of various ML/DL models and outperformed the existing models on several benchmark datasets. For a discussion and comparison of TDA techniques with other approaches in virtual screening and property prediction, see the review article [70]. In this paper, we follow a different approach and propose a framework adapting multipersistence homology to the VS process, producing fine topological fingerprints that are highly suitable for ML/DL methods.

## 3 Background

We first provide the necessary TDA background for our machinery. While our techniques are applicable to various forms of data, e.g., point clouds and images (for details, see Section B.2), here we focus on the graph setup in detail with the idea of mapping the atoms and bonds that make up a compound into a set of nodes and edges that represent an undirected graph.

### 3.1 Persistent Homology

Persistent homology (PH) is a key approach in TDA, allowing us to extract the evolution of subtler patterns in the data shape dynamics at multiple resolution scales which are not accessible with more conventional, non-topological methods [17]. In this part, we go over the basics of PH machinery on graph-structured data. For further background on PH, see Appendix A.1 and [27, 30].

For a given graph *G*, consider a nested sequence of subgraphs *G*_{1} ⊆ *G*_{2} ⊆ … ⊆ *G*_{N} = *G*. For each *G*_{i}, define an abstract simplicial complex *Ĝ*_{i}, 1 ≤ *i* ≤ *N*, yielding a *filtration*, a nested sequence of simplicial complexes *Ĝ*_{1} ⊆ … ⊆ *Ĝ*_{N}. This step is crucial in the process, as one can inject domain information into the machinery exactly at this step by using a filtering function from the domain, e.g., atomic mass, partial charge, bond type, electron affinity, ionization energy (see Appendix A.1). After obtaining a filtration, one can systematically keep track of the evolution of topological patterns in the sequence of simplicial complexes {*Ĝ*_{i}}. A *k*-dimensional topological feature (or *k*-hole) may represent connected components (0-dimension), loops (1-dimension) and cavities (2-dimension). For each *k*-dimensional topological feature *σ*, PH records its first appearance in the filtration sequence, say *Ĝ*_{b_{σ}}, and its first disappearance in a later complex, *Ĝ*_{d_{σ}}, as a unique pair (*b*_{σ}, *d*_{σ}), where 1 ≤ *b*_{σ} < *d*_{σ} ≤ *N*. We call *b*_{σ} *the birth time* of *σ* and *d*_{σ} *the death time* of *σ*. We call *d*_{σ} − *b*_{σ} *the life span* (or persistence) of *σ*. PH records all these birth and death times of the topological features in *persistence diagrams*. Let 0 ≤ *k* ≤ *D*, where *D* is the highest dimension in the simplicial complex *Ĝ*_{N}. Then the *k*^{th} persistence diagram is PD_{k}(*G*) = {(*b*_{σ}, *d*_{σ}) | *σ* ∈ *H*_{k}(*Ĝ*_{i}) for *b*_{σ} ≤ *i* < *d*_{σ}}. Here, *H*_{k}(*Ĝ*_{i}) represents the *k*^{th} *homology group* of *Ĝ*_{i}, which keeps the information of the *k*-holes in the simplicial complex *Ĝ*_{i}. The most common dimensions used in practice are 0 and 1, i.e., PD_{0}(*G*) and PD_{1}(*G*). To ease notation, we henceforth omit the dimension subscript *k*. With the intuition that the topological features with long life spans (persistent features) describe the hidden shape patterns in the data, these persistence diagrams provide a unique topological fingerprint of *G*. We give further details of the PH machinery and how to integrate domain information into the process in Appendix A.1.
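As a concrete illustration of the machinery above, the 0-dimensional persistence diagram of a sublevel vertex filtration can be computed directly with a union-find structure. The following is a minimal plain-Python sketch (function names and the toy graph are ours, not from the paper; production pipelines would rely on a TDA library such as GUDHI):

```python
# Minimal sketch: 0-dimensional persistence (connected components) of a
# sublevel vertex filtration on a graph, computed with union-find.
# A component is born at the value of its earliest vertex and dies (elder
# rule) when an edge merges it into an older component.

def sublevel_pd0(vertex_values, edges):
    """vertex_values: dict vertex -> filtration value (e.g., atomic mass).
    edges: list of (u, v) pairs. An edge enters the filtration at the max
    value of its endpoints. Returns (birth, death) pairs; death=None marks
    components that never die (infinite bars)."""
    parent = {v: v for v in vertex_values}
    birth = dict(vertex_values)  # birth time of each component's root

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    pairs = []
    for u, v in sorted(edges,
                       key=lambda e: max(vertex_values[e[0]], vertex_values[e[1]])):
        t = max(vertex_values[u], vertex_values[v])
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge closes a loop; no 0-dim event
        if birth[ru] > birth[rv]:
            ru, rv = rv, ru  # ensure ru is the older component
        pairs.append((birth[rv], t))  # younger component dies at time t
        parent[rv] = ru
    roots = {find(v) for v in vertex_values}
    pairs.extend((birth[r], None) for r in roots)
    return sorted(pairs, key=lambda p: (p[0], float('inf') if p[1] is None else p[1]))
```

For a path graph a–b–c with values 1, 3, 2, the vertex born at 3 merges immediately, the component born at 2 dies when the second edge appears, and one component persists forever.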

### 3.2 Multidimensional Persistence

MultiPersistence (MP) significantly boosts the performance of the single parameter persistence technique described in Appendix A.1. The reason for the term “single” is that we filter the data in only one direction *G*_{1} ⊆ … ⊆ *G*_{N} = *G*. As explained in Appendix A.1, the construction of the filtration is the key step for injecting domain information into the process and finding the hidden patterns of the data. If one uses a function *f* : *V* → ℝ that carries valuable domain information, then it induces a single parameter filtration as above. However, many datasets come with more than one natural domain function, and using them simultaneously would give a much better understanding of the hidden patterns. For example, if we have two functions *f*, *g* : *V* → ℝ (e.g., atomic mass and partial charge) with valuable complementary information about the network (compound), the MP approach is designed to produce a unique topological fingerprint combining the information from both functions. This pair of functions induces a multivariate filtering function *F* : *V* → ℝ² with *F*(*v*) = (*f*(*v*), *g*(*v*)). Again, one can define sets of nondecreasing thresholds {*α*_{i}}_{i=1}^{m} and {*β*_{j}}_{j=1}^{n} for *f* and *g*, respectively. Let *V*_{ij} = {*v* ∈ *V* | *f*(*v*) ≤ *α*_{i}, *g*(*v*) ≤ *β*_{j}}, i.e., *V*_{ij} = *F*^{−1}((−∞, *α*_{i}] × (−∞, *β*_{j}]). Define *G*_{ij} to be the induced subgraph of *G* by *V*_{ij}, i.e., the smallest subgraph of *G* generated by *V*_{ij}. Then, instead of a single filtration of complexes {*Ĝ*_{i}}, we get a *bifiltration* of complexes {*Ĝ*_{ij} | 1 ≤ *i* ≤ *m*, 1 ≤ *j* ≤ *n*}, which is an *m* × *n* rectangular grid of simplicial complexes. Again, the MP idea is to keep track of the *k*-dimensional topological features in this grid by using the corresponding homology groups {*H*_{k}(*Ĝ*_{ij})} (the MP module).

As noted in Section 2, because of the technical problems related to the partially ordered structure of the MP module, MP theory does not yet have sound definitions of key notions (e.g., the birth/death time of a topological feature in the MP grid), and there is no effective way to facilitate this promising idea in real-life applications. In the following, we overcome this problem and produce highly effective fingerprints by utilizing the *slicing* idea in the MP grid in a structured way.

## 4 New Topological Fingerprints of the Compounds with Multipersistence

The ToDD framework produces fingerprints of compounds as multidimensional vectors by expanding single persistence (SP) fingerprints (Appendix A.1). While our construction is applicable and suitable for various forms of data, here we focus on graphs, and in particular, compounds for virtual screening. We obtain a 2*D* matrix (or 3*D* array) for each compound as its fingerprint, employing 2 or 3 functions/weights (e.g., atomic mass, partial charge, bond type, electron affinity, ionization energy) to perform graph filtration. We explain how to generalize our framework to other types of data in Appendix B.2. In Appendix B.4, we construct explicit examples of MP Fingerprints for the most popular SP vectorizations, e.g., Betti, Silhouette, Landscapes.

Our framework basically expands a given SP vectorization to a multidimensional vector by utilizing the MP approach. In technical terms, starting from an existing SP vectorization, we produce multidimensional vectors by effectively using one of the filtering directions as the *slicing direction* in the multipersistence module. We explain our process in three steps.

*Step 1 - Bifiltrations*: This step basically corresponds to obtaining relevant *substructures* from the given compound in an organized way. Here, we give the most computationally feasible method, called *sublevel bifiltration*, with 2 functions. Depending on the task and dataset, other filtration types or more functions/weights can be more useful. In Section B.5, we give details for the other filtration methods we use in our experiments, i.e., Vietoris-Rips (distance) and weight filtrations.

Let *f*, *g* : *V* → ℝ be two filtering functions with threshold sets {*α*_{i}}_{i=1}^{m} and {*β*_{j}}_{j=1}^{n}, respectively (e.g., *f* is atomic mass, and *g* is partial charge). Let *V*_{i} = {*v* ∈ *V* | *f*(*v*) ≤ *α*_{i}} and let *G*_{i} be the induced subgraph of *G* by *V*_{i}, i.e., add any edge in *G* whose endpoints are both in *V*_{i}. Similarly, let *V*_{ij} = {*v* ∈ *V*_{i} | *g*(*v*) ≤ *β*_{j}} and let *G*_{ij} be the induced subgraph of *G*_{i} by *V*_{ij}. Then, define *Ĝ*_{ij} as *the clique complex* of *G*_{ij} (see Section A.1). In particular, by using the first function (*f*), we filter *G* in one (say vertical) direction *G*_{1} ⊆ *G*_{2} ⊆ … ⊆ *G*_{m} = *G*. Then, by using the second function (*g*), we filter each *G*_{i} in the horizontal direction and obtain a bifiltration {*Ĝ*_{ij}}. These subgraphs represent the induced substructures of the compound obtained via the filtering functions *f* and *g*.
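The grid of induced subgraphs in Step 1 can be sketched as a short routine; all names here are illustrative (in practice, the vertex functions *f* and *g* would be atomic properties computed, e.g., with RDKit), and the complex construction is omitted:

```python
# Sketch of Step 1 (sublevel bifiltration): build the m x n grid of induced
# subgraphs G_ij from two vertex functions f and g and their threshold lists.

def bifiltration(vertices, edges, f, g, alphas, betas):
    """f, g: dicts vertex -> value; alphas, betas: nondecreasing thresholds.
    Returns grid[(i, j)] = (V_ij, E_ij), where V_ij contains the vertices with
    f(v) <= alphas[i] and g(v) <= betas[j], and E_ij is the induced edge set."""
    grid = {}
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            vij = {v for v in vertices if f[v] <= a and g[v] <= b}
            eij = [(u, w) for (u, w) in edges if u in vij and w in vij]
            grid[(i, j)] = (vij, eij)
    return grid
```

By construction, the vertex sets are nested along both grid directions, which is exactly the bifiltration property the persistence computation relies on.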

In Figures 2 and 3, we give an example of the sublevel bifiltration of the compound cytosine by atomic number and partial charge functions. In Figure 2, atom types are coded by their color, with atomic numbers given in parentheses: White=Hydrogen (1), Gray=Carbon (6), Blue=Nitrogen (7), and Red=Oxygen (8). The decimal numbers next to atoms represent their partial charges.

*Step 2 - Persistence Diagrams*: After constructing the bifiltration {*Ĝ*_{ij}}, the second step is to obtain the persistence diagram of each row. By restricting the bifiltration to a single row *i*_{0}, for each 1 ≤ *i*_{0} ≤ *m*, one obtains a single filtration *Ĝ*_{i_{0}1} ⊆ *Ĝ*_{i_{0}2} ⊆ … ⊆ *Ĝ*_{i_{0}n} in the horizontal direction. This is called a *horizontal slice* in the bipersistence module. Each such single filtration induces a persistence diagram PD(*G*_{i_{0}}). This produces *m* persistence diagrams {PD(*G*_{i})}_{i=1}^{m}. Notice that one can consider PD(*G*_{i}) as the single persistence diagram of the “substructure” *G*_{i} filtered by the second function *g* (see Section A.1).

*Step 3 - Vectorization*: The final step is to apply a vectorization to these *m* persistence diagrams. Let *φ* be a single persistence vectorization, e.g., Betti, Silhouette, Entropy, Persistence Landscape or Persistence Image. Specifically, we use Betti to ease computational complexity. By applying the chosen SP vectorization *φ* to each persistence diagram PD(*G*_{i}), we obtain a function *φ*_{i} = *φ*(PD(*G*_{i})), which in most cases is a single variable function on the threshold domain, i.e., *φ*_{i} : [1, *n*] → ℝ. The numbers of thresholds *m* and *n* are important, as they determine the size of our topological fingerprint. As most such vectorizations are induced from a discrete set of points {(*b*_{σ}, *d*_{σ})}, it is common to express them as a vector of the form [*φ*_{i}(1) *φ*_{i}(2) … *φ*_{i}(*n*)]. In the examples in Section B.4, we explain this conversion explicitly for different vectorizations. Hence, we obtain a vector of size 1 × *n* for each row 1 ≤ *i* ≤ *m*.
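For the Betti vectorization used in our experiments, Step 3 reduces to counting, at each threshold step, how many bars of the diagram are alive. A minimal sketch (names are ours):

```python
# Sketch of Step 3 with the Betti vectorization: the row vector records, at
# each integer threshold j in 1..n, the number of bars of the diagram that
# are alive, i.e., with birth <= j < death.

def betti_vector(pd_pairs, n):
    """pd_pairs: list of (birth, death) pairs; death=None marks an infinite
    bar, which counts as alive at every later threshold."""
    row = []
    for j in range(1, n + 1):
        alive = sum(1 for b, d in pd_pairs
                    if b <= j and (d is None or j < d))
        row.append(alive)
    return row
```

A diagram {(1, 3), (2, ∞)} over 4 thresholds gives one bar alive at step 1, two at step 2, and only the infinite bar afterwards.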

Now, we can define our topological fingerprint **M**_{φ}, which is a 2*D*-vector (a matrix) whose *i*^{th} row is the vector [*φ*_{i}(1) … *φ*_{i}(*n*)], i.e., **M**_{φ}(*i*, *j*) = *φ*_{i}(*j*). Hence, **M**_{φ} is a 2*D*-vector of size *m* × *n*. Each row is the vectorization of the persistence diagram PD(*G*_{i}) via the SP vectorization method *φ*. We use the first filtering function *f* to get a finer look at the graph, as it defines the subgraphs {*G*_{i}}. Then, by using the second function *g* on each *G*_{i}, we record the evolution of topological features in each *G*_{i} as *φ*_{i}. While this construction gives our 2*D* (matrix) fingerprints **M**_{φ}, one can also use 3 functions/weights for filtration and obtain a finer 3*D* (array) topological fingerprint (Section B.3).

In a way, we look at *G* with a 2*D* resolution (functions *f* and *g* as lenses) and keep track of the evolution of topological features in the induced substructures {*G*_{ij}}. The main advantage of this technique is that the outputs are fixed-size multidimensional vectors for each dataset, which are suitable for various ML/DL models.

### 4.1 Stability of the MP Fingerprints

We further show that when the source single parameter vectorization *φ* is stable, then so is its induced MP Fingerprint **M**_{φ}. (We give the details of stability notion in persistence theory and proof of the following theorem in Section B.1.)

*Let φ be a stable SP vectorization. Then, the induced MP Fingerprint* **M**_{φ} *is also stable, i.e., with the notation introduced in Section B.1, there exists a constant C*_{φ} > 0 *such that for any pair of graphs G*^{+} *and G*^{−}, *we have the following inequality:* D(**M**_{φ}(*G*^{+}), **M**_{φ}(*G*^{−})) ≤ *C*_{φ} · 𝔻({PD(*G*_{i}^{+})}, {PD(*G*_{i}^{−})}), *where* D *is the distance between MP Fingerprints and* 𝔻 *is the matching distance between the corresponding sets of persistence diagrams (Section B.1).*

## 5 Datasets

### Cleves-Jain

This is a relatively small dataset [26] that has 1149 compounds.^{*} There are 22 different drug targets, and for each one of them the dataset provides only 2-3 template active compounds dedicated for model training, which presents a few-shot learning task. Each target *q* is associated with 4 to 30 active compounds *L*_{q} dedicated for model testing. Additionally, the dataset contains 850 decoy compounds (*D*). The aim, for each target *q*, is to use the templates to find the actives *L*_{q} among the pool of actives combined with decoys, *L*_{q} ∪ *D*; the same decoy set *D* is used for all targets.

### DUD-E Diverse

The DUD-E (Directory of Useful Decoys, Enhanced) dataset [67] is a comprehensive ligand dataset with 102 targets and approximately 1.5 million compounds.^{*} The targets are categorized into 7 classes with respect to their protein type. The “Diverse subset” of DUD-E contains targets from each category to give a balanced benchmark dataset for VS methods. The Diverse subset contains 116,105 compounds from 8 target sets and 8 decoy sets; one decoy set is used per target.

More detailed information about each dataset can be found in Appendix C.1.

## 6 Experiments

### 6.1 Setup

#### Macro Design

We construct different ToDD (Topological Drug Discovery) models, namely ToDD-ViT, ToDD-ConvNeXt and ToDD-RF, to test the generalizability and scalability of topological features while employing different ML models and training datasets of various sizes. Many neural network architectural choices and ML models can be incorporated into our ToDD method. ToDD-ViT and ToDD-ConvNeXt are Triplet network architectures with Vision Transformer (ViT_b_16) [28] and ConvNeXt_tiny [63] models, pretrained on ILSVRC-2012 ImageNet, serving as the backbone of the Triplet network. MP signatures of compounds are upsampled to a resolution of 224^{2} via nearest-neighbour interpolation, followed by normalization. We only use GaussianBlur with kernel size 5^{2} and standard deviation 0.05 as a data augmentation technique. Transfer learning via fine-tuning ViT_b_16 and ConvNeXt_tiny (Adam optimizer with a learning rate of 5e-4, no warmup or layerwise learning rate decay, cosine annealing schedule for 5 epochs, stochastic weight averaging for 5 epochs, weight decay of 1e-4, and a batch size of 64 for 10 epochs in total) led to significantly better performance in Enrichment Factor and ROC-AUC scores compared to training from scratch. The performance of all models was assessed by 5-fold cross-validation (CV).

Due to structural isomerism, molecules with identical molecular formulae can have the same bonds, but the relative positions of the atoms differ [76]. ViT has much less inductive bias than CNNs, because locality and translation equivariance are embedded into each layer throughout the entire network in CNNs, whereas in ViT self-attention layers are global and only MLP layers are translationally equivariant and local [28]. Hence, ViT is more robust to distinct arrangements of atoms in space, also referred to as molecular conformation. On a small-scale dataset like Cleves-Jain, ViT exhibits impressive performance. However, the memory and computational costs of dot-product attention blocks of ViT grow quadratically with respect to the size of input, which limits its application on large-scale datasets [60, 83]. Another major caveat is that the number of triplets grows cubically with the size of the dataset. Since ConvNeXt depends on a fully-convolutional paradigm, its inherently efficient design is viable on large-scale datasets like DUD-E Diverse. As depicted in Figure 4, ToDD-ViT and ToDD-ConvNeXt project semantically similar MP signatures of compounds from data manifold onto metrically close embeddings using triplet margin loss with margin *α* = 1.0 and norm *p* = 2 as provided in Equation 1. Analogously, semantically different MP signatures are projected onto metrically distant embeddings.
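The triplet margin loss referenced above (margin *α* = 1.0, norm *p* = 2) reduces to a simple computation over embedding distances. The following plain-Python sketch is illustrative only; an actual implementation would typically use PyTorch's `torch.nn.TripletMarginLoss`:

```python
# Sketch of the triplet margin loss with p = 2 (Euclidean) distances:
# L(a, p, n) = max(d(a, p) - d(a, n) + margin, 0).
# Embeddings are plain lists of floats here; names are illustrative.

def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero when the negative is already margin-further than the positive;
    positive otherwise, pulling the positive closer / pushing the negative."""
    return max(l2(anchor, positive) - l2(anchor, negative) + margin, 0.0)
```

When the negative sits well outside the margin the loss vanishes, which is precisely why such "easy" triplets are excluded by the sampling strategy below.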

#### Sampling Strategy

Learning metric embeddings via triplet margin loss on large-scale datasets poses a special challenge: sampling all distinct triplets (*x*, *x*^{+}, *x*^{−}) and collecting them into a single database causes excessive overhead in computation time and memory. Let *P* be a set of compounds, *x*_{i} denote a compound that inhibits the drug target *i*, and *d*_{ij} = *d*(*x*_{i}, *x*_{j}) ∈ ℝ denote a pairwise distance measure which estimates how strongly *x*_{i} ∈ *P* is similar to *x*_{j} ∈ *P*. The distance metric can be chosen as the Euclidean distance, cosine similarity or dot-product between embedding vectors. We use the pairwise Euclidean distance computed by the pretrained networks in the implementation. Since triplets (*x*, *x*^{+}, *x*^{−}) with *d*(*x*, *x*^{−}) > *d*(*x*, *x*^{+}) + *α* already have negative queries sufficiently distant to the anchor compounds from the support set in the embedding space, they are not sampled to create the training dataset. We only sample triplets that satisfy *d*(*x*, *x*^{−}) < *d*(*x*, *x*^{+}) (where the negative query is closer to the anchor than the positive) and *d*(*x*, *x*^{+}) < *d*(*x*, *x*^{−}) < *d*(*x*, *x*^{+}) + *α* (where the negative query is more distant to the anchor than the positive, but the distance is less than the margin).
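The sampling rule above can be sketched as a simple filter over precomputed embedding distances; the distance table and triplet keys here are hypothetical:

```python
# Sketch of the triplet sampling rule: keep "hard" triplets, where
# d(a, n) < d(a, p), and "semi-hard" ones, where
# d(a, p) < d(a, n) < d(a, p) + margin; discard triplets that already
# satisfy the margin, since they contribute zero loss.

def select_triplets(triplets, dist, margin=1.0):
    """triplets: iterable of (anchor, positive, negative) keys.
    dist: dict mapping (x, y) pairs to precomputed embedding distances."""
    kept = []
    for a, p, n in triplets:
        d_ap, d_an = dist[(a, p)], dist[(a, n)]
        if d_an < d_ap or d_ap < d_an < d_ap + margin:
            kept.append((a, p, n))
    return kept
```

This keeps the training set small while retaining exactly the triplets that still produce a nonzero gradient.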

#### Enrichment Factor

(EF) is the most common performance evaluation metric for VS methods [90]. A VS method *φ* ranks compounds in the database by their similarity scores. We measure the similarity score using the inverse of the Euclidean distance between the embeddings of an anchor and a drug candidate. Let *N* be the total number of ligands in the dataset, *A*_{φ} be the number of true positives (i.e., correctly predicted active ligands) ranked among the top *α*% of all ligands (*N*_{α} = *N* · *α*%), and *N*_{actives} be the number of active ligands in the whole dataset. Then, *EF*_{α%} = (*A*_{φ}/*N*_{α}) / (*N*_{actives}/*N*). In other words, *EF*_{α%} can be interpreted as how much the VS method *φ* *enriches* the possibility of finding an active ligand in the first *α*% of all ligands relative to random guessing. This metric is also known as *precision at k* in the literature. With this definition, the maximum score for *EF*_{α%} is 100/*α*, i.e., 100 for *EF*_{1%} and 20 for *EF*_{5%}.
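Concretely, *EF*_{α%} can be computed from a ranked list of actives and decoys as follows (a minimal sketch; the helper name is ours):

```python
# Sketch of the Enrichment Factor: the active hit-rate among the top alpha%
# of the ranking, divided by the active rate over the whole dataset.

def enrichment_factor(ranked_is_active, alpha_pct):
    """ranked_is_active: list of booleans, best-ranked compound first.
    alpha_pct: percentage cutoff (e.g., 1 or 5)."""
    n = len(ranked_is_active)
    n_top = max(1, round(n * alpha_pct / 100))       # N_alpha
    hits = sum(ranked_is_active[:n_top])             # A_phi
    actives = sum(ranked_is_active)                  # N_actives
    return (hits / n_top) / (actives / n)
```

For example, with 100 compounds, 5 of them active and all 5 ranked at the top, *EF*_{5%} reaches its maximum of 20.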

### 6.2 Experimental Results

We compare our methods against 23 state-of-the-art baselines (see Appendix C.2). Relative gains are computed with respect to the next best performing model. Based on the results (mean and standard deviation of EF scores evaluated by CV) reported in Tables 1 and 2, we observe the following:

ToDD models consistently achieve the best performance on both the Cleves-Jain and DUD-E Diverse datasets across all targets and *EF*_{α%} levels. ToDD learns hierarchical topological representations of compounds using their atoms’ periodic properties, and captures the complex chemical properties essential for high-throughput VS. These strong hierarchical topological representations enable ToDD to become a model-agnostic method that is extensible to state-of-the-art neural networks as well as ensemble methods like random forests (RF).

For small-scale datasets such as Cleves-Jain, RF is less accurate than ViT despite regularization by bootstrapping and the use of pruned, shallow trees, because small variations in the data may generate significantly different decision trees. For large-scale datasets such as DUD-E Diverse, ToDD-RF and ToDD-ConvNeXt exhibit comparable performance except on three targets: CP3A4, GCR and HIVRT. We conclude that transformer-based models are more robust than convolutional models and RF classifiers, despite increased computation time.

### 6.3 Computational Complexity

The computational complexity (CC) of an MP Fingerprint depends on the vectorization *Ψ* used and the number *d* of filtering functions. The CC for a single persistence diagram *PD*_{k} is 𝒪(𝒩³) [74], where 𝒩 is the number of *k*-simplices. If *r* is the resolution size of the multipersistence grid, then the CC of an MP Fingerprint is 𝒪(*r*^{d−1} · (𝒩³ + *C*_{Ψ}(*m*))), where *C*_{Ψ}(*m*) is the CC for *Ψ* and *m* is the number of barcodes in *PD*_{k}; e.g., if *Ψ* is Persistence Landscape, then *C*_{Ψ}(*m*) = *m*² [13], and hence the CC for MP Landscape with three filtering functions (*d* = 3) is 𝒪(*r*² · (𝒩³ + *m*²)). On the other hand, for MP Betti summaries, one does not need to compute persistence diagrams, only the ranks of the homology groups in the MP module; hence the computational complexity of the MP Betti summary is much lower when minimal representations are used [58, 51]. To expedite the execution time, the feature extraction task is distributed across the 8 cores of an Intel Core i7 CPU (100GB RAM) using multiprocessing. See Appendix C.4 for an additional analysis of the computation time to extract MP fingerprints from the datasets. Furthermore, all ToDD models require substantially fewer computational resources during training compared to current graph-based models that encode a compound through mining common molecular fragments, a.k.a. motifs [47]. Training ToDD-ViT and ToDD-ConvNeXt for each individual drug target takes less than 1 hour on a single GPU (NVIDIA RTX 2080 Ti).

### 6.4 Ablation Study

We tested a number of ablations of our model to analyze the effect of its individual components and to further investigate the effectiveness of our topological fingerprints.

#### Multimodal Learning

We first address the question of how adding different domain information improves model performance. In Appendix C.3, we demonstrate one by one the importance of each modality (atomic mass, partial charge and bond type) used for graph filtration to the classification of each target. We find that their importance varies across targets in a unimodal setting, but the orthogonality of these information sources offers a significant gain in EF scores when the MP signatures learned from each modality are integrated into a joint multimodal representation. Tables 5, 6, 7 and 8 provide detailed results for the performance of each modality across all drug targets.

#### Morgan Fingerprints

We quantitatively analyze the explainability of our models’ success by replacing the topological fingerprints computed via multiparameter persistence with the most popular fingerprinting method: Morgan fingerprints. Our results in Appendix C.5 show that ToDD engineers features that represent the underlying attributes of compounds significantly better than the Morgan algorithm for identifying active compounds across all drug targets. We provide detailed tabulated results of our benchmarking study across all drug targets in Tables 10 and 11.

#### Network Architecture

We investigated ways to leverage deep metric learning by architecting *i*) a Siamese network trained with contrastive loss, *ii*) a Triplet network trained with triplet margin loss, and *iii*) a Triplet network trained with circle loss. Based on our preliminary experiments, the embeddings learned by *i* and *iii* provide sub-par results for compound classification, hence we use *ii*.

## 7 Conclusion

We have proposed the new idea of topological fingerprints in VS, allowing for deeper insights into the structural organization of chemical compounds. We have evaluated the predictive performance of our ToDD methodology for computer-aided drug discovery on benchmark datasets. Moreover, we have demonstrated that our topological descriptors are model-agnostic and highly competitive, yielding state-of-the-art results over all baselines. A future research direction is to enrich ToDD with different VS modalities and to use it on ultra-large virtual compound libraries. It is important to note that this new way of capturing the chemical information of compounds provides a transformative perspective on every level of the pharmaceutical pipeline, from the very early phases of drug discovery to the final stages of formulation in development.

## 8 Acknowledgments

YG, BC, YC and ISD were partially supported by Simons Collaboration Grant # 579977, the National Science Foundation (NSF) under award # ECCS 2039701, the Department of the Navy, Office of Naval Research (ONR) under ONR award # N00014-21-1-2530. Part of YG’s contribution is also based upon work supported by (while serving at) the NSF. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF and/or the ONR.

## Appendix

## A Topological Data Analysis (TDA)

## A.1 Single Parameter Persistent Homology

Here, we give further details on single parameter persistent homology. In summary, the PH machinery is a three-step process. The first step is the *filtration* step, where one can integrate the domain information into the process. The second step is the computation of *persistence diagrams*, where the machinery records the evolution of topological features (birth/death times) in the filtration, i.e., the sequence of simplicial complexes. The final step is *vectorization* (fingerprinting), where one converts these records to a function or vector to be used in suitable ML models.

## Constructing Filtrations

As PH is essentially machinery to keep track of the evolution of topological features in a sequence, the most important step is the construction of the sequence of simplicial complexes. This is the key step where one can inject valuable domain information into the PH process by using important domain functions (e.g., atomic mass, partial charge). While there are various filtration techniques used for PH machinery on graphs [5, 41], we will focus on the two most common methods: *sublevel/superlevel filtration* and *Vietoris-Rips (VR) filtration*.

For a given unweighted graph (compound) *G* = (*V*, *E*) with the set of nodes (atoms) *V* and the set of edges (bonds) *E*, the most common technique is to use a filtering function *f* : *V* → ℝ with a choice of thresholds *α*_{1} ≤ *α*_{2} ≤ · · · ≤ *α*_{m}. For each *α*_{i}, let *V*_{i} = {*v* ∈ *V* | *f*(*v*) ≤ *α*_{i}} (the sublevel sets for *f*). Here, in the VS problem, this filtering function *f* can be atomic mass, partial charge, bond type, electron affinity, ionization energy or another important function representing chemical properties of the atoms. One can also use natural graph-induced functions like node degree, betweenness, etc. Let *G*_{i} be the subgraph of *G* induced by *V*_{i}, i.e., *G*_{i} = (*V*_{i}, *E*_{i}) where *E*_{i} = {*e*_{rs} ∈ *E* | *v*_{r}, *v*_{s} ∈ *V*_{i}}. This process yields a nested sequence of subgraphs *G*_{1} ⊆ *G*_{2} ⊆ · · · ⊆ *G*_{m}. To obtain a filtration, the next step is to assign a simplicial complex to each subgraph *G*_{i}. One of the most common techniques is to use clique complexes [5]. The clique complex Δ_{i} is a simplicial complex obtained from *G*_{i} by assigning (filling with) a *k*-simplex to each complete (*k* + 1)-subgraph in *G*_{i}; e.g., a 3-clique (a complete 3-subgraph) in *G*_{i} will be filled with a 2-simplex (triangle). This technique is generally known as *sublevel filtration* with clique complexes. As the condition *f*(*v*_{i}) ≤ *α*_{i} in the construction gives sublevel filtration, one can similarly use the condition *f*(*v*_{i}) ≥ *α*_{i} to define *superlevel filtration*. Similarly, for a weighted graph (e.g., bond strength), sublevel filtration on edge weights provides a corresponding filtration reflecting the domain information stored in the edge weights [5].
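The sublevel filtration with clique complexes described above can be sketched as follows. This is a minimal illustration, assuming the graph is given as plain node/edge lists and the filtering function as a dictionary; only simplices up to dimension 2 (triangles) are built.

```python
# Sketch: sublevel filtration with clique complexes (up to 2-simplices).
from itertools import combinations

def sublevel_clique_filtration(nodes, edges, f, thresholds):
    """For each threshold a, build the clique complex of the subgraph
    induced by the sublevel set {v : f(v) <= a}."""
    complexes = []
    for a in thresholds:
        V = {v for v in nodes if f[v] <= a}              # sublevel vertex set
        E = {e for e in edges if e[0] in V and e[1] in V}  # induced edges
        adj = {v: set() for v in V}
        for u, w in E:
            adj[u].add(w)
            adj[w].add(u)
        # fill each 3-clique (complete 3-subgraph) with a 2-simplex
        T = {t for t in combinations(sorted(V), 3)
             if t[1] in adj[t[0]] and t[2] in adj[t[0]] and t[2] in adj[t[1]]}
        complexes.append((V, E, T))
    return complexes
```

For a toy compound with nodes 1–4, edges (1,2), (2,3), (1,3), (3,4) and *f*(*v*) = *v*, the triangle {1,2,3} first appears at threshold 3, matching the nested structure Δ_{1} ⊆ Δ_{2} ⊆ Δ_{3}.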

While sublevel/superlevel filtration with clique complexes is computationally cheaper and more common in practice, in this paper we essentially use a distance-based filtration technique called *Vietoris-Rips (VR) filtration*, where the pairwise distances between the nodes play a key role. This technique is computationally more expensive, but gives much finer information about the graph’s intrinsic properties [2]. For a given graph *G*, we define the distance *d*(*v*_{r}, *v*_{s}) = *d*_{rs}, where *d*_{rs} is the smallest number of edges required to get from *v*_{r} to *v*_{s} in *G*. Then, let Γ_{n} be the graph with the same vertex set as *G* and with an edge *e*_{rs} whenever *d*_{rs} ≤ *n*. In other words, we start with the nodes of *G*, and for any pair of vertices *v*_{r}, *v*_{s} with distance *d*_{rs} ≤ *n* in *G*, we add an edge *e*_{rs} to the graph Γ_{n}. Then, define the simplicial complex Δ_{n} as the clique complex of Γ_{n}. This defines a filtration Δ_{0} ⊂ Δ_{1} ⊂ · · · ⊂ Δ_{K}, where *K* = max *d*_{rs}, i.e., the distance between the two farthest nodes in the graph *G*. Hence, for *n* ≥ *K*, Δ_{n} = Δ_{K}, which is an (*m* − 1)-simplex, as Γ_{K} is the complete graph on the *m* nodes of *G*. In particular, in this setting we consider the vertex set as a point cloud where the distances between the points are induced from the graph *G*. In the graph setting, VR-filtration is also known as the *power filtration*, as the graph Γ_{n} is also called the *n*-th power of *G*.
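The power (VR) filtration construction above — BFS graph distances *d*_{rs}, then the edge sets of the graphs Γ_{n} — can be sketched as a small, self-contained routine (the clique-complex step is omitted to keep the sketch short):

```python
# Sketch: graph geodesic distances and the power filtration Gamma_n.
from collections import deque
from itertools import combinations

def graph_distances(nodes, adj):
    """BFS from each node: d_rs = smallest number of edges from r to s."""
    dist = {}
    for s in nodes:
        seen = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    q.append(v)
        for t, d in seen.items():
            dist[(s, t)] = d
    return dist

def power_filtration_edges(nodes, dist, n):
    """Edge set of Gamma_n: all pairs at graph distance <= n."""
    return {tuple(sorted(p)) for p in combinations(nodes, 2)
            if dist.get((p[0], p[1]), float("inf")) <= n}
```

On the path graph 1–2–3–4 we have *K* = 3, and Γ_{3} is the complete graph on four nodes, as the text describes.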

## Persistence Diagrams

The second step in the PH process is to obtain persistence diagrams (PDs) for the filtration Δ_{0} ⊂ Δ_{1} ⊂ … ⊂ Δ_{K}. As explained in Section 3.1, PDs are collections of 2-tuples (*b*_{σ}, *d*_{σ}) marking the birth and death times of the topological features *σ* appearing in the filtration, i.e., *σ* exists in Δ_{i} for *b*_{σ} ≤ *i* < *d*_{σ}. This step is fairly standard, and there are various software libraries for this task [74].

## Vectorizations (Fingerprinting)

While PH extracts hidden shape patterns from data as persistence diagrams (PDs), PDs, being collections of points in ℝ^{2}, are by themselves not very practical for statistical and ML purposes. Instead, the common techniques are to faithfully represent PDs as kernels [54] or vectorizations [39]. One can consider this step as converting PDs into a useful format to be used in the ML process as fingerprints of the dataset. This provides a practical way to use the outputs of PH in real-life applications. *Single persistence vectorizations* transform the obtained PH information (PDs) into a function or a feature vector form, which is much more suitable for ML tools than PDs. Common single persistence (SP) vectorization methods are Persistence Images [3], Persistence Landscapes [13], Silhouettes [21], Betti Curves and various Persistence Curves [25]. These vectorizations define single-variable or multivariable functions out of PDs, which can be used as fixed-size 1*D* or 2*D* vectors in applications, i.e., 1 × *n* vectors or *m* × *n* vectors. For example, a Betti curve for a PD with *n* thresholds can be expressed as a 1 × *n* vector. Similarly, Persistence Images are an example of 2*D* vectors with the chosen resolution (grid) size. See the examples given in Section B.4 for further details.
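As a tiny illustration of the vectorization step, a Betti curve simply counts, at each threshold, the features (*b*, *d*) that are alive. This sketch assumes a PD given as a plain list of (birth, death) pairs:

```python
# Sketch: Betti curve vectorization of a persistence diagram.
def betti_curve(pd_pairs, thresholds):
    """Entry i counts the features alive at threshold t_i,
    i.e. pairs (b, d) with b <= t_i < d."""
    return [sum(1 for b, d in pd_pairs if b <= t < d) for t in thresholds]
```

For the diagram {(0, 2), (1, 3)} with thresholds [0, 1, 2, 3], one feature is alive at *t* = 0, two at *t* = 1, one at *t* = 2 and none at *t* = 3, giving the 1 × 4 vector [1, 2, 1, 0].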

## B Multiparameter Persistence (MP) Fingerprints

## B.1 Stability of MP Fingerprints

## Stability of Single Persistence Vectorizations

A given PD vectorization *φ* can be considered as a map from the space of persistence diagrams to a space of functions, and stability intuitively represents the continuity of this operator. In other words, the stability question is whether a small perturbation in the PD causes a big change in the vectorization or not. To make this question meaningful, one needs to define what “small perturbation” means in this context, i.e., a metric on the space of persistence diagrams. The most common such metric is the *Wasserstein distance* (or matching distance), which is defined as follows.

Let *PD*(*X*) and *PD*(*Y*) be persistence diagrams of two datasets *X* and *Y* (we omit the dimensions in the PDs). Let *PD*(*X*)^{+} = *PD*(*X*) ∪ Δ^{+} and *PD*(*Y*)^{+} = *PD*(*Y*) ∪ Δ^{+}, where Δ^{+} represents the diagonal (representing trivial cycles) with infinite multiplicity. Here, (*b*_{j}, *d*_{j}) ∈ *PD*(*X*) represents the birth and death times of a topological feature *σ*_{j} in *X*. Let *ϕ* : *PD*(*X*)^{+} → *PD*(*Y*)^{+} represent a bijection (matching). The presence of the diagonal Δ^{+} on both sides ensures the existence of such bijections even if the cardinalities |*PD*(*X*)| and |*PD*(*Y*)| differ.

*Let PD*(*X*), *PD*(*Y*) *be persistence diagrams of the datasets X, Y, and let* **M** = {*ϕ*} *represent the space of matchings as described above. Then, the p-th Wasserstein distance* *W*_{p} *is defined as*

*W*_{p}(*PD*(*X*), *PD*(*Y*)) = min_{ϕ∈**M**} ( Σ_{x∈PD(X)^{+}} ‖*x* − *ϕ*(*x*)‖_{∞}^{p} )^{1/p}.
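For small diagrams, the matching in this definition can be found by brute force. The sketch below follows the definition directly: each diagram is augmented with the diagonal projections of the other’s points, and the cost is minimized over all bijections (feasible only for a handful of points; real libraries use optimal transport solvers instead).

```python
# Sketch: Wasserstein-p distance between tiny persistence diagrams,
# by brute-force minimization over all matchings.
from itertools import permutations

def _proj(p):
    """Nearest diagonal point to (b, d)."""
    m = (p[0] + p[1]) / 2.0
    return (m, m)

def wasserstein_p(D1, D2, p=2):
    # augment each diagram with diagonal projections of the other's points;
    # the boolean flag marks diagonal (trivial) points
    A = [(pt, False) for pt in D1] + [(_proj(pt), True) for pt in D2]
    B = [(pt, False) for pt in D2] + [(_proj(pt), True) for pt in D1]

    def cost(x, y):
        if x[1] and y[1]:          # diagonal-to-diagonal matches are free
            return 0.0
        return max(abs(x[0][0] - y[0][0]), abs(x[0][1] - y[0][1]))

    best = min(sum(cost(a, b) ** p for a, b in zip(A, perm))
               for perm in permutations(B))
    return best ** (1.0 / p)
```

For example, matching the single feature (0, 2) against an empty diagram sends it to its diagonal projection (1, 1), at ∞-norm cost 1.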

Now, we define the stability of vectorizations. A vectorization can be considered as an operator from the space of persistence diagrams **P** to a space of functions (or vectors) **Y**, e.g., *Ψ* : **P** → **Y**. In particular, when *Ψ* is the persistence landscape, **Y** is a function space, and when *Ψ* is the Betti summary, **Y** = ℝ^{m} (see the MP examples in Section B.4). Stability of a vectorization *Ψ* basically corresponds to the continuity of *Ψ* as an operator. Let d(·, ·) be a suitable metric on the space of vectorizations used. Then, we define the stability of *Ψ* as follows:

*Let Ψ* : **P** → **Y** *be a vectorization for single persistence diagrams. Let* *W*_{p} *and* d *be the metrics on* **P** *and* **Y** *respectively, as described above. Then, Ψ is called stable if*

d(*Ψ*(*PD*(*X*)), *Ψ*(*PD*(*Y*))) ≤ *C* · *W*_{p}(*PD*(*X*), *PD*(*Y*)).

Here, the constant *C* > 0 is independent of *X* and *Y*. This stability inequality can be interpreted as saying that changes in the vectorizations are bounded by changes in the PDs: two nearby persistence diagrams are represented by nearby vectorizations. If a given vectorization *φ* satisfies such a stability inequality for some d and *W*_{p}, we call *φ* a *stable vectorization* [7]. Persistence Landscapes [13], Persistence Images [3], Stabilized Betti Curves [48] and several Persistence Curves [25] are among the well-known examples of stable vectorizations.

Now, we are ready to prove the stability of the MP Fingerprints given in Section 4.1.

Let *G*_{1} and *G*_{2} be two graphs. Let *φ* be a stable SP vectorization with the stability equation

d(*φ*(*PD*(*G*_{1})), *φ*(*PD*(*G*_{2}))) ≤ *C*_{φ} · *W*_{p_{φ}}(*PD*(*G*_{1}), *PD*(*G*_{2}))   (2)

for some 1 ≤ *p*_{φ} ≤ ∞. Here, *φ*(*PD*(*G*_{i})) represents the corresponding vectorization for *G*_{i}, and *W*_{p_{φ}} represents the Wasserstein-*p* distance as defined in Section B.1.

Now, let *f* be a filtering function with threshold set {*α*_{1} ≤ · · · ≤ *α*_{m}}. Then, define the sublevel vertex sets *V*_{i} = {*v* ∈ *V* | *f*(*v*) ≤ *α*_{i}}. For each *V*_{i}, construct the induced VR-filtration as before. For each 1 ≤ *i*_{0} ≤ *m*, we will have the persistence diagram *PD*(*V*_{i_{0}}) of the filtration.

We define the induced matching distance between the collections of persistence diagrams as

**D**({*PD*(*V*_{i}(*G*_{1}))}, {*PD*(*V*_{i}(*G*_{2}))}) = Σ_{i=1}^{m} *W*_{p_{φ}}(*PD*(*V*_{i}(*G*_{1})), *PD*(*V*_{i}(*G*_{2}))).   (3)

Now, we define the distance between the induced MP Fingerprints as

**D**(**M**_{φ}(*G*_{1}), **M**_{φ}(*G*_{2})) = Σ_{i=1}^{m} d(*φ*(*PD*(*V*_{i}(*G*_{1}))), *φ*(*PD*(*V*_{i}(*G*_{2})))).   (4)

*Let φ be a stable SP vectorization. Then, the induced MP Fingerprint* **M**_{φ} *is also stable, i.e., with the notation above, there exists* *Ĉ*_{φ} > 0 *such that for any pair of graphs* *G*_{1} *and* *G*_{2}, *we have the following inequality:*

**D**(**M**_{φ}(*G*_{1}), **M**_{φ}(*G*_{2})) ≤ *Ĉ*_{φ} · **D**({*PD*(*V*_{i}(*G*_{1}))}, {*PD*(*V*_{i}(*G*_{2}))}).

*Proof:* As *φ* is a stable SP vectorization, by Equation 2, for any 1 ≤ *i* ≤ *m* we have d(*φ*(*PD*(*V*_{i}(*G*_{1}))), *φ*(*PD*(*V*_{i}(*G*_{2})))) ≤ *C*_{φ} · *W*_{p_{φ}}(*PD*(*V*_{i}(*G*_{1})), *PD*(*V*_{i}(*G*_{2}))) for some *C*_{φ} > 0, where *W*_{p_{φ}} is the Wasserstein-*p* distance. Notice that the constant *C*_{φ} > 0 is independent of *i*. Hence,

**D**(**M**_{φ}(*G*_{1}), **M**_{φ}(*G*_{2})) = Σ_{i=1}^{m} d(*φ*(*PD*(*V*_{i}(*G*_{1}))), *φ*(*PD*(*V*_{i}(*G*_{2})))) ≤ Σ_{i=1}^{m} *C*_{φ} · *W*_{p_{φ}}(*PD*(*V*_{i}(*G*_{1})), *PD*(*V*_{i}(*G*_{2}))) = *C*_{φ} · **D**({*PD*(*V*_{i}(*G*_{1}))}, {*PD*(*V*_{i}(*G*_{2}))}),

where the first and last equalities are due to Equations 4 and 3 respectively, while the inequality follows from Equation 2, which holds for any *i*. This concludes the proof of the theorem. □

## B.2 MP Fingerprint for Other Types of Data

So far, to keep the exposition focused on the VS setting, we have described our construction only in the graph setup. However, our framework is suitable for various types of data. Let *X* be an image dataset or a point cloud. Let *f* and *g* be two filtering functions on *X*, e.g., a grayscale function for image data or a density function for point cloud data.

Let *f* be the first filtering function with threshold set {*α*_{1} ≤ · · · ≤ *α*_{m}}. Let *X*_{i} = {*x* ∈ *X* | *f*(*x*) ≤ *α*_{i}}. Then, we get a filtering of *X* as nested subspaces *X*_{1} ⊆ *X*_{2} ⊆ · · · ⊆ *X*_{m} = *X*. By using the second filtering function, we obtain finer filtrations for each subspace *X*_{i}, where 1 ≤ *i* ≤ *m*. In particular, fix 1 ≤ *i*_{0} ≤ *m* and let {*β*_{1} ≤ · · · ≤ *β*_{n}} be the threshold set for the second filtering function *g*. Then, by restricting *g* to *X*_{i_{0}}, we get a filtering function on *X*_{i_{0}}, which produces the filtering *X*_{i_{0}1} ⊆ · · · ⊆ *X*_{i_{0}n} = *X*_{i_{0}}. By inducing a simplicial complex for each *X*_{i_{0}j}, we get a filtration Δ_{i_{0}1} ⊆ · · · ⊆ Δ_{i_{0}n}. This filtration results in a persistence diagram *PD*(*X*_{i_{0}}). For each 1 ≤ *i* ≤ *m*, we obtain *PD*(*X*_{i}). Note that after getting *X*_{i} via *f*, instead of using the second filtering function *g*, one can apply the Vietoris-Rips construction based on distances for each *X*_{i} in order to get a different filtration.

By using these *m* PDs, we follow a similar route to define our MP Fingerprints. Let *φ* be a single persistence vectorization. By applying the chosen SP vectorization *φ* to each PD, we obtain a function *φ*_{i} on the threshold domain [*β*_{1}, *β*_{n}], which can be expressed as a 1*D* (or 2*D*) vector in most cases (Section B.4). Let **φ**_{i} be the corresponding 1 × *k* vector for the function *φ*_{i}. Define the corresponding MP Fingerprint **M**_{φ} such that **φ**_{i} is the *i*^{th} row of **M**_{φ}. In particular, **M**_{φ} is a 2*D*-vector (a matrix) of size *m* × *k*, where *m* is the number of thresholds for the first filtering function *f*, and *k* is the length of the vector **φ**_{i}.
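Stacking the vectorizations of the *m* PDs row by row gives the matrix **M**_{φ}. A minimal sketch, using the Betti curve as the chosen *φ* and assuming the *m* PDs are given as lists of (birth, death) pairs:

```python
# Sketch: assemble the MP fingerprint matrix M_phi (m x k) by applying
# a single-persistence vectorization (here: Betti curve) to each of m PDs.
def mp_fingerprint(pds, thresholds):
    def betti_curve(pd_pairs):
        # features alive at each threshold t: b <= t < d
        return [sum(1 for b, d in pd_pairs if b <= t < d) for t in thresholds]
    # row i is the vectorization of the i-th persistence diagram
    return [betti_curve(pd) for pd in pds]
```

With *m* = 2 diagrams and *k* = 3 thresholds, this yields a 2 × 3 matrix, matching the *m* × *k* shape described above.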

## B.3 3*D* or higher dimensional MP Fingerprints

If one wants to use two filtering functions and get 3*D*-vectors as the topological fingerprint of the process, the idea is similar. Let *f*, *g* : *V* → ℝ be two filtering functions with threshold sets {*α*_{i}}_{1}^{m} and {*β*_{j}}_{1}^{n} respectively. Let *V*_{ij} = {*v*_{r} ∈ *V* | *f*(*v*_{r}) ≤ *α*_{i} and *g*(*v*_{r}) ≤ *β*_{j}}. Again, compute all the pairwise distances *d*(*v*_{r}, *v*_{s}) = *d*_{rs} in *G* before defining simplicial complexes. Then, for each *i*_{0}, *j*_{0}, obtain a VR-filtration for the vertex set *V*_{i_{0}j_{0}} with the distances *d*(*v*_{r}, *v*_{s}) = *d*_{rs} in *G*. Compute the persistence diagram *PD*(*V*_{i_{0}j_{0}}) for the filtration. This gives *m* × *n* persistence diagrams {*PD*(*V*_{ij})}. After vectorization, we obtain a 3*D*-vector of size *m* × *n* × *r* as before.

## B.4 Examples of MP Fingerprints

Here, we give explicit constructions of MP Fingerprints for the most common SP vectorizations. As noted above, the framework is generalizable and can be applied to most SP vectorization methods. In all the examples below, we use the following setup: let *G* be a graph, and let *f* and *g* be two filtering functions with threshold sets {*α*_{i}}_{1}^{m} and {*β*_{j}}_{1}^{n} respectively. As explained above, we first apply sublevel filtering with *f* to get a sequence of nested subgraphs *G*_{1} ⊆ · · · ⊆ *G*_{m}. Then, for each *G*_{i}, we apply sublevel filtration with *g* to get the persistence diagram *PD*(*G*_{i}). Therefore, we will have *m* PDs. In the examples below, for a given SP vectorization *φ*, we explain how to obtain a vector from each *PD*(*G*_{i}) and define the corresponding MP Fingerprint **M**_{φ}. Note that we skip the homology dimension (the subscript *k* in *PD*_{k}) in the discussion. In particular, for each dimension *k* = 0, 1, . . ., we will have one MP Fingerprint (a matrix). The most common dimensions in applications are *k* = 0 and *k* = 1.

## MP Landscapes

Persistence Landscapes λ are one of the most common SP vectorizations, introduced by [13]. For a given persistence diagram *PD* = {(*b*_{i}, *d*_{i})}, λ produces a function by using generating functions Λ_{i} for each point (*b*_{i}, *d*_{i}) ∈ *PD*, i.e., Λ_{i} : [*b*_{i}, *d*_{i}] → ℝ is the piecewise linear function obtained from two line segments starting at (*b*_{i}, 0) and (*d*_{i}, 0) and meeting at the point ((*b*_{i} + *d*_{i})/2, (*d*_{i} − *b*_{i})/2). Then, the *Persistence Landscape* function for *t* ∈ [*ϵ*_{1}, *ϵ*_{q}] is defined as

λ(*t*) = max_{i} Λ_{i}(*t*),

where {*ϵ*_{k}}_{1}^{q} are the thresholds for the filtration used.

Considering the piecewise linear structure of the function, λ is completely determined by its values at 2*q* − 1 points, i.e., at *ϵ*_{1}, *ϵ*_{1.5}, *ϵ*_{2}, *ϵ*_{2.5}, . . ., *ϵ*_{q}, where *ϵ*_{k.5} = (*ϵ*_{k} + *ϵ*_{k+1})/2. Hence, a vector of size 1 × (2*q* − 1) whose entries are the values of this function at these points suffices to capture all the needed information, i.e., **λ** = [λ(*ϵ*_{1}), λ(*ϵ*_{1.5}), λ(*ϵ*_{2}), . . ., λ(*ϵ*_{q})].

Considering that we have the threshold set {*β*_{j}}_{1}^{n} for the second filtering function *g*, each **λ**_{i} will be a vector of size 1 × (2*n* − 1). Then, taking **λ**_{i} as the *i*^{th} row for each 1 ≤ *i* ≤ *m*, the MP Landscape **M**_{λ} is a 2*D*-vector (matrix) of size *m* × (2*n* − 1).
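A sketch of the landscape vector above, sampling λ(*t*) = max_{i} Λ_{i}(*t*) at the *q* thresholds and the *q* − 1 midpoints (the Λ_{i} are the usual tent functions):

```python
# Sketch: the (first) persistence landscape sampled at 2q - 1 points.
def landscape_vector(pd_pairs, thresholds):
    def tent(b, d, t):
        # triangular generating function over [b, d], peak (d - b) / 2
        return max(0.0, min(t - b, d - t))
    # sample points: thresholds interleaved with their midpoints
    ts = []
    for k in range(len(thresholds) - 1):
        ts.append(thresholds[k])
        ts.append((thresholds[k] + thresholds[k + 1]) / 2.0)
    ts.append(thresholds[-1])
    return [max((tent(b, d, t) for b, d in pd_pairs), default=0.0)
            for t in ts]
```

For *q* = 3 thresholds the result has 2·3 − 1 = 5 entries, matching the 1 × (2*q* − 1) size above. (This sketch computes only the first landscape function λ = λ_{1}; deeper landscape levels take the *i*-th largest value instead of the maximum.)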

## MP Persistence Images

The next SP vectorization in our list is Persistence Images [3]. Unlike most SP vectorizations, Persistence Images produce 2*D*-vectors. The idea is to capture the location of the points in the persistence diagrams with a multivariable function by using 2*D* Gaussian functions centered at these points. For each (*b*_{i}, *d*_{i}) ∈ *PD*, let *ϕ*_{i} represent a 2*D* Gaussian centered at the point (*b*_{i}, *d*_{i}) ∈ ℝ^{2}. Then, one defines the multivariable function, the *Persistence Surface*, μ = Σ_{i} *w*_{i}*ϕ*_{i}, where *w*_{i} is a weight, mostly a function of the life span *d*_{i} − *b*_{i}. To represent this multivariable function as a 2*D*-vector, one defines a *k* × *l* grid (resolution size) on the domain of μ, i.e., the threshold domain of *PD*. Then, one obtains the *Persistence Image*, a 2*D*-vector (matrix) of size *k* × *l*, where each entry is the integral of μ over the corresponding pixel (rectangle) Δ_{rs} in the *k* × *l* grid.

This time, the resolution size *k* × *l* is independent of the number of thresholds used in the filtering; the choice of *k* and *l* is completely up to the user. Recall that by applying the first function *f*, we have the nested subgraphs *G*_{1} ⊆ · · · ⊆ *G*_{m}. For each *G*_{i}, the persistence diagram obtained by sublevel filtration with *g* induces a 2*D* vector μ_{i} of size *k* × *l*. Then, define the MP Persistence Image **M**_{μ} such that μ_{i} is the *i*^{th} floor of **M**_{μ}. Hence, **M**_{μ} is a 3*D*-vector of size *m* × *k* × *l*, where *m* is the number of thresholds for the first function *f* and *k* × *l* is the chosen resolution size for the Persistence Image.
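A minimal sketch of one persistence image floor: here the pixel integral is approximated by the surface value at the pixel center, with life-span weights *w*_{i} = *d*_{i} − *b*_{i} (an assumption; other weights, normalizations, and exact pixel integration are equally valid choices).

```python
# Sketch: persistence image via a Gaussian persistence surface,
# evaluated at pixel centers (center-value approximation of the integral).
import math

def persistence_image(pd_pairs, xs, ys, sigma=0.5):
    """xs, ys: pixel-center coordinates of the k x l grid.
    Weight w_i = d_i - b_i (assumed choice)."""
    def gauss(cx, cy, x, y):
        return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
    return [[sum((d - b) * gauss(b, d, x, y) for b, d in pd_pairs)
             for x in xs] for y in ys]
```

Applying this to each of the *m* diagrams and stacking the resulting *k* × *l* matrices yields the *m* × *k* × *l* tensor **M**_{μ} described above.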

## MP Betti Summaries

Next, we describe an important family of SP vectorizations, Persistence Curves [25]. This is an umbrella term for several different SP vectorizations, e.g., Betti Curves, Life Entropy, Landscapes, etc. Our MP Fingerprint framework naturally adapts to all Persistence Curves to produce multidimensional vectorizations. As Persistence Curves generally produce a single-variable function, they can all be represented as 1*D*-vectors by choosing a suitable mesh size depending on the number of thresholds used. Here, we describe one of the most common Persistence Curves in detail, i.e., Betti Curves. It is straightforward to generalize the construction to other Persistence Curves.

Betti curves are one of the simplest SP vectorizations, as they give the count of topological features at each threshold. In particular, *β*_{k}(Δ) is the total count of *k*-dimensional topological features in the simplicial complex Δ, i.e., *β*_{k}(Δ) = *rank*(*H*_{k}(Δ)). Then, the Betti curve is the step function defined as

*β*(*t*) = *β*_{k}(Δ_{i}) for *t* ∈ [*ϵ*_{i}, *ϵ*_{i+1}),

where {*ϵ*_{i}}_{1}^{q} represent the thresholds for the filtration used. Considering this is a step function that is constant on each interval [*ϵ*_{i}, *ϵ*_{i+1}), it can be perfectly represented by a vector of size 1 × *q* as **β** = [*β*(*ϵ*_{1}), *β*(*ϵ*_{2}), . . ., *β*(*ϵ*_{q})].

Then, with the threshold set {*β*_{j}}_{1}^{n} for the second filtering function *g*, each **β**_{i} will be a vector of size 1 × *n*. Then, taking **β**_{i} as the *i*^{th} row for each 1 ≤ *i* ≤ *m*, the MP Betti Summary **M**_{β} is a 2*D*-vector (matrix) of size *m* × *n*. In particular, each entry of **M**_{β} = [*m*_{ij}] is just the Betti number of the corresponding clique complex in the bifiltration, i.e., *m*_{ij} = *β*_{k}(Δ_{ij}). This matrix **M**_{β} is also called the *bigraded Betti numbers* in the literature, and it is computationally much faster to obtain than other vectorizations [58, 51].
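Since each entry *m*_{ij} is just a Betti number of a clique complex in the bifiltration, the 0-dimensional case (*k* = 0, connected components) can be sketched directly with union-find, with no persistence diagram computation. This is a minimal sketch assuming the graph and the two filtering functions are given as plain containers:

```python
# Sketch: bigraded Betti numbers M_beta for k = 0 (connected components)
# over the sublevel bifiltration of two node functions f and g.
def betti0_bigraded(nodes, edges, f, g, alphas, betas):
    def beta0(V, E):
        # count connected components with union-find
        parent = {v: v for v in V}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        for u, w in E:
            parent[find(u)] = find(w)
        return len({find(v) for v in V})

    M = []
    for a in alphas:
        row = []
        for b in betas:
            V = [v for v in nodes if f[v] <= a and g[v] <= b]
            E = [(u, w) for u, w in edges if u in V and w in V]
            row.append(beta0(V, E))
        M.append(row)
    return M
```

Higher homology dimensions require actual boundary-matrix rank computations, which is where the minimal representations of [58, 51] come in.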

## B.5 MP Vectorization with Other Filtrations

In our paper, besides the simple bifiltration explained in Section 4, we also use the following two filtrations. In the Vietoris-Rips filtration, we use the graph geodesic (VR-filtration) as our natural slicing direction. The motivation for this choice is that *VR-filtration captures the fine intrinsic structure of the graph by using the pairwise distances between the nodes (atoms)*. With the weight filtration, we can effectively utilize the bond strengths of compounds in our construction.

## Vietoris-Rips filtration

Here, we describe our VR construction for 2*D* multipersistence. The construction can easily be extended to 3*D* or higher dimensions (see Appendix B.3). Let *G* = (*V*, *E*) be a graph, and let *f* : *V* → ℝ be a filtering function (e.g., atomic mass, partial charge, bond type, electron affinity, ionization energy) with threshold set {*α*_{1} ≤ · · · ≤ *α*_{m}}. Let *V*_{i} = {*v* ∈ *V* | *f*(*v*) ≤ *α*_{i}}. This defines a hierarchy among the nodes with respect to the function *f*.

Before constructing simplicial complexes, compute the distances between each pair of nodes in the graph *G*, i.e., *d*(*v*_{r}, *v*_{s}) = *d*_{rs} is the length of the shortest path from *v*_{r} to *v*_{s}, where each edge has length 1. Let *K* = max *d*_{rs}. Then, for each 1 ≤ *i*_{0} ≤ *m*, define the VR-filtration for the vertex set *V*_{i_{0}} with the distances *d*(*v*_{r}, *v*_{s}) = *d*_{rs} as described in Section 3.1 (see Figure 5). This gives *m* × (*K* + 1) simplicial complexes {Δ_{ij}}, where 1 ≤ *i* ≤ *m* and 0 ≤ *j* ≤ *K*.

This is called the *bipersistence module*. One can imagine the increasing sequence of vertex sets *V*_{i} as the vertical direction, and the induced VR-complexes {Δ_{ij}} as the horizontal direction. In our construction, we fix the slicing direction as the horizontal direction (the VR-direction) in the bipersistence module, and obtain the persistence diagrams in these slices.

In the toy example in Figure 5, we use a small graph instead of a real compound to keep the exposition simple. Our sublevel filtration (vertical direction) comes from the degree function. The degree of a node is the number of edges incident to it. In the first column, we simply see the single sublevel filtration of the graph by the degree function. In each row, we develop the VR-filtration of the subgraph by using the graph distances between the nodes. Here, the graph distance between two nodes means the length of the shortest path (geodesic) in the graph, where each edge is taken as length 1. In the second column, we add the edges for the nodes whose graph distance is equal to 2. In the third column, we add the (blue) edges for the nodes whose graph distance is equal to 3. Finally, in the last column, we add the (red) edges for the nodes whose graph distance is equal to 4. By construction, all the graphs in the last column must be complete graphs, as there are no more edges to add.

After getting the bifiltration, the following steps are the same as in Section 4. In particular, for each 1 ≤ *i*_{0} ≤ *m*, one obtains a single filtration in the horizontal direction. This single filtration gives a persistence diagram *PD*(*V*_{i_{0}}) as before. Hence, one obtains *m* persistence diagrams. Again, by applying a vectorization *φ*, one obtains *m* row vectors of fixed size *r* (e.g., *r* = *K* + 1 for Betti curves). This induces a 2*D*-vector **M**_{φ} (a matrix) of size *m* × *r* as before.
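For one horizontal slice *V*_{i_{0}} of the bipersistence module, the 0-dimensional persistence diagram of the VR (power) filtration can be sketched with union-find over the pairwise graph distances. A minimal sketch, assuming the distances *d*_{rs} have already been computed (e.g., by BFS):

```python
# Sketch: 0-dim persistence diagram of the VR (power) filtration
# restricted to one vertex slice V of the bipersistence module.
def pd0_vr_slice(V, dist):
    """V: vertex list; dist: dict mapping (u, w) to graph distance d_rs.
    Every vertex is born at scale 0; a component dies when it merges."""
    parent = {v: v for v in V}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    # process vertex pairs in order of increasing distance
    pairs = sorted((dist[(u, w)], u, w) for u in V for w in V if u < w)
    pd = []
    for n, u, w in pairs:
        ru, rw = find(u), find(w)
        if ru != rw:
            parent[ru] = rw
            pd.append((0, n))          # one component dies at scale n
    pd.append((0, float("inf")))       # the surviving component
    return pd
```

Repeating this for each of the *m* slices gives the *m* persistence diagrams that are then vectorized into **M**_{φ}.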

While computationally more expensive than the others, VR-filtration can be very effective for some VS tasks, as it detects the graph distances between atoms and the sizes of the topological features [59, 2]. Note that VR-filtration, when used for unweighted graphs with the graph distance, is known as the “power filtration” in the literature. For further details on VR-filtration, see [5, 27].

## Weight filtration

For a given weighted graph *G* = (*V*, *E*) with edge weights {*w*_{rs}}, it is common to use the edge weights to define a filtration. For example, in our case, one can take the bond strengths in the compounds as edge weights. Choose a threshold set *θ*_{1} ≤ · · · ≤ *θ*_{n}. For each *θ*_{j}, let *E*_{j} = {*e*_{rs} ∈ *E* | *w*_{rs} ≤ *θ*_{j}}, and let *G*_{j} be the subgraph of *G* induced by *E*_{j}. This induces a nested sequence of subgraphs *G*_{1} ⊆ *G*_{2} ⊆ · · · ⊆ *G*_{n} = *G*.

In the case of weighted graphs, one can apply the MP vectorization framework just by replacing the first filtering (via *f*) with weight filtering. In particular, let *g* be a filtering function with threshold set {*α*_{i}}_{1}^{m}. Then, one can first apply weight filtering to get {*G*_{j}} as above, and then apply *g* to each *G*_{j} to get a bifiltration (*m* × *n* resolution). One gets *m* PDs and induces the corresponding **M**_{φ}. Alternatively, one can change the order: first apply *g* to get a different filtering induced by *g*, and then apply edge weight filtration to each subgraph, which gives a bifiltration (*n* × *m* resolution) this time. As a result, one gets *n* PDs and induces the corresponding **M**_{φ}. The difference is that in the first case (first apply weights, then *g*), the filtering function *g* plays the more important role, as **M**_{φ} uses the PDs induced by *g*, while in the second case (first apply *g*, then apply weights) the weights play the more important role, as **M**_{φ} is induced by the weight filtration. Note also that there is a different filtration method for weighted graphs, obtained by applying the VR-complexes method described above.

In our applications, we use the weight filtration to express the bond strengths in the compounds: a single bond has weight 1, a double bond weight 2, a triple bond weight 3, and an aromatic bond weight 4 on the edges.
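The weight filtration with these bond-strength thresholds can be sketched as a nested sequence of edge sets (a minimal illustration with hypothetical atom labels):

```python
# Sketch: weight filtration on bond strengths.
# single = 1, double = 2, triple = 3, aromatic = 4
def weight_filtration(edges_with_weights, thresholds):
    """G_j keeps the edges whose weight is at most the j-th threshold,
    yielding the nested sequence G_1 <= G_2 <= ... <= G_n."""
    return [[e for e, w in edges_with_weights if w <= t] for t in thresholds]

bonds = [(("C1", "C2"), 1),   # single bond
         (("C2", "C3"), 2),   # double bond
         (("C3", "C4"), 4)]   # aromatic bond
fil = weight_filtration(bonds, [1, 2, 4])
```

Each subsequent edge set contains the previous one, so the induced clique complexes form a filtration as required.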

## C Further Experimental Results

## C.1 Dataset Statistics

## C.2 Baselines

We compare our methods against 23 state-of-the-art baselines, including USR [8], ROCS [38], PS [42], GZD [92], PH_VS [50], USR + GZD [84], USR + PS [84], USR + ROCS [84], GZD + PS [84], GZD + ROCS [84], PS + ROCS [84], Findsite [101], Fragsite [102], Gnina [87], GOLD-EATL [96], Glide-EATL [96], CompM [96], CompScore [75], CNN [77], DenseFS [44], SIEVE-Score [98], DeepScore [94], and RF-Score-VSv3 [98].

In particular, we compare our methods against the well-known 3*D* methods Ultrafast Shape Recognition (USR) [8], the shape-based, ligand-centric method ROCS [38], PatchSurfer (PS) [42], Zernike (GZD) [92] and PH_VS [50] on the Cleves-Jain dataset. In Table 1, we report the performances of all these 3*D* methods with 50 conformations [84], except PH_VS, which uses 1 conformation [50].

In Table 2, we compare our models against other state-of-the-art VS methods on the DUD-E Diverse dataset. All of these ML methods were introduced in recent years. Among these, CNN [77] uses a convolutional neural network based framework with a GPU-accelerated layer (i.e., MolGridDataLayer) to define a scoring function for protein-ligand interactions. DenseFS [44] improves the CNN model with a densely connected convolutional neural network architecture. Later, SIEVE-Score proposed an effective scoring function, the similarity of interaction energy vector score (SIEVE) [98], which extracts van der Waals, Coulomb and hydrogen-bonding interactions to represent docking poses for ML. The Random Forest based protein-ligand scoring function (RF-Score) [98] is another VS method proposed in the same work. CompScore [75] uses genetic algorithms to find the best combination of scoring components. Recently, energy auxiliary terms learning (EATL) [96] proposed an approach based on the scoring components extracted from the output of the scoring functions implemented in several popular docking programs, such as Genetic Optimisation for Ligand Docking (GOLD) [49], Molecular Operating Environment (MOE) [1] and Grid-based Ligand Docking with Energetics (GLIDE) [32, 35]. In the same work, the authors also combined these three methods and produced comprehensive EATL models, CompF and CompM. DeepScore [94] defines a deep learning based target-specific scoring function. Findsite [101] proposes a threading/structure-based method, which was later improved by Fragsite [102] using a tree regression ML framework. Finally, Gnina [87] is a recently released molecular docking software package, which uses deep convolutional networks to score protein-ligand structures. Note that all methods use 5-fold CV except Findsite and Fragsite (3-fold CV) and Gnina.

## C.3 Model performance across different modalities

Tables 5, 6, 7 and 8 show detailed ablations of the modalities used in the graph filtration step of ToDD. The success of each periodic property varies per drug target and trained ML model. However, merging the MP fingerprints derived from each of these domain functions used for graph filtration always improved performance.

Our results also show that MP fingerprints enable successful predictions even when training on datasets with few labeled samples, such as Cleves-Jain (with 2-3 labeled compounds per drug target), using a transformer-based model.

RF shows worse performance than ViT on small-scale datasets such as Cleves-Jain, despite regularization by bootstrapping and the use of pruned, shallow trees. Additionally, RF is more sensitive to small variations in the training set, and imbalanced data can hamper its accuracy. In order to effectively handle large-scale datasets that have long-tailed distributions, we undersample the majority class (decoys). Specifically, while training RF for the binary classification task on the drug targets of DUD-E Diverse (class distributions are summarized in Table 4), we use 80% of the active compounds and the same number of randomly chosen decoys for training. Undersampling decoys to avoid heavy class imbalance achieves better trade-offs between the accuracies on active compounds and decoys.

## C.4 Computation Time

See Section 6.3 in the main text for details.

## C.5 Model performance using Morgan fingerprints

See Section 6.4 in the main text for details.

## D Societal Impact and Limitations

## D.1 Societal Impact

We perform in silico experiments and use high-throughput screening to recognize active compounds that can bind to a drug target of interest, e.g., an enzyme or protein receptor without conducting research on any human or animal subjects. Our overarching goal is to augment the capabilities of AI to enhance the in silico virtual screening and drug discovery processes, develop new drugs that have less side effects but are more effective to cure diseases, and minimize the participation of human and animal subjects as much as possible to ensure their humane and proper care.

## D.2 Limitations

We discuss the computational complexity of our model in detail in Section 6.3. Our model is versatile and can be scaled to large libraries by customizing the allocated computational resources. Please note that the analysis in Appendix C.4 shows the execution time of our computation pipeline when the feature extraction task is distributed across 8 cores of a single Intel Core i7 CPU. It is possible to parallelize computationally costlier operations such as VR-filtration by allocating more CPU cores on an HPC platform and to optimize array operations (e.g., NumPy) via the joblib library. Furthermore, all ToDD models require substantially fewer computational resources during training compared to current graph-based models that encode a compound through mining common molecular fragments, a.k.a. motifs [47]. For instance, training a motif-based GNN on the GuacaMol dataset, which has approximately 1.5M drug-like molecules, takes 130 hours of GPU time [64]. In contrast, once we generate the topological fingerprints via Vietoris-Rips filtration, training ToDD-ViT and ToDD-ConvNeXt for each individual drug target takes less than 1 hour on a single GPU (NVIDIA RTX 2080 Ti).

## Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

Did you describe the limitations of your work? [Yes] See Appendix D.

Did you discuss any potential negative societal impacts of your work? [Yes] See Appendix D.

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? [Yes]

Did you include complete proofs of all theoretical results? [Yes] Given in Appendix B.1

If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Dataset links are provided. See Section 5.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 6.1

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] We reported the standard deviation of our experiments evaluated by 5-fold cross-validation. See Table 1 and 2 in Section 6.2.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 6.3

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? [Yes]

Did you mention the license of the assets? [No] They use public-domain-equivalent license.

Did you include any new assets either in the supplemental material or as a URL? [No]

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

## Footnotes

coskunuz@utdallas.edu, bulent.kiziltan@novartis.com

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Cleves-Jain dataset: https://www.jainlab.org/Public/SF-Test-Data-DrugSpace-2006.zip

DUD-E Diverse dataset: http://dude.docking.org/subsets/diverse
