Prioritizing virtual screening with interpretable interaction fingerprints

Machine learning-based drug discovery success depends on molecular representation. Yet traditional molecular fingerprints omit both the protein and pointers back to structural information that would enable better model interpretability. Therefore, we propose LUNA, a Python 3 toolkit that calculates and encodes protein-ligand interactions into new hashed fingerprints inspired by Extended Connectivity Finger-Print (ECFP): EIFP (Extended Interaction FingerPrint), FIFP (Functional Interaction FingerPrint), and Hybrid Interaction FingerPrint (HIFP). LUNA also provides visual strategies to make the fingerprints interpretable. We performed three major experiments exploring the fingerprints’ use. First, we trained machine learning models to reproduce DOCK3.7 scores using 1 million docked Dopamine D4 complexes. We found that EIFP-4,096 performed (R2 = 0.61) superior to related molecular and interaction fingerprints. Secondly, we used LUNA to support interpretable machine learning models. Finally, we demonstrate that interaction fingerprints can accurately identify similarities across molecular complexes that other fingerprints over-look. Hence, we envision LUNA and its interface fingerprints as promising methods for machine learning-based virtual screening campaigns. LUNA is freely available at https://github.com/keiserlab/LUNA.


Introduction
In the last few decades, machine learning (ML), and in particular deep neural networks (DNN), have opened up new opportunities in all stages of drug discovery and development. [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19] In particular, ML has the potential to revolutionize structural-based virtual screening (SBVS) campaigns through the prioritization and automatic selection of promising compounds. SBVS nowadays 20-23 commonly involves a manual and intensive process called hitpicking, in which researchers evaluate and determine hit selection through molecular graphics programs to inspect the binding modes of hundreds of top-scoring compounds. This process is time-intensive and highly dependent on the researcher's expertise. Thus, automating hit selection would bring great benefit to the drug discovery community.
However, the success of any ML-based drug discovery endeavor heavily relies on data representation. [24][25][26] Not surprisingly, several representations have been proposed since the early days of chemoinformatics. One of the most popular manners to represent molecules or protein-ligand complexes is the so-called "fingerprint" (FP), a bit vector in which each bit accounts for the presence (1s) or absence (0s) of a given chemical feature or substructure. 27 Alternatively, bits can be substituted by the explicit frequency (count) of a given chemical feature or substructure.
Recently, learned representations 6,61,62 have also become popular with advancements in DNN techniques such as convolutional neural networks, 8,63,64 graph convolutional networks, [65][66][67] variational autoencoders, 68 and deep metric learning. 69 In summary, these approaches transform raw input features into a continuous learned representation, which means that the data representation is learned directly from the training data. The advantage of learned representations over traditional FPs is that they are unique and do not suffer from information loss due to bit collisions. However, they require considerably more data to achieve generalization and learn even simple patterns, 25 and have yet to prove their superiority consistently over traditional FPs. 70,71 In this paper, we present LUNA 1 , an object-oriented Python 3 toolkit for drug design that makes it easy to analyze very large data sets of 3D molecular structures and complexes, and that allows identifying, filtering, and visualizing atomic interactions. LUNA also im-  To validate and illustrate the applicability of the IFPs, we first present a case study where we trained several ML models on a data set containing 1 million molecules docked against Dopamine D4 72 to reproduce their DOCK3.7 73 scores as in a re-scoring task. We then compare our results to four related FPs: Extended-Connectivity FingerPrint (ECFP), 58 Functional-Class FingerPrint (FCFP), 58 Extended Three-Dimensional FingerPrint (E3FP), 59 and Protein-Ligand Extended Connectivity (PLEC). 60 Secondly, we demonstrate how key interactions emerge from looking at FP features with an interpretability study. Finally, we present two experiments where we evaluate the capacity of the novel FPs to distinguish ligands by pose similarity and to assess how different feature representations impacted these analyses. To do so, we built five protein-ligand complex data sets with varied diversity and complexity, and compared the results to ECFP, FCFP, E3FP, and PLEC.

Results
Novel hashed interaction fingerprints In this work, we propose three novel hashed IFPs inspired by ECFP, 58 FCFP, 58 and E3FP. 59 ECFP and FCFP are 2D hashed MFPs that encode molecular graph connectivity using either explicit atomic substructures or "coarse-grained" pharmacophoric properties, respectively. E3FP, another hashed MFP, attempts to circumvent ECFP's and FCFP's lack of explicit 3D molecular structure patterns. However, MFPs do not explicitly encode protein-interface information. PLEC, 60 a hashed IFP also motivated by ECFP, encodes interface information, but only in the form of proximal protein-ligand interactions, without interaction typing. It also omits pointers back to structural information that would enable better model interpretability. Thus, we develop the new IFPs to extend prior approaches to nonbonded interaction patterns at the ligand-protein interface and call them EIFP, FIFP, and HIFP.
While EIFP and FIFP apply the logic of ECFP and FCFP, respectively, HIFP adopts a "hybrid" approach, encoding pharmacophoric properties for atom groups and precise environment information for atoms. All these IFPs are RDKit 74 -compatible and their features represent interactions and contacts between protein residues, ligand atoms, and water molecules, by detecting their presence or absence (bit FPs), or their frequency (count FPs).
The IFPs rely on three parameters: the FP length, the radius growth rate, and the number of levels ( Figure 1). While the FP length determines the maximum number of features that can be individually represented in it, the radius growth rate and the number of levels control the number of features generated. In Section Interaction fingerprints, we discuss how these FPs are generated and how these parameters influence the featurization process.
The new IFPs are provided as part of LUNA, an object-oriented Python 3 toolkit for drug discovery. In addition, LUNA implements several features geared towards the analysis of molecular complexes, such as: a) accepting any molecular complex type, including protein-ligand and protein-protein; b) accepting multiple file formats, including PDB, MOL, and MOL2; c) providing pre-and post-processing functions to control how interactions are calculated or selected based on geometric constraints; and d) providing several functions to summarize, characterize, and visualize molecular interactions in Pymol ( Figure 2). The same view shown in (a) but rotated and filtered for hydrogen bonds (blue) and a displaced face-to-face stacking (red), which can be easily achieved directly through Pymol as shown in (c). (d) Two similar bits (neighborhoods) found in the FPs of two other CDK2-inhibitor complexes (PDBs 3QQF and 3QWJ, top and bottom). Arrows represent directional interactions.
Most importantly, LUNA provides powerful functionalities to make the IFPs interpretable.
One drawback of hashed FPs is that the structural information encoded in a bit is usually discarded after hashing. On top of that, due to FP folding, multiple substructures may end up encoded in the same bit, the so-called collision problem. To address this problem, LUNA keeps a record of the original features hashed into a new bit position. Thus, users can recover and reconstruct the neighborhood (collection of atoms or atom groups and their interactions) that generated a particular bit, and they can select one or more recovered neighborhoods to perform statistical analysis or export them to a rich Pymol session, where they will be able to visually assess the information present in a bit position (Figure 2d).
In the following sections, we will explore the potential of LUNA and the novel IFPs through a ML case study, an interpretability study, and demonstrate how they can be applied for detecting similar binding modes, an essential task in virtual screening.
Interaction fingerprints improve DOCK3.7 score regression. To assess the use of the IFPs for prioritizing hit compounds, we trained DNN models using a large data set recently published by Lyu et al.. 72 This library consists of ∼96 million molecules docked against the Dopamine D4 receptor (D4R), a G protein-coupled receptor (GPCR) superfamily member involved in many different roles in the central nervous system. Given its relevance as a neurotransmitter, dysregulation of the Dopamine D4 signalling cascade is linked to several pathological disorders including Parkinson's disease and schizophrenia. 75 Departing from Lyu et al.'s work, we built a random diverse and stratified data set based on DOCK3.7 score bins in order to achieve model generalizability through chemical diversity and representativeness across the DOCK3.7 score distribution. In total, we obtained 1 million molecules, from which a random 90% and 10% were set apart for model development and held-out testing, respectively.
As baselines, we chose the RDKit implementations of ECFP and FCFP. We also chose E3FP and PLEC for further comparison. FP parameters remained as default, except for the cases described in Section Model development and training. Additionally, by default, PLEC generated FPs of length 16,384. However, we hypothesized that a length of 4,096 might lead to results matching its longer version. To test this hypothesis, we optimized DNN parameters for the smaller PLEC version as well. To specify length, we will refer to a FP using its name followed by a hyphen and its length. For instance, PLEC-4,096 and First, we performed a hyperparameter search on LUNA's and the IFP parameters (Table   S1). Given the large parameter space to cover, we defined a separate and smaller training subsample of size 200,000 using the same strategy we described for the 1 million molecules data set, used exclusively for this experiment. From these, 20% was set apart for model validation. In all experiments, we used the best architecture found in previous investigations (results not shown) and, in total, 294 parameter combinations were evaluated (Figures S1 and S2; Excel file in the Supporting Information). Overall, we observed that FPs with fewer levels, high radius growth rate, precise atomic topology definition (EIFP), non-covalent interactions, and count information produced the best models (Section Interaction fingerprints details how these parameters influence the featurization process). In Table S2, we show the parameters for the best IFP (EIFP-4,096 ; R 2 = 0.52) selected for subsequent analysis.
The DNN hyperparameter search was employed on the space shown in Table S3 Finally, we applied the resulting DNN models to the test set, which consisted of 100,000 unseen protein-ligand complexes (Table 1). Overall, PLEC-16,384, EIFP-4,096, and EIFP-16,384 were the best FPs, followed by PLEC-4,096. 5-fold cross-validation showed similar results and demonstrated the differences between models were significant (Tables S4-S6 and Figure S16), which again highlights the best trade-off between FP length and performance of EIFP-4,096.
We observed that the validation and test set performances were similar. Considering that the diverse cluster heads (molecules) composing the sets used for model development (training and validation sets) and testing were mutually exclusive, these results suggested we succeeded at constructing balanced data sets by guaranteeing chemical diversity and representativeness across the DOCK3.7 score distribution. To test this hypothesis, we conducted  As the training and validation sets may share common cluster heads, we expected the model performance for the validation set to be higher than that of the test set when the model was trained with fewer data examples. However, the R 2 obtained for EIFP-4,096 across the different cuts showed that the performance of validation and test sets reduced proportionally and were still similar even after the cuts on the training set exceeding two orders of magnitude ( Figure 4). These findings confirmed our hypothesis that we obtained considerably generalized models due to the chemical structural diversity and representation across the DOCK3.7 score distribution.
Additionally, we observed that from subsample size 112,500 downwards, the models started overfitting. Also, the subsample 112,500 could be considered a good starting point data set size given its reasonable performance. Notably, the slope on the right-side of Figure   4 also indicated that the model could further improve if more data points were provided as input.
Larger IFPs ( Figure 4) showed similar behavior. However, the overfitting problem started earlier (after the subsample 225,000), indicating that larger FPs are more of a risk than a benefit for small data sets.

Key interactions emerge from fingerprint features. To illustrate how
LUNA can be used to support ML model interpretability, we performed a feature attribution analysis using the XGBoost library, whereby each feature received a score based on its contribution to the model's predictions. We favored XGBoost models (performance reported in Table S5) rather than the DNNs for this analysis due to the comparative ease of attribution interpretability. The more complicated attribution techniques for DNNs such as a integrated gradients 76 rely on defining a baseline input hyperparameter, 77,78 the ideal choice of which is less obvious for MFPs.
We trained the XGBoost model using the best EIFP identified in previous analysis and Due to bit collision resulting from FP folding, some features may be assigned to multiple classes. Thus, whenever it was not possible to assign a predominant class (threshold of 69.49%, chosen by z-scores), a feature was classified as 'unreliable' (see Section Model interpretability for more details).
Analysis of the feature importance scores revealed that a small number of features comprised the major contribution to model performance. To identify them, we calculated z-scores for each feature using their importance scores, which were then converted into p-values (see Section Model interpretability for more details). Most important features were then defined as those having p-values less than 0.01. In total, 38 features matched this requirement.
Among the most important features' classes, 34% (13) comprised shells containing PLIs, 29% (11) represented ligand atomic information, and the remaining classes sum to 37% (11) ( Figure S17). Another three features were considered unreliable as their most prevalent class had percentages smaller than the minimum value (69.49%) ( Figure S18). Interestingly, when accounting for all features regardless of importance, those containing PLIs were the rarest.
PLI features were more frequent among the most potent ligands (Figures S19 and S20).
We then assessed which interactions were encoded in the thirteen most important features comprising PLIs. As multiple interactions were assigned to the same feature ( Figure S21), we calculated z-scores with the percentage of the most prevalent interaction in each feature.
Then, a threshold to identify the most prevalent interaction in a given feature was defined as the minimum percentage (82.78%) among features presenting z-scores higher than 1. Two features presented percentages below the threshold and were classified as unreliable feature.
Prevalent interacting residues were also identified with the same threshold. We summarized the number of ligands having the most prevalent interactions and identify which residue After identifying the prevalent interactions, we observed that the three best features comprised ionic interactions with ASP115 and two others formed hydrogen bonds with that  Interaction fingerprints enhance their 2D counterparts. To validate the proposed IFPs and assess how different molecular and interaction information impact the identification of similar complexes, we compared the IFPs to molecular (ECFP, FCFP, and E3FP) and interaction (PLEC) FPs. Given the close relationship between molecular structure and interactions, we expected to find a high correlation between these FPs regarding their similarities. We also generated IFPs with restrictions on the encoded information as Figure 8: (a) Correlation between different FPs across five data sets with varied protein and ligands diversity. Correlations were obtained by calculating the Tanimoto (T C ) similarity between pairs of FPs of a given type. The overall correlation between different FP pairs was also assessed by joining all data sets into a single one (All data sets). Color scales for all data sets vary from 0.5 to 1, except for data set 1 whose color scales vary from 0 to 1. Light gray squares represent untested pairs. (b) Correlation between the ligand pairs' RMSDs and the FP similarities. The color scale varies from no correlation (blue color) to high negative correlation (red color).
a means to simulate ECFP, FCFP, E3FP, and PLEC using the LUNA toolkit, for further comparison. ECFP-like EIFP, for instance, contained only intra-ligand covalent interactions, as in classical ECFP. The complete list of features for each IFP is presented in Table 2. All FPs were generated considering a length of 4,096 bits and without encoding counts. Table 2: Parameters employed to construct each version of the IFPs.
With water Intra-ligand * E3FP considers any two non-bonded atoms as proximal. What defines if a pair will be included in the FP or not is the size of each sphere. TM-score 83 and T C measure proteins and ligands similarities, respectively.
In this analysis, we used five different data sets, progressively increasing protein and ligand variability, in order to evaluate how well FP similarity alone could discriminate similar poses as feature complexity progressed. The first data set contained two series of conformers for the ligand X02 in complex with the human cyclin-dependent kinase 2 (CDK2; PDB id 3QQK). The maximum conformers' distance to the crystal pose at each series were 0.4Å and 0.8Å, respectively. The second data set contained 74 CDK2-ligand complexes solved through X-ray crystallography and obtained from Schonbrunn et al.. 82 The third data set contained all CDK2 complexes found in the RCSB PDB (350 PDB ids in total). The fourth data set comprised non-CDK2 kinase complexes obtained from RCSB PDB (1,364 PDB ids in total). Finally, the fifth data set were also obtained from RCSB PDB and covered randomly selected diverse classes of proteins except for kinases (37,507 PDB ids in total).
Fourth and fifth data sets were balanced to avoid bias towards a receptor and only crystal structures with X-ray resolution below 2Å were retained. All complexes were recovered from RCSB PDB in May 2020. In Table 3, we presented each data set and the mean and standard deviation (std) for proteins' and ligands' similarities.
After generating FPs for complexes from data sets 1-5, we calculated the Tanimoto  Additionally, analysis of data set 1 revealed a negative correlation between the ligands' RMSDs and the IFP similarities (Figures 8b, S29, and S30). We also observed that similarity data points of IFPs tended to form groups according to the RMSD range ( Figures S23 and   S24). These expected observations highlighted the capacity of the IFPs and PLEC to identify similar binding modes given by ligands with very similar conformation.
The only exception for the overall RMSD trend was found for E3FP, which did not correlate either with the RMSD data or with EIFP and PLEC ( Figure 8). In data set 1, E3FP tended to separate data points into two regions based on FP similarities (first boxes in Figures S29 and S30). However, further separation by RMSD was not possible as low-and high-RMSD data points lay down at the same similarity range. Interestingly, the same has already been observed in Wang et al., 64 in which molecules with low RMSD produced small molecular similarities. Only after setting the FP parameters as in EIFP (E3FP w/ EIFP params), a reasonable correlation with the RMSD and EIFP emerged. Thus, considering that E3FP was conceived and parameterized to compare different ligands and not conformers of a single ligand, we believe the default configuration may not be appropriate in this scenario as it was not intended to distinguish two poses of the same ligand regardless of its RMSD.
Finally, we also found a near-random correlation between EIFP and PLEC in the context of conformers with high RMSDs (> 0.8Å) at data set 1 ( Figure S24). Setting PLEC with the same number of levels and radius growth rate from EIFP did not increase the correlation as observed for E3FP. A high correlation (r ∼ 0.84) was only recovered with PLEC-like EIFP, an EIFP version that mimicked PLEC implementation. Therefore, we believe the low observed correlation between PLEC and EIFP may be caused primarily by the difference between their implementations.
Interaction fingerprints identify similar poses. The need to rank molecules based on molecular or binding mode similarity comes up frequently in drug discovery. To evaluate the use of IFPs in distinguishing ligands by binding mode similarity, we again turned to data sets 1-5 (Table 3). We compared the IFPs against ECFP, FCFP, E3FP, and PLEC for this task.
In this analysis, we separated the pairs of complexes based on their proteins and ligands into six groups: complexes with the same proteins and ligands; complexes with the same proteins and ligands with high molecular similarity (ECFP T C > 0.5); complexes with the same proteins and ligands with low molecular similarity (ECFP T C ≤ 0.5); complexes with different proteins and the same ligands; complexes with different proteins and ligands with high molecular similarity (ECFP T C > 0.5); and complexes with different proteins and ligands with low molecular similarity (ECFP T C ≤ 0.5). By doing so, we expected to reveal that complexes comprising the same proteins and ligands with high molecular similarity (ECFP T C > 0.5) would tend to present the highest IFP similarities as well.
Indeed, we found that ligands with low molecular similarities presented the lowest IFP similarities, whose values continuously reduced as we moved towards the more diverse scenarios ( Figure 9). Moreover, similarity distributions for ligands with high (ECFP T C > 0.5) and low (ECFP T C ≤ 0.5) molecular similarities over the same binding site did not overlap with EIFP and PLEC as expected. In contrast, the observed overlapping similarity distributions for complexes comprising the same ligand or ligands with high molecular similarity (ECFP T C > 0.5) bound to different proteins indicated that the binding site played the major role in defining the ligands' binding mode.
Overall, EIFP and PLEC showed similar separability trends across the different data sets and pair types. However, we also identified divergences between these FPs. For instance, the pair of complexes highlighted by the marker in Figure 9 was an example where EIFP produced a high similarity (∼ 0.71), while PLEC and E3FP attributed to them low similarities (0.42 and 0.28, respectively). Interestingly, these complexes contained different proteins (proteasome beta type-5 subunit, PDB id 5LF4; and beta type-1 proteasome subunit, PDB id 5LF7) bound to the same ligand (pentaethylene glycol). Nonetheless, despite being different proteins, they presented high structural similarities (TM-Score 83 = 0.857), which explained the high similarity between these two complexes according to EIFP. In Figures S31 and S32, we show common and exclusive features between these two complexes found with EIFP.
Finally, varying EIFP lengths (512, 1,024, 2,048, 8,192,16,384, and unfolded), or using a different IFP representation (FIFP) did not impact the separability between similar and dissimilar binding modes (Figures S33 and S34). Thus, despite the expectation that shorter FPs would produce more feature collision rates and information losses, we observed consistency on all variations of IFP.
Altogether, these findings demonstrate the capacity of the IFPs to identify complexes with high similarity. Figure 9: Similarity between FP pairs for the five data sets presented in Table 3. Pairs of complexes were divided into six groups based on their proteins and the molecular similarity (ECFP) between their ligands. Marker represents complexes with same ligands discussed in the text.

Discussion
Traditional representations such as MFPs and IFPs have been applied for a long time in drug discovery. Such simple representations are useful for fast filtering or querying a database to identify molecules by structural or binding mode similarity to a set of known molecules and complexes. Although these FPs were also widely applied in ML-based drug discovery, there is still a need for new methods that satisfactorily capture the molecular recognition With the first experiment, our goal was to assess if IFPs could reproduce the observed DOCK3.7 scores as in a re-scoring task. To do so, we leveraged a large dopamine D4 receptor docking data set by Lyu et al. 72  Interestingly, the best IFP configuration included intra-residue interactions. Intuitively, one might have expected that because the protein structure is always the same across all complexes in a single-target docking campaign, the intra-residue interactions would always be the same for the protein structure, and thus yield no meaningful contribution to the model. However, the ligand-pose-specific inclusion of intra-residue interactions and the residues themselves do contribute, by augmenting local ligand context and implicitly providing information about the binding site region occupied by the ligand.
After identifying the best IFP, we trained multiple DNN models on the second data set (1 million molecules docked against the same dopamine D4R structure Finally, in the third experiment, we compared the IFPs to "baselines" of standard molecular (ECFP, FCFP, and E3FP) and interaction (PLEC) FPs employing five data sets with increasing degrees of complexity by protein and ligand variability. We assessed how IFPbased similarity corresponded to baseline-FP similarities, as a function of differing molecular and interaction information. Secondly and most importantly, we evaluated the IFPs' capacity to identify similar vs dissimilar poses no matter the data set diversity, which is a task commonly performed in virtual screening campaigns.
We found medium to high correlations between the FPs in all data sets. In part, high correlations are consistent with the common core strategy of defining structural features (circular substructures centered around an atom or atom group) and with the similarities across atomic or pharmacophoric properties definitions. This applies especially for MFPs, for which previous studies report high cross-correlations. 59 The terms negatively ionizable and positively ionizable were chosen to indicate that a specific atom or group of atoms may be ionized. However, if necessary, one can set a different pH in order to alter the resulting classification of an atom or group.
For more details, refer to Section Physicochemical feature assignment in the Supporting Information.

Molecular interactions calculation.
For each pair of atoms or group of atoms, molecular interactions are characterized using physicochemical properties and geometrical criteria (distance and angle). For atom groups, their centroids are used in the geometrical analysis.
Besides interactions, LUNA also provides three contacts: atom overlap, proximal, and van der Waals clash. Atom overlap identifies artifacts generated by low-resolution structures and homology models, which consist of the unnatural overlap of two atoms. Proximal is an optional contact, which simply indicates that two atoms are close to each other by a specific threshold. And van der Waals clash characterizes the repulsion between two atoms when they become too close.
The combination of physicochemical features, distance, angle criteria, and geometrical models utilized to calculate interactions are presented in the Supporting Information. Although LUNA implements its own methods and criteria to calculate interactions, users can also define their own functions and cutoffs -thanks to the object-oriented style employed in LUNA.
Interaction fingerprints. We propose three novel hashed IFPs called EIFP, FIFP, and HIFP, which are inspired by ECFP, FCFP, 58 and E3FP. 59 Similar to ECFP, FCFP, and E3FP, our IFPs depend on three parameters: the FP length, the radius growth rate, and the number of levels ( Figure 1). Below, we detail how the IFPs are generated.
Generating initial identifiers. At iteration 0, initial identifiers are assigned to each atom or atom group. In EIFP, the atoms are assigned seven invariant atomic properties derived from ECFP and E3FP: the number of heavy atoms covalently bound to the atom; the valence minus the number of neighboring hydrogens; the atomic number; the atomic isotope number; the atomic charge; the number of bound hydrogens; and a flag indicating whether the atom belongs to a ring or not. For atom groups, EIFP includes the initial identifier of each atom comprising the group. As a result, EIFP precisely characterizes topological information like ECFP.
In FIFP, only physicochemical features are considered during the identifier generation.
For atoms, FIFP assigns their own physicochemical features and features from any atom group to which they belong. For atom groups, FIFP encodes only the physicochemical features of the group. For example, the chemical feature Aromatic assigned for an aromatic ring (atom group) is also included as the feature of each one of its atoms. Therefore, FIFP represents more abstract role-based sub-structural features.
Finally, HIFP is a hybrid flavor that encodes atomic invariants for atoms and physicochemical features for atom groups.
After characterizing each atom or atom group according to the IFP flavor, their information is encoded to their initial 32-bit integer identifiers using a hash function. LUNA uses a Python implementation 2 of MurmurHash3 124 hash function. However, any hash function can be applied as long as it generates uniform and random identifiers.
Subsequent identifiers update. After generating each initial identifier, the algorithm subsequently updates them through an iterative process that continues until it converges or it reaches the maximum number of levels, which is given by the parameter Number of levels ( Figure 1). At each iteration, a sphere of size R * L, where R is the radius growth and L the current level (iteration number), is centered at each atom and atom group. Their neighborhood is then characterized by capturing all interactions within the shell. A hash function is applied to the neighborhood, and a new identifier to the central atom or atom group is generated.
A single iteration for a given atom or atom group is performed as follows. First, LUNA centers a sphere of size R * L on an atom or atom group, and initializes an array of tuples with a pair consisting of the current level number and the previous identifier of this atom or atom group. Next, it captures all atoms (atom groups) inside the shell and verifies if these entities establish any interaction with the atoms or atom groups from the previous iteration.
Then, for each identified interaction, LUNA generates a tuple containing the interaction type and the previous identifier of the interacting partner. These pairs are sorted and included in the array of tuples. Note that sorting the list is essential for avoiding dependence on the order of its elements; otherwise, the FP generation would not be deterministic. of the cluster heads were randomly selected for model development and training, while the remaining 10% were set apart as a hold-out set and were not released until the model development phase of the study had been completed. Finally, to obtain diverse and stratified sets, we divided the cluster heads into 10 equal-sized bins by their DOCK3.7 score and randomly sampled 90,000 and 10,000 molecules from each bin to form the expanded training (900,000) and testing (100,000) data sets, respectively.
Using the same procedure and bin sizes, we defined another independent training sub- For LUNA and IFP hyperparameter optimization, models were trained using the training subsample (200,000 complexes) and the best DNN architecture found in previous investigations (results not shown). We explored combinations of ten different parameters (Table   S1).
For the best IFP and for the chosen alternative FP methods, models were trained using the complete training set (900,000 complexes) and the best DNN architecture for each FP was identified through hyperparameter search on the space shown in For FPs of length 16,384, a minimum of 100 epochs were run before early-stopping was activated. Besides the standard dropout strategy where the same probability was employed in all dropout units, we also applied a procedure where the dropout probability was decreased at an exponential decay rate -shallow hidden layers had a greater decay.
Finally, we assessed the best models' stability using 5-fold cross-validation for the DNN, random forest, and XGBoost methods. To guarantee reproducibility, we used an initial seed (54,343) to generate five random integer values that were used as seeds to randomly define the hold-out set and initialize the DNN models at each fold iteration. Mean and standard deviation were then calculated for the R 2 values obtained from the 5-fold cross-validation strategy.
Model interpretability. To illustrate how the out-of-the-box LUNA functionalities are suitable for creating interpretable ML models, we performed a feature attribution analysis based on the feature importance scores calculated by the XGBoost library.
The XGBoost model was trained using EIFP-4,096 (the best FP identified in the ML case study) and 80% of the training set (Section Model development and training).
To make sense of a feature and its importance, LUNA was used to assign a class to As bit collision resulting from FP folding may generate features with different information, LUNA may assign multiple classes to them. To deal with this expected artifact of hash functions, we first calculated the percentage of the most prevalent class in a given feature and used these percentages to calculate z-scores for each feature. We then filtered out features presenting z-scores higher than 1 and chose the minimum percentage (69.49) among the remaining features as the threshold to attribute a class to each feature ( Figure S35).
Thus, the most prevalent class was assigned to a feature only if its percentage was greater than or equal to the threshold. Whenever a feature position did not meet this condition, we attributed the class unreliable feature to it.
To identify the set of features that played the major contribution to the model performance, we calculated the z-scores for each feature using their importance scores. As data followed a generalized extreme value distribution ( Figure S36), we transformed z-scores to Fingerprint assessment data sets. To assess the proposed IFPs, we built five different data sets. For the first data set, we automatically generated two series of conformers for the compound 2-(allylamino)-4-aminothiazol-5-yl(phenyl) methanone (PDB id X02) in complex with the human cyclin-dependent kinase 2 (CDK2; PDB id 3QQK). Conformers were generated with the function EmbedMultipleConfs from RDKit with the parameter numConfs and pruneRmsThresh set to 10,000 and 0.1Å, respectively. The first parameter defined the number of conformers the algorithm should generate, while the second removed conformers whose distance (RMSD) to other conformers was less than 0.1Å. This pruning procedure only retained the first conformation generated. From then on, only those that were at least pruneRmsThresh away from all retained conformations were kept. After generating the conformers, we aligned them to the ligand X02 and removed the conformers whose distance to the crystal pose was higher than 0.4Å ( Figure S37a) or 0.8Å ( Figure S37b). As a result, we obtained 53 and 431 conformers, respectively.
The second data set was obtained from Schonbrunn et al.. 82 In their work, the authors proposed 95 analogs by systematically modifying the flanking allyl and phenyl moieties of the ligand X02. Structures of 36 CDK2-ligand complexes were solved through X-ray crystallography. We also found in the PDB an additional 38 related structures that were deposited by the authors, totaling 74 complexes in this data set.
The third data set was obtained from RCSB PDB in May 2020 after searching for structures that present 100% sequence identity to CDK2 (350 PDB ids in total). We then used LUNA to identify PDB structures containing exactly one ligand, excluding crystallography artifacts, ions, and cofactors (Tables S7, S8, and S9).
Finally, the fourth and fifth data sets were obtained from RCSB PDB in May 2020 through its custom report API. The former comprises non-CDK2 kinases filtered from the recovered report based on their Uniprot identifiers (1,364 PDB ids in total). The latter consists of a randomly selected diverse class of proteins except kinases (37,507 PDB ids in total). To avoid bias towards a specific receptor, we generated a random subsample of both data sets using the Spreadsubsample algorithm from Weka 131 with at most a 5:1 difference in receptor frequencies. We then used LUNA to identify PDB structures containing exactly one ligand, excluding crystallography artifacts, ions, and cofactors (Tables S7, S8, and S9).
For these data sets, only crystal structures with X-ray resolution below 2Å were retained.
Comparing LUNA fingerprints to other fingerprint methods. To compare the IFPs to other molecular (ECFP, FCFP, and E3FP) and interaction (PLEC) FPs, we computed the similarity between all pairs of complexes in data sets 1-5 (Section Fingerprint assessment data sets) using the Tanimoto coefficient (T C ). We then measured the degree of correlation between these different FPs using the Pearson correlation coefficient.
Bit FPs were generated using default parameters from each method. Interactions for the IFPs were calculated using all LUNA's default parameters. Molecular structures for the baseline methods were prepared as described in Section Preparing molecular structures for ECFP, FCFP, E3FP, and PLEC.
We also generated different flavors of EIFP and FIFP as a means to mimic the other baseline methods. ECFP-like EIFP, for instance, only contains intra-ligand covalent interactions exactly as ECFP. The complete list of features included in each IFP flavor is presented in Table 2. We easily obtained these different IFP variations through flags that control LUNA behavior or by providing filtering functions.
For each pair of conformers from data set 1, we also calculated their RMSD using RDKit.
Then, Pearson correlation coefficient between RMSD values and interaction FP similarities was measured.
To prepare the inputs for alternative FP methods, we used the same LUNA functionalities employed for the IFPs. First, we extracted compounds from each protein-ligand complex; standardized their topological structure based on their SMILES representation obtained from the Ligand Expo database; 132 properly charged the molecules considering a physiological environment; and converted them from PDB to MOL. Finally, we filtered out molecules whose number of heavy atoms differed by more than one atom compared to their expected topological structure or whose structure did not match its expected topology. The structural match between ligands was computed using the maximum common substructure algorithm (FindMCS) from RDKit.

Conclusion
Herein, we propose three novel hashed IFPs: EIFP, FIFP, and HIFP. Our FPs are inspired by ECFP, FCFP, and E3FP, and encode protein-ligand interactions and ligand environment information as either bit or count FPs. Additionally, we propose LUNA 3 , a novel object-oriented Python 3 toolkit that permits researchers to control every processing step during the analysis of a drug discovery campaign. LUNA also provides functionalities for visualizing interactions and FP features to open up the "black box" of ML methods. These characteristics make LUNA IFPs suitable for shedding light on the molecular recognition process and ML models, as well as for prioritizing compounds in virtual screening campaigns.
In this work, we also validated and demonstrated that the IFPs are capable of distinguishing ligands by pose similarity using data sets of differing size and structural diversity.
We also assessed how different feature representations impact these analyses and compared our results to existing MFPs (ECFP, FCFP, and E3FP) and to an additional IFP (PLEC).
Finally, to illustrate how LUNA FPs can be applied to binding affinity prediction, we presented a case study wherein we trained ML models on a data set containing 1 million molecules docked against the Dopamine D4 receptor to reproduce calculated Dock scores. We also applied ECFP, FCFP, E3FP, and PLEC to the same study case. Overall, EIFP-4,096 's performance was comparable to substantially longer FPs (e.g., EIFP-16,384 and PLEC-

Notes
The authors declare no competing financial interest.