Identifying protein subsets and features responsible for improved drug repurposing accuracies using the CANDO platform

Drug repurposing is a valuable tool for combating the slowing rates of novel therapeutic discovery. The Computational Analysis of Novel Drug Opportunities (CANDO) platform performs shotgun repurposing of 3,733 drugs/compounds that map to 2,030 indications/diseases by predicting their interactions with 46,784 protein structures and relating them via proteomic interaction signatures. The accuracy of the CANDO platform is evaluated using our benchmarking protocol, which assesses indication accuracies based on whether pairs of drugs associated with the same indication can be captured within a certain cutoff; this serves as a measure of the drug repurposing recovery rate. To identify subsets of proteins that exhibit the same therapeutic effectiveness as the full set, groups of 8 proteins were randomly selected and subsequently benchmarked 50 times. The resulting protein sets were ranked according to average indication accuracy, pairwise accuracy, and coverage (count of indications with non-zero accuracy). The best 50 subsets of 8 according to each metric were progressively combined into supersets after each iteration and benchmarked. These supersets yield up to a 14% improvement in benchmarking accuracy and represent a 100 to 1,000 fold reduction in the number of proteins relative to the full set. Protein supersets optimized using independent compound libraries derived from the full library were cross-tested and shown to reproduce the performance relative to using all 46,784 proteins, indicating that these reduced size supersets are broadly applicable for characterizing drug behavior. Further analysis revealed that sets composed of proteins with more equitably diverse ligand interactions are important for describing drug behavior.
Our work elucidates particular protein subsets and corresponding ligand interactions that play a role in computational drug repurposing, and paves the way for the use of machine learning approaches to further improve the accuracy of the CANDO platform and its repurposing potential.

Author summary

Drug repurposing is a valuable approach for ameliorating the current problems plaguing drug discovery. We introduce a novel protein subset analysis pipeline that allows us to elucidate features important for drug repurposing accuracies using the Computational Analysis of Novel Drug Opportunities (CANDO) platform. Our platform relates drugs based on the similarity of their interactions with a diverse library of proteins. We subjected all proteins in the platform to a splitting and ranking protocol that ranked protein subsets based on their benchmarking performance. Further analysis of the best performing protein subsets revealed that the most useful proteins for describing how small molecule compounds behave in biological systems are those that are predicted to interact with a structurally diverse range of ligands. We hypothesize that this is a consequence of the multitarget nature of drugs and, conversely, the implied promiscuity of proteins in biological systems. These results may be used to make drug discovery more accurate and efficient by alleviating some of its bottlenecks, bringing us one step further in better understanding how drugs behave in the context of their environments.

Common strategies in drug discovery include forward pharmacology [1] and rational drug design [2]. In the former, a library of compounds is screened, typically in a high-throughput manner, for certain phenotypic effects in vitro. In the latter, compounds are virtually screened against a predetermined biological target and high confidence hits are then assayed for a desired modulation. In both cases, the hits obtained are then assessed for effectiveness in vivo and proceed to clinical trials for eventual FDA approval if successful at each stage. This iterative process can cost billions of dollars and take up to 15 years per drug [3]. These approaches do not consider the promiscuity of approved drugs in the context of indications/diseases within living systems (evidenced by side effects present for all small molecule therapies [4,5]), dooming many novel therapeutics to fail. With the second-leading cause of putative drug attrition being adverse reactions [6], there is great utility in finding new uses for already approved drugs, formally known as drug repurposing or repositioning [7,8].

We have developed the Computational Analysis of Novel Drug Opportunities (CANDO) platform [9-11] to address these drug discovery challenges. One fundamental tenet of CANDO is that drugs interact with many different proteins and pathways to rectify disease states, and this promiscuous nature is exploited to relate drugs based on their proteomic signatures [9,12-15]. These signatures are typically determined via virtual molecular docking simulations that are applied to predict compound-protein interactions on a proteomic scale. Using a knowledge base of known drug-indication approvals/associations, we can identify putative drug repurposing candidates for a particular indication based on the similarity of their proteomic interaction signatures to all other drugs approved for (or associated with) that indication.
When a particular indication does not have any approved drug, the library of human use compounds present in CANDO is screened against the tertiary structures of all relevant and tractable proteins, obtained by x-ray diffraction or homology modeling from a particular organismal proteome, to suggest new treatments that maximize binding to the disease-causing proteins and minimize off-target effects. High-confidence putative drug candidates generated by CANDO using both approaches have been prospectively validated preclinically for a variety of indications, including dengue, dental caries, diabetes, hepatitis B, herpes, lupus, malaria, and tuberculosis, with 58/163 candidates yielding comparable or better therapeutic activity than standard treatments [12,13,16,17].

To date, putative drug candidates generated by CANDO have been based on simple comparison metrics, primarily the root mean square deviation (RMSD) of the binding scores present in a pair of drug-proteome interaction signatures. Our platform is evaluated using a benchmarking method that assesses per indication accuracies based on whether or not other drugs associated with the same indication can be captured within a certain cutoff in terms of similarity to a particular drug approved for that indication. Incorporating machine learning, which is continuing to prove its utility in many aspects of biomedicine [18-20] including drug discovery and repurposing [21,22], into the CANDO platform to increase benchmarking accuracies and therefore its predictive power is of importance. Various algorithms can be incorporated (for example, neural networks, support vector machines, and decision trees), but the well documented issues described by the curse of dimensionality [23,24] must be taken into account.

Each drug/compound is compared to the ligand co-crystallized with each predicted binding site using the OpenBabel FP4 fingerprinting method [37], resulting in a structural similarity score.
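The fingerprint comparison step can be illustrated with a Tanimoto coefficient over fingerprint bit sets, a standard similarity measure for 2D fingerprints such as FP4 (a minimal sketch, not OpenBabel's actual implementation):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprint bit sets:
    |intersection| / |union| of the set bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Two compounds sharing half of their combined set bits would score 0.5; identical fingerprints score 1.0.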

The score that populates each cell in the compound-protein interaction matrix is the maximum value of all of the possible binding site scores times the structural similarity scores of the associated ligand and the compound.
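The cell-filling rule described above can be sketched as follows (function and argument names are illustrative, not the platform's actual API):

```python
def interaction_score(binding_site_scores, ligand_similarities):
    """Score for one compound-protein pair: the maximum over all predicted
    binding sites of (binding site score * structural similarity between
    the compound and that site's co-crystallized ligand)."""
    return max(score * sim
               for score, sim in zip(binding_site_scores, ligand_similarities))
```

For example, a weak site score paired with a highly similar ligand can dominate a strong site score paired with a dissimilar one.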

Benchmarking protocol and evaluation metrics

The compound-compound similarity matrix is generated using the root mean square deviation (RMSD) calculated between every pair of compound interaction signatures (the vector of 46,784 real-value interaction scores between a given compound and every protein in the library). Two compounds with a low RMSD value are hypothesized to have similar behavior [9-11,13,15]. For each of the 1,439 indications with two or more associated drugs, the leave-one-out benchmark assesses accuracies based on whether another drug associated with the same indication can be captured within a certain cutoff of the ranked compound similarity list of the left-out drug. This study primarily focused on a cutoff of the ten most similar compounds ("top10"), the most stringent cutoff used in previous publications [9-11,13,15]. The benchmarking protocol calculates three metrics to evaluate performance: average indication accuracy, pairwise accuracy, and coverage. The per indication accuracy is calculated as c/d x 100, where c is the number of drugs for which another drug associated with the same indication is captured within the cutoff (top10 in this study) and d is the number of drugs approved for that given indication.
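A minimal sketch of the RMSD comparison and the per indication leave-one-out accuracy, assuming toy dictionaries in place of the full 3,733 x 46,784 interaction matrix:

```python
import math

def rmsd(sig_a, sig_b):
    """Root mean square deviation between two interaction signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sig_a, sig_b)) / len(sig_a))

def indication_accuracy(signatures, drug_ids, cutoff=10):
    """Leave-one-out accuracy for one indication: the percentage of its
    drugs for which at least one other drug associated with the same
    indication ranks within `cutoff` of the similarity-sorted list."""
    captured = 0
    for d in drug_ids:
        ranked = sorted((rmsd(signatures[d], signatures[o]), o)
                        for o in signatures if o != d)
        top = {o for _, o in ranked[:cutoff]}
        if top & (set(drug_ids) - {d}):
            captured += 1
    return 100.0 * captured / len(drug_ids)
```

Signatures with low RMSD rank near the top of each other's lists, so drugs sharing an indication are "captured" when their signatures resemble each other.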

Pairwise accuracy is the weighted average of the per indication accuracies based on how many drugs are approved for a given indication. Coverage is the count of the number of indications with non-zero accuracies within the top10 cutoff.
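The three evaluation metrics can then be aggregated from the per indication accuracies, for example:

```python
def aggregate_metrics(per_indication_acc, drugs_per_indication):
    """per_indication_acc: {indication: accuracy in percent};
    drugs_per_indication: {indication: number of associated drugs}.
    Returns (average indication accuracy, pairwise accuracy, coverage)."""
    n = len(per_indication_acc)
    avg = sum(per_indication_acc.values()) / n
    # Pairwise accuracy weights each indication by its drug count.
    total_drugs = sum(drugs_per_indication.values())
    pairwise = sum(per_indication_acc[i] * drugs_per_indication[i]
                   for i in per_indication_acc) / total_drugs
    # Coverage counts indications with any non-zero accuracy.
    coverage = sum(1 for a in per_indication_acc.values() if a > 0)
    return avg, pairwise, coverage
```

Pairwise accuracy rewards performance on indications with many approved drugs, while coverage rewards breadth across indications.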

Superset creation and benchmarking

The 46,784 proteins in the CANDO platform were randomly split into 5,848 subsets of 8 and subsequently benchmarked using the method described above. The size of 8 was selected because it offered the widest range of benchmarking values (relative to larger sizes), reduced the computational cost of the experiments (relative to smaller sizes that increase the number of individual benchmarks that need to be evaluated), divided 46,784 evenly, and also provided adequate signal for the multitargeting approach to work according to our prior studies [12]. A total of 50 iterations were performed, resulting in 292,400 benchmarking experiments. Each subset was then ranked according to top10 average indication accuracy, pairwise accuracy, and coverage. The fifty best performing subsets from each ranking criterion were progressively combined into supersets and benchmarked after each of the 50 iterations of the splitting and ranking protocol. The subsets were nonredundantly combined, such that a protein represented in more than one of the best performing subsets appeared only once in the resulting superset.

The best performing protein subsets and supersets were further analyzed to elucidate the protein feature(s) responsible for increased benchmarking performance. The protein subsets and supersets were analyzed based on four criteria: organismal source, structure source (x-ray diffraction or modeling), fold space (based on the CATH classification of proteins [38]), and interacting ligand structure distributions. The subsets and supersets were analyzed by counting the specific organisms to which the proteins belonged to see if any were over- or underrepresented in the best and worst performing sets.
Similarly, the subsets and supersets were analyzed to see if structures obtained via a specific source, x-ray diffraction or modeling, were differentially represented in the best and worst performing sets. Fold assignments were made for each protein in the subsets and supersets, which were again analyzed for differential representation of specific protein folds. Finally, since our compound-protein interaction scoring method utilized the structural similarity of each drug/compound to the ligand co-crystallized with the protein (see previous section), we analyzed these ligands for differential representation. Each co-crystallized ligand in the COFACTOR database of template binding sites was clustered at various distances (0.1 to 0.9 with increments of 0.1).

The benchmarking performance for each superset within a given iteration tends to gradually increase as the number of iterations increases (Fig 1). Average indication accuracy is the best ranking criterion through ten iterations, after which coverage is superior, especially when measuring the coverage metric.

Sorting the supersets by size reveals that at least 80-120 proteins are required to reach optimal benchmarking performance (Fig 2). Nonredundantly combining the worst performing subsets of 8 into subpar sets demonstrates worse performance than the control value for the average indication accuracy based on using the full library, with the mean values of these subpar set distributions being below the acceptable 5% threshold of 11.6% for all sizes benchmarked. The random set and subpar set distributions begin to converge toward the average indication accuracy control value as size increases (Fig 2).
This result demonstrates that the splitting and ranking protocol can produce supersets with benchmarking performance superior to using the full protein library by combining the best performing subsets with a vastly reduced number of proteins (100 to 1,000 fold reduction in size), further suggesting that specific groups of proteins are relatively more useful for drug repurposing accuracy.
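A toy sketch of the splitting and ranking protocol, with `benchmark` standing in for a full benchmarking run and sizes reduced for illustration:

```python
import random

def splitting_and_ranking(proteins, benchmark, subset_size=8,
                          n_best=50, iterations=50):
    """Repeatedly shuffle the protein library into subsets, rank the
    subsets with `benchmark` (higher is better), and nonredundantly fold
    the best performers into a growing superset."""
    superset, seen = [], set()
    for _ in range(iterations):
        shuffled = random.sample(proteins, len(proteins))
        subsets = [shuffled[i:i + subset_size]
                   for i in range(0, len(shuffled), subset_size)]
        ranked = sorted(subsets, key=benchmark, reverse=True)
        # A protein appearing in several top subsets is added only once.
        for subset in ranked[:n_best]:
            for p in subset:
                if p not in seen:
                    seen.add(p)
                    superset.append(p)
    return superset
```

In the study itself, `benchmark` is the full top10 benchmarking run and the ranking criterion is one of the three metrics; the sketch only captures the shuffle/rank/merge loop.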

Cross-testing with independent compound libraries
For the independent compound library experiment, average indication accuracy was chosen as the ranking criterion because it performed the best in all three metrics through ten iterations in the superset experiment (Fig 1). Box and whisker plots for each compound library show the spread of the benchmarking performance for each metric generated from the protein supersets obtained through the splitting and ranking protocol of their complementary compound library (Fig 3). Supersets smaller than size 80 were excluded because there is a minimum number of protein features required to reach optimal benchmarking performance (Fig 2). The control value for the given metric fell within the inter-quartile range or below in 26 cases (87%), while it was within the upper quartile in the remaining four cases (13%), indicating that these values always fell within the range of the accuracy/coverage distributions; the control value was never an extreme outlier.

Fig 2. Superset, subpar set, and random set average indication accuracies sorted by size. Average indication accuracies are shown for the supersets (blue) generated using the best subsets ranked by the same metric. The line traces the mean score for each size, with the bars indicating one standard deviation for the distribution. Subpar sets (orange) are the combinations of the worst performing subsets ranked by average indication accuracy. Randomly selected protein sets (green) of each size were also generated and benchmarked. The control value based on using the full protein library (dashed black at 12.3%) and an acceptable 5% threshold (dotted black at 11.6%) are plotted for reference (i.e., any protein set that benchmarks within 95% of the control value is considered acceptable). For the random sets and supersets, the performance in terms of average indication accuracy begins to plateau around 80-120 proteins. The supersets begin to slightly decline in performance after 32-33 subsets (256-264 proteins).
The mean subpar set accuracies at each size all fall outside the 5% acceptable threshold, while the superset distributions are well above the control value with as few as five subsets. The difference between the superset and subpar set performance suggests that there is a particular distribution of features within their proteins that is correlated with benchmarking performance. This demonstrates the ability of the supersets to describe compound behavior more effectively than the full protein library.

Ligand clustering and feature-based creation of protein libraries
The protein subsets and supersets were analyzed based on four criteria to elucidate the feature(s) responsible for benchmarking performance: organismal source, structure source (x-ray diffraction or modeling), fold space, and interacting ligand structure distributions. There were no significant correlations found for the first three criteria; no organism(s) or fold(s) was consistently represented in the best performing sets.

Figure: The solid blue and solid orange points are averages of the 50 best and worst subsets, respectively. Dashed lines represent an example of a superset (blue) and a subpar set (orange), which are nonredundant combinations of the best and worst performing subsets, respectively. The control set (dashed black), representing the full protein structure library, falls in between the superset and subpar set. The black diamonds indicate that the difference in the distributions of counts at that cluster rank between the best and worst performing subsets, assessed using Welch's t-test, is significant (p-value < 0.05). The subsets and supersets with the best performance demonstrate a more equitable distribution of interactions among ligand clusters as opposed to the worst performing subsets and subpar sets, indicating that using multitargeting proteins to compose our structure libraries yields superior benchmarking performance.
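Welch's t-test, used above to compare the count distributions at each cluster rank, does not assume equal variances between the two samples. A self-contained sketch of the statistic (in practice a library routine such as SciPy's `ttest_ind` with `equal_var=False` would supply the p-value as well):

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples with
    possibly unequal variances (Welch-Satterthwaite approximation)."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

The resulting t and degrees of freedom are compared against the t-distribution to obtain the p-value used for the significance markers.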
Based on the subset analysis, it was hypothesized that proteins having a more diverse and equitable distribution of interactions will benchmark better. To control the ligand interaction diversity allowed per protein, we applied an upper cutoff filter for the number of ligand clusters allowed (Fig 5); all libraries were limited to size 80, which is the minimum required to reach optimal performance based on data in Fig 2. There is an optimal upper limit of ≈20-40 ligand clusters with regard to the benchmarking, indicating that using too many ligand clusters to describe a protein is undesirable for characterizing drug/compound behavior. Based on the data in Fig 5, we created protein libraries of various sizes using an upper cutoff of 40 (Fig 6). We begin to consistently recapture the benchmarking performance of the full library (within 5% error) with as few as 60 proteins, whereas libraries composed of the highest variance proteins perform far below the acceptable threshold.

The splitting and ranking protocol was originally intended to find a protein subset that benchmarked at least as well as the full set. The improvement of the benchmarking performance is an encouraging sign for incorporating machine learning in the CANDO platform in the future, and for discovering how more complex weighting and relating of proteins contribute to drug repurposing accuracy, which is difficult to do with simple RMSD calculations. The smaller-sized protein libraries generated as part of this study, representing a 100 to 1,000 fold reduction in size, will be more conducive to machine learning. Feature reduction through the use of neural network based auto-encoders or principal component analysis will provide an important contrast to our proposed method.
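The feature-based library construction described above (an upper cutoff on distinct ligand clusters, favoring proteins with equitable interaction distributions) might be sketched as follows; all names are illustrative:

```python
import statistics

def build_library(cluster_counts, size=80, min_clusters=2, max_clusters=40):
    """cluster_counts: {protein: [interaction count per ligand cluster]}.
    Keep proteins that are diverse but not overly promiscuous, then rank
    by equitability: lower variance across cluster counts means a more
    even spread of interactions."""
    eligible = {
        p: counts for p, counts in cluster_counts.items()
        if min_clusters <= len(counts) <= max_clusters
    }
    ranked = sorted(eligible, key=lambda p: statistics.variance(eligible[p]))
    return ranked[:size]
```

A protein like panel A below (many clusters, evenly spread) would rank ahead of one like panel B (two clusters, one dominant), and a protein mapping to more than `max_clusters` clusters would be excluded outright.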

The independent compound library experiment demonstrated that optimized protein sets based on a particular library were capable of therapeutically characterizing a completely different one, indicating that these supersets are generalizable. In other words, if a new drug/compound is added to the CANDO putative drug library, these reduced size supersets are likely able to describe its behavior at least as well as using every protein available. In addition to facilitating machine learning, our findings suggest a greatly reduced time required to generate new proteomic interaction vectors, which is particularly important if the program/protocol of choice for generating interactions is computationally expensive. Any repurposing candidates suggested from using the supersets are on average more clinically relevant, as they were able to recapture drug behavior more accurately than using the full protein library in a statistically significant manner.

The knowledge that ligand cluster interaction diversity is the key requisite feature in describing drug behavior may also lead to optimization strategies other than the computationally intensive splitting and ranking protocol used in this study.

Fig 5. A minimum of two ligand clusters was required to avoid the trivial case of only one cluster mapped (with a variance of zero). An upper ligand cluster count cutoff was applied to these protein libraries to determine the effect on benchmarking performance. The blue line traces the average indication accuracy distribution mean from benchmarking 50 libraries at each cutoff using the top ranked proteins. Similarly, the orange line traces the average indication accuracy distribution mean from benchmarking 50 libraries at each cutoff using the highest variance proteins. The bars indicate one standard deviation for the distribution. Using too small (< 20) or too large (> 40) of a cutoff results in suboptimal benchmarking performance. The high variance libraries result in average performance far below the acceptable 5% error range, with the cluster cutoff seemingly having no effect on average indication accuracy, as all cutoffs produced comparable distributions. All average indication accuracies produced using an upper cutoff of 20-40 ligand clusters were within the acceptable 5% error range, with the upper cutoff of 40 ligand clusters producing the greatest mean accuracy. This result indicates that there is an optimal range of ligand cluster interactions to best describe therapeutic behavior.

Repurposing candidates are more likely to be therapeutically relevant if the proteins used to describe the behavior of a compound are diverse in terms of the structures of the ligands which interact with their binding sites. Protein libraries with fewer predicted ligand cluster binding partners yield much worse performance than those consisting of proteins interacting with a more structurally diverse range of ligands.
Fig 6. Using too small of a size (< 60) results in suboptimal benchmarking performance. Creating libraries from proteins with the highest variance results in performance on average far below the acceptable 5% error range, although size does have a positive correlation with performance for these high variance sets. This result reiterates that there is a minimum number of proteins required to reach optimal benchmarking performance and that proteins with high variance in their ligand cluster signatures are far inferior for describing drug/compound behavior.

Coupling this with the finding that there is a minimum number of proteins required to reach optimal benchmarking accuracies (Fig 2), which was also observed by us previously [10,11], drugs should realistically be described in the
context of their multitarget nature, treating both small molecule compounds and proteins promiscuously, as in biological systems [43-45]. However, using libraries of proteins with too many diverse interactions in the CANDO platform also leads to suboptimal performance. We hypothesize this can be attributed to two factors: 1) spreading a compound interaction signature across too many (50 or more) ligand clusters can potentially dilute the therapeutic signal relative to the promiscuity of the corresponding proteins; or 2) these proteins are not therapeutically relevant and are therefore not useful for specifically describing drug behavior.

Figure: Visualization of the best and worst protein types for benchmarking performance based on ligand cluster signatures. Centroids of the top five ligand clusters from the signatures of each protein are depicted. The percent of interactions belonging to each ligand cluster is next to its respective centroid. Surface representations of the proteins were made using Chimera [42], with the predicted binding site residues from COFACTOR for each ligand shown (excluding the smaller ligands in C) colored in blue. A) Alkaline serine protease KP-43 from Bacillus subtilis: the top five ligand clusters account for 46.3% of the total interactions, with the distribution between them being relatively equitable. B) SET domain of human histone-lysine N-methyltransferase: only two ligand clusters are predicted to interact, with one having over 98% of the total interactions. C) Human STE20-related kinase adapter protein beta: the ligand cluster signature is too promiscuous, with the top five ligand clusters accounting for only 18.3% of the total interactions; the remaining sixteen ligands surrounding the protein account for 28.7%, which combined with the top five clusters is as much as the 46.3% of the total interactions shown in A from only five clusters.
Subsets and supersets composed of proteins similar to A outperform those composed of proteins similar to B and C in benchmarking, indicating that moderately promiscuous proteins with equitable ligand cluster signatures are the best therapeutic descriptors.

We have developed an integrated pipeline that allows for the elucidation of proteins and their features that are important for benchmarking in the CANDO platform, and therefore important for drug repurposing. We are able to reproduce the performance of the complete CANDO protein structure library with orders of magnitude fewer proteins, allowing for more rapid candidate generation when evaluating new putative drug libraries or any other changes to the platform. We discovered that moderately