DrugRepo: A novel approach to repurpose a huge collection of compounds based on chemical and genomic features

The drug development process takes 9-12 years and costs approximately one billion US dollars. Because of the high financial and time costs of the traditional drug discovery paradigm, repurposing old drugs to treat cancer and rare diseases is becoming popular. Computational approaches are mainly data-driven and involve a systematic analysis of different data types leading to the formulation of repurposing hypotheses. This study presents a novel scoring algorithm based on chemical and genomic data types to repurpose a vast collection of compounds for 674 cancer types and other diseases. The data types used to design the scoring algorithm are chemical structures, drug-target interactions (DTI), pathways, and disease-gene associations. The repurpose scoring algorithm is strengthened by integrating the most comprehensive manually curated datasets for each data type. More than 100 of our repurposed compounds can be matched with ongoing studies in the clinical trials database (https://clinicaltrials.gov/). Our analysis is supported by a web tool available at http://drugrepo.org/.

The average cost of developing a new drug is billions of dollars, and it takes about 9-12 years to bring a new drug to the market [1]. Finding new uses for approved drugs has become a primary alternative strategy for the pharmaceutical industry. This practice, usually referred to as drug repositioning or drug repurposing, is highly attractive because of its potential to speed up the process of drug development, reduce costs, and provide treatments for unmet medical needs [2].
In this regard, compounds that have passed through phases I or II in the drug discovery pipeline but never made it to the market due to efficacy issues carry great potential for drug repositioning. Traditionally, drug repurposing success stories have mainly resulted from largely opportunistic and serendipitous findings [3] . For example, sildenafil citrate was originally developed as an antihypertensive drug but later repurposed by Pfizer and marketed as Viagra to treat erectile dysfunction based on retrospective clinical experience, leading to massive worldwide sales. Other examples of such drug repositioning include cancer drugs: crizotinib, sorafenib, azacitidine and decitabine, all of which failed to reach the markets in their initial indications yet now are essential tools in the treatment of various types of cancers [4] .
In recent years, various computational resources have been developed to support systematic drug repurposing. Popular information sources for in-silico drug repurposing include, for instance, electronic health records, genome-wide association analyses, gene expression response profiles, pathway mappings, compound structures, target binding assays and other phenotypic profiling data [3]. Several systematic review articles on computational repurposing approaches are available, including reviews that cover machine learning (ML) algorithms [5] [6] [7].
Several databases directly support in-silico drug repurposing, including Drug Repurposing Hub [8] , repoDB [9] and RepurposeDB [10] . On the other hand, hundreds of databases can indirectly support drug repurposing [7] [11] . However, these databases provide experimentally tested indications only for a limited number of investigational or approved compounds and ignore the massive number of preclinical compounds that could be potential candidates for drug repurposing. Drug target profiles for approximately two million such preclinical compounds are available at ChEMBL [12] and other databases.
Drug-target interactions (DTI), meaning the target molecules each compound binds to, the relative binding strength and the impact on cellular functions, lie at the heart of drug discovery and repositioning. Several artificial intelligence (AI) methods for drug repurposing are based on DTIs as well as chemical structural similarities [13] [14] [15]. However, these methods are applied only to a selected set of compounds, resulting in limited prediction outcomes [13]. Computational approaches are primarily data-driven and involve a systematic analysis of several components (or data types) before suggesting a repurposed indication. These components may include chemical structures, adverse event profiles, compound-target interactions, pathways, disease-gene associations, and genomic, proteomic, and transcriptomic information. Drug repurposing methods can be developed based on an individual component or on a combination of these components.
In this study, we propose DrugRepo (http://drugrepo.org/), a novel scoring algorithm that can effectively repurpose hundreds of thousands of compounds based on three components: 1) overlapping compound-targets score (OCTS), 2) structure similarity score based on the Tanimoto coefficient (TC), and 3) compound-disease score (CDS). The DrugRepo score is computed between the approved drug (for a particular disease) and a candidate compound and is the average of the three component scores. Approved indications for 674 diseases and 1,092 compounds are collected from https://clinicaltrials.gov/. To explore the translational impact of DrugRepo, we cross-referenced candidate compounds with completed clinical trials at https://clinicaltrials.gov/. We observed that 186 compounds are explored in different clinical studies across nine cancer types. We also compared our candidate compounds with the predicted compound-disease relationships in the Comparative Toxicogenomics Database (CTD) [16] and found a statistically significant overlap. These findings demonstrate the versatility of DrugRepo, which provides a quick and effective scoring method for drug repurposing.

MATERIALS
Several types of data are integrated into this analysis, e.g., approved drug indications, compound-target profiles, disease-gene associations, and protein-protein interaction (PPI) networks. These datasets are explained in the following subsections.

Approved drug indications
Approved drug indications are extracted from the clinical trials database (https://clinicaltrials.gov/), as it is the most up-to-date repository for drug indications and clinical phases of the compounds. However, the data provided by clinical trials is not well structured and does not provide standard naming conventions or identifiers for the compounds and diseases. We therefore used a semi-automated approach to extract drug-disease indications and assigned standard InChIKey and UMLS CUI identifiers to drugs and diseases, respectively. The standard InChIKey mapping is performed using the PubChem Python client (https://pubchempy.readthedocs.io/en/latest/), whereas UMLS CUIs are assigned to the diseases using disease annotations provided by DisGeNET [17]. Finally, we extracted data for 674 diseases, 1,092 drugs, and 3,868 approved drug indications, as shown in Supplementary file 1.
These comprehensive datasets are extracted using application programming interfaces (APIs), standalone text files, and SQL dumps. The first three databases, ChEMBL, BindingDB and GtopDB, provide quantitative bioactivity data, such as measurements in terms of IC50, Kd, and Ki, whereas DrugBank and DGiDB contain unary but experimentally verified compound-target interactions. In addition to active or potent compound-target profiles, ChEMBL, BindingDB and GtopDB contain a large proportion of inactive compound-target profiles (concentration > 10,000 nM). These inactive profiles could distort the analysis. Therefore, we considered only potent compound-target profiles (concentration ≤ 1000 nM) [26]. This left us with 788,078 compounds and 8,754 protein targets. Potent target profiles for these ~0.8M compounds are already integrated and publicly available in MICHA (https://micha-protocol.org/) [27].
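The potency filter described above can be illustrated with a minimal sketch. The record layout (`Bioactivity`), the function name `potent_profiles`, and the toy compound and target labels are assumptions made for illustration only; they are not taken from the DrugRepo code base. Only the 1000 nM cutoff comes from the text.

```python
from typing import Dict, Iterable, NamedTuple, Set

POTENCY_CUTOFF_NM = 1000.0  # cutoff stated in the text (concentration <= 1000 nM)


class Bioactivity(NamedTuple):
    """One quantitative measurement (hypothetical record layout)."""
    compound: str    # e.g. a standard InChIKey
    target: str      # e.g. a protein identifier
    value_nm: float  # IC50, Kd or Ki expressed in nM


def potent_profiles(records: Iterable[Bioactivity],
                    cutoff_nm: float = POTENCY_CUTOFF_NM) -> Dict[str, Set[str]]:
    """Map each compound to the set of targets it hits at <= cutoff_nm."""
    profiles: Dict[str, Set[str]] = {}
    for rec in records:
        if rec.value_nm <= cutoff_nm:
            profiles.setdefault(rec.compound, set()).add(rec.target)
    return profiles


# Toy data: cpd_B has only an inactive measurement and is therefore dropped.
records = [
    Bioactivity("cpd_A", "ABL1", 12.0),
    Bioactivity("cpd_A", "KIT", 1000.0),
    Bioactivity("cpd_A", "EGFR", 25000.0),
    Bioactivity("cpd_B", "EGFR", 15000.0),
]
profiles = potent_profiles(records)
```

Compounds without any potent target do not appear in the result, which mirrors the requirement that a candidate compound has at least one potent target.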

Disease-gene associations and PPI networks
To support the large-scale drug repurposing, we integrated manually curated disease-gene associations from DisGeNET [17] . There are 9,703 genes, 11,181 diseases and 84,038 associations. These curated disease-gene associations are provided in Supplementary file 2.

METHODS
There are 788,078 compounds for which at least one potent target (concentration ≤ 1000 nM) exists in any of the five DTI databases. We call these agents candidate compounds to be repurposed. For each candidate compound, the DrugRepo score is calculated as the average of three component scores, OCTS, TC, and CDS, which are derived by comparing each candidate compound to 1,092 approved drugs (Figure 1). Because the number of calculated scores (788,078 x 1,092) is too large for the web portal to handle smoothly, we considered only those cases where the structural similarity between the approved drug and the candidate compound is ≥ 0.2. This left us with 2,207,367 scores.

Figure 1: First, the pipeline finds the approved drug(s) for the selected disease and searches for structurally similar compounds. In this step, the Tanimoto coefficient (TC) describes the structural similarity between molecular fingerprints (ECFP4) of approved and candidate compounds. A threshold is used to favor similar molecular structures. The second step is to compute DTI profiles for candidate compounds and approved drugs. The OCTS is the score based on overlapping DTIs between approved and candidate compounds. In case of multiple approved drugs for a disease, the average of the OCTS and TC scores is taken. The third step is to compute the compound-disease score (CDS). The CDS is the average of the minimum distances in the PPI networks between target molecules and molecules associated with the selected disease. The average distance is normalized to 0-1. Finally, the DrugRepo score is calculated as the average of the three component scores. The higher the DrugRepo score between the approved drug and the candidate compound, the higher the possibility of repurposing the compound for a particular disease. Finally, we developed the DrugRepo GUI to provide a user-friendly service for repurposing drugs with our pipeline.
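The final scoring step can be sketched as follows. This is a minimal illustration, assuming the three component scores are already computed and scaled to 0-1; the function names and the toy candidate labels are hypothetical. The 0.2 structural-similarity cutoff and the averaging rule come from the text.

```python
TC_CUTOFF = 0.2  # structural-similarity cutoff stated in the text


def drugrepo_score(octs: float, tc: float, cds: float) -> float:
    """Average of the three component scores, each expected in [0, 1]."""
    for name, value in (("OCTS", octs), ("TC", tc), ("CDS", cds)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return (octs + tc + cds) / 3.0


def score_pairs(pairs, tc_cutoff: float = TC_CUTOFF):
    """Score (approved, candidate, octs, tc, cds) tuples.

    Pairs whose TC is below the cutoff are skipped; the remaining pairs are
    returned as (approved, candidate, score), sorted by descending score.
    """
    scored = [
        (approved, candidate, drugrepo_score(octs, tc, cds))
        for approved, candidate, octs, tc, cds in pairs
        if tc >= tc_cutoff
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Toy example: cand_2 is dropped because its TC (0.1) is below the cutoff.
pairs = [
    ("imatinib", "cand_1", 0.5, 0.8, 0.5),
    ("imatinib", "cand_2", 0.9, 0.1, 0.9),
    ("nilotinib", "cand_3", 0.25, 0.25, 0.25),
]
ranked = score_pairs(pairs)
```

Filtering on TC before scoring keeps the number of stored pairs manageable, which is the motivation given above for the cutoff.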
The OCTS between approved and candidate compounds is computed using equation 1. The OCTS ranges from 0 to 1 and represents the proportion of targets shared between an approved drug and a candidate compound. Candidate compounds sharing more targets with the approved drug will have higher OCTS values. In equation 1, Compound_a and Compound_c are the sets of potent targets of the approved drug and the candidate compound, respectively. Similarly, |Compound_a| and |Compound_c| are the total numbers of targets associated with the approved drug and the candidate compound, respectively.
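A target-overlap score of this kind can be sketched as below. The exact form of equation 1 is not reproduced in this text, so the sketch assumes a Jaccard-style ratio (shared targets divided by the union of both target sets); this choice of denominator is an assumption and may differ from the formula used by DrugRepo. The function name and the target labels are illustrative.

```python
def overlap_score(approved_targets, candidate_targets) -> float:
    """Shared-target proportion between two target sets, in [0, 1].

    Assumption: Jaccard index |A & C| / |A | C|. The published OCTS formula
    may normalize differently.
    """
    approved = set(approved_targets)
    candidate = set(candidate_targets)
    union = approved | candidate
    if not union:
        return 0.0  # no known targets on either side
    return len(approved & candidate) / len(union)


# Two of the four distinct targets are shared.
octs_example = overlap_score({"ABL1", "KIT", "PDGFRA"}, {"ABL1", "KIT", "SRC"})
```

Whatever the normalization, the score is 0 when no targets are shared and 1 when the two target sets are identical, matching the range stated above.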
Compound-target profiles are extracted from five databases. The number of overlapping compounds and targets in these five databases is shown in Figure 2A.

Drug repurposing is challenging because of shortcomings in data coverage. Diseases associated with a large number of approved drugs may have better chances of correct repurposing, as the number of candidate compounds will also be larger.
However, only very few diseases are associated with a large number of approved drugs. HIV is associated with the highest number of approved drugs (n = 103), but as shown in Figure 2C, more than 70% of diseases have fewer than five approved drugs. The lack of drug-target interactions is also a hurdle, as it limits the matching of compounds by their putative mechanism of action. Indeed, most approved drugs have fewer than 30 targets (Figure 2D). To compensate for the shortage of approved drugs and drug-target interactions, we incorporated two additional components in the DrugRepo pipeline: the Tanimoto coefficient (TC), which is a structural similarity score, and the compound-disease score (CDS), which ranks new compounds based on how closely their target spaces match the target proteins that are associated with the disease.
The Tanimoto coefficient (TC) measures the structural similarity between the molecular ECFP4 fingerprints of the approved drug and the candidate compound for a particular disease. The fingerprints are bit strings denoting the presence or absence of chemical substructures and are calculated using the RDKit package [29].

$$TC = \frac{N_{ac}}{N_a + N_c - N_{ac}}$$

where $N_a$ and $N_c$ are the numbers of substructures present in the approved drug and the candidate compound, respectively, and $N_{ac}$ is the number of common substructures found in both the approved drug and the candidate compound. The TC ranges from 0 to 1 and constitutes the second component of DrugRepo.
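The Tanimoto coefficient can be illustrated with a small, dependency-free sketch in which each fingerprint is represented as the set of indices of its "on" bits. The function name and the toy bit indices are illustrative assumptions; real ECFP4 fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits.

    TC = common / (n_a + n_b - common), i.e. intersection over union.
    """
    a, b = set(bits_a), set(bits_b)
    common = len(a & b)
    denominator = len(a) + len(b) - common
    if denominator == 0:
        return 0.0  # both fingerprints are empty
    return common / denominator


# 3 shared bits, 4 + 5 - 3 = 6 bits in the union.
approved_bits = {1, 5, 9, 42}
candidate_bits = {5, 9, 42, 77, 101}
tc_example = tanimoto(approved_bits, candidate_bits)
```

In practice, RDKit can produce ECFP4-like fingerprints with `AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)` (Morgan radius 2 corresponds to ECFP4) and compare them with `DataStructs.TanimotoSimilarity`; the bit-vector length of 2048 is a common choice and not a value stated in the text.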
The compound-disease score (CDS) is the third component:

$$CDS = \frac{1}{|G|} \sum_{g \in G} \min_{g' \in G'} d(g, g')$$

where $G = (g_1, g_2, \ldots)$ is the set of gene targets of the candidate compound, $G' = (g'_1, g'_2, \ldots)$ is the set of genes associated with a particular disease (acquired from DisGeNET), and $d(g, g')$ is the distance between $g$ and $g'$ in the PPI network. The average of the minimum distances between $G$ and $G'$ is computed in the PPI networks. The average distance is further normalized to 0-1 using min-max normalization. Finally, the DrugRepo score is the mean of the three component scores and ranges from 0 to 1. The higher the DrugRepo score between an approved drug (for a particular disease) and a candidate compound, the greater the repurposing potential of the candidate compound for that disease.
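The distance-based part of the CDS can be sketched with a breadth-first search on an unweighted PPI graph. This is a minimal illustration under stated assumptions: the network is given as an adjacency dictionary, distances are hop counts, and targets that cannot reach any disease gene are ignored. The function names and the toy network are hypothetical. Whether the normalized distance is inverted so that closer compounds receive higher scores is not stated in the text, so the sketch leaves the normalized value as is.

```python
from collections import deque


def bfs_distances(graph, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def mean_min_distance(graph, compound_targets, disease_genes):
    """Average, over compound targets, of the distance to the closest disease gene."""
    minima = []
    for target in compound_targets:
        dist = bfs_distances(graph, target)
        reachable = [dist[gene] for gene in disease_genes if gene in dist]
        if reachable:  # assumption: unreachable targets are skipped
            minima.append(min(reachable))
    return sum(minima) / len(minima) if minima else None


def min_max(values):
    """Min-max normalization of a list of numbers to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy PPI network: a simple chain g1 - g2 - g3 - g4 - g5.
ppi = {
    "g1": ["g2"],
    "g2": ["g1", "g3"],
    "g3": ["g2", "g4"],
    "g4": ["g3", "g5"],
    "g5": ["g4"],
}
# g1 is 3 hops from its closest disease gene (g4); g3 is 1 hop away.
raw_cds = mean_min_distance(ppi, ["g1", "g3"], ["g4", "g5"])
normalized = min_max([1.0, 2.0, 3.0])
```

Min-max normalization needs the raw values of all compounds for a disease, so it is applied to the full list of averages rather than to a single compound.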

RESULTS AND DISCUSSIONS
To explore the translational impact of DrugRepo, we evaluated our repurposed compounds using two methods: 1) we cross-referenced thousands of the repurposed compounds with disease-compound associations in CTD [16], and 2) we matched 186 compounds across nine cancer types for which either Phase I or Phase II trials have been completed or Phase III trials are ongoing.

4.1. Matching repurposed compounds with disease-compound associations in CTD
The Comparative Toxicogenomics Database (CTD) contains manually curated and inferred compound-disease relationships [16]. CTD associates thousands of compounds with diseases based on drug-target and disease-gene relationships. Our scoring method and datasets are different from those of CTD, but the output types in DrugRepo and CTD are the same. We therefore assessed the accuracy of DrugRepo by comparing the repurposed compounds with the compound-disease relationships in CTD. We downloaded disease-compound relationships from CTD and compared them with the DrugRepo candidate compounds; the results are shown in Figure 3A (N_expected is shown with blue bars and N_overlap with red bars).

To investigate the effect of structural similarity on drug repurposing, we evaluated the matched repurposed compounds at five different thresholds (TC: 0.5, 0.6, 0.7, 0.8, 0.9). As shown in Figure 3B, the number of matched repurposed compounds tends to decrease with stricter TC filtering of repurposed compounds, as expected. However, the significance scores are also reduced, especially at TC ≥ 0.9, suggesting that high structural similarity is not a determining factor for drug repurposing. Many of the matched repurposed compounds are located at TC ≥ 0.5. On the other hand, the number of matched compounds and the significance scores are relatively stable for 0.6 ≤ TC ≤ 0.7. Therefore, TC values between 0.6 and 0.7 might be optimal for drug repurposing.
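The comparison between an observed overlap and the overlap expected by chance can be sketched as follows. The recoverable text does not state how the significance score is computed, so the hypergeometric tail probability used here is an assumption (it is a common choice for overlap tests), and the function names and toy numbers are illustrative.

```python
from math import comb


def expected_overlap(n_universe: int, n_ctd: int, n_drugrepo: int) -> float:
    """Expected overlap if n_drugrepo compounds were drawn at random."""
    return n_ctd * n_drugrepo / n_universe


def hypergeom_pvalue(n_universe: int, n_ctd: int,
                     n_drugrepo: int, n_overlap: int) -> float:
    """P(X >= n_overlap) for a hypergeometric draw without replacement.

    Assumption: this test stands in for the unspecified significance score.
    """
    upper = min(n_ctd, n_drugrepo)
    total = comb(n_universe, n_drugrepo)
    tail = sum(
        comb(n_ctd, k) * comb(n_universe - n_ctd, n_drugrepo - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 compounds overall, 4 linked to the disease in CTD,
# 5 proposed by the scoring method, and all 4 CTD compounds recovered.
n_expected = expected_overlap(10, 4, 5)
p_value = hypergeom_pvalue(10, 4, 5, 4)
```

An observed overlap well above `n_expected` together with a small tail probability indicates that the agreement is unlikely to arise from random selection.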

Figure 3: (A) The blue bars represent the expected number of compounds (N_expected) if chosen randomly, while the red bars represent the actual overlap (N_overlap) between compounds in CTD and DrugRepo. (B) The significance scores at different TC thresholds (with the y-axis as thresholds and the x-axis as diseases whose number of true positives is not 0). The significance score is represented by dot size, and the colour from red to blue represents the number of overlapping compounds.
We also analysed whether the number of approved drugs associated with a disease affects the DrugRepo scoring. As shown in Figure 4A, if a specific disease is associated with a considerable number of approved drugs, then more repurposed compounds can be matched (correlation = 0.7). Similarly, the number of matched repurposed compounds (N_overlap) is also closely associated with the significance score (Figure 4B). Conversely, DrugRepo performance remains poor for diseases associated with fewer approved drugs.

The median of OCTS for different cancer types is lower (0.1-0.5) than that of CDS (Figure 5E and 5F) because complete target profiles (across the entire druggable genome) for most of the compounds are not experimentally tested. The average number of targets for the candidate and approved compounds in each of the five databases is less than 7 [27]. However, with the availability of additional high-throughput DTI studies, the OCTS values contributing to the DrugRepo score may also increase.

4.2. Matching DrugRepo candidate compounds with drugs in clinical trials
Setting a lower threshold on DrugRepo scores may result in more false positives (compounds not in clinical trials). We therefore also analysed the proportion of hits across ten thresholds on DrugRepo scores; the results are shown in Figure 6. Based on these successful matches, a DrugRepo score ≥ 0.4 may indicate good repurposing potential for a candidate compound with a lower chance of false positives. Few compounds have been tested in clinical trials; therefore, we suggest that top-scoring compounds be tested in vitro to evaluate the significance of DrugRepo scores.
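The threshold analysis can be sketched as follows. This is a minimal illustration, assuming a "hit" is a candidate compound that also appears in a clinical trial for the disease; the function name, the compound labels, and the three example thresholds are hypothetical (the analysis above uses ten thresholds).

```python
def hit_proportions(scores, in_trials, thresholds):
    """Proportion of selected compounds that are hits, per score threshold.

    scores: dict mapping compound -> DrugRepo score
    in_trials: set of compounds found in clinical trials
    """
    result = {}
    for threshold in thresholds:
        selected = [c for c, s in scores.items() if s >= threshold]
        hits = sum(1 for c in selected if c in in_trials)
        result[threshold] = hits / len(selected) if selected else 0.0
    return result


# Toy data: two of the four candidates appear in clinical trials.
scores = {"c1": 0.9, "c2": 0.5, "c3": 0.3, "c4": 0.1}
in_trials = {"c1", "c3"}
proportions = hit_proportions(scores, in_trials, [0.2, 0.4, 0.8])
```

Raising the threshold shrinks the selected set, so the proportion of hits can increase even as the absolute number of hits falls.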

4.3. Using the DrugRepo GUI to repurpose drugs for CML
We provide a case study on Chronic Myeloid Leukaemia (CML) using the web interface at http://drugrepo.org/. DrugRepo has three approved drugs (imatinib, nilotinib, and bosutinib) for CML. Users may check one or more of these approved drugs and customize the structural similarity and DrugRepo thresholds, as shown in Figure 7A.

The current DrugRepo score is based on three components. However, with more components (such as gene expression data), results can be further improved. We will therefore incorporate these improvements in the next version of DrugRepo.

Key points
• We proposed a novel scoring algorithm for repurposing a huge collection of pre-clinical compounds. The analysis is supported by a web tool available at: http://drugrepo.org/
• The DrugRepo score is based on three components, i.e. molecular structural similarity (TC), overlapping compound-target score (OCTS) and compound-disease score (CDS).
• The DrugRepo GUI helps translational researchers to design new drug repurposing applications and to perform predictive analysis.

DECLARATIONS
Availability of data and material: DrugRepo is available at http://drugrepo.org/.