Global mapping of pharmacological space

Paolini, Gaia V; Shapland, Richard H B; van Hoorn, Willem P; Mason, Jonathan S; Hopkins, Andrew L

doi:10.1038/nbt1228

Analysis
Published: 13 July 2006

Gaia V Paolini, Richard H B Shapland, Willem P van Hoorn, Jonathan S Mason & Andrew L Hopkins

Nature Biotechnology volume 24, pages 805–815 (2006)

Abstract

We present the global mapping of pharmacological space by the integration of several vast sources of medicinal chemistry structure-activity relationships (SAR) data. Our comprehensive mapping of pharmacological space enables us to identify confidently the human targets for which chemical tools and drugs have been discovered to date. The integration of SAR data from diverse sources by unique canonical chemical structure, protein sequence and disease indication enables the construction of a ligand-target matrix to explore the global relationships between chemical structure and biological targets. Using the data matrix, we are able to catalog the links between proteins in chemical space as a polypharmacology interaction network. We demonstrate that probabilistic models can be used to predict pharmacology from a large knowledge base. The relationships between proteins, chemical structures and drug-like properties provide a framework for developing a probabilistic approach to drug discovery that can be exploited to increase research productivity.

Main

The foundation for developing drug discovery into a knowledge-based predictive science lies, in part, in the assembly and integration of all medicinal chemistry structure-activity information¹. Although access to protein sequence data is widely available through global genome repositories, no such integrated databanks exist for medicinal chemistry structure-activity data. Public initiatives, such as the Harvard University (Cambridge, MA, USA) ChemBank Initiative², the US National Cancer Institute (Bethesda, MD, USA) Screening Database³ and the US National Institute of Mental Health's (Bethesda, MD, USA) Psychoactive Drug Screening Program K_i Database⁴, are important developments toward disseminating SAR data. However, most pharmacological data exists in proprietary screening databases, published documents, such as journal articles and patents, and a growing variety of commercial databases. The lack of accepted data standards and data integration thus prevents knowledge discovery and data-mining efforts from learning from the output of the significant annual private and public investment in pharmaceutical research.

To navigate chemogenomic knowledge space, we have created a comprehensive assembly of annotated pharmacological data^3,4,5,6,7,8. We have also designed a unified data model to enable the global mapping and measurement of pharmacological space (that is, biologically active chemical space) by the integration of diverse data sources into a single data warehouse. Although a possible alternative to this would be a federated approach, we found that a single database model better fitted with our data-integration vision as well as with our practical, architectural and technical constraints. We applied the principle of knowledge discovery in databases to the design^9,10, including data conversion, cleaning and transformation. We found that having all the data in one place offers greater control for entity indexing and data retrieval and management, enabling us to perform global mapping. Ultimately, we believe that the implementation, although important, is a separate issue and it is the integration concept and the data model, however physically realized, that matter. The data are integrated by chemical structure, using unique canonical representations, including the often-neglected issue of tautomers. Assay data are assigned to targets by protein sequence, and indications indexed by a disease code. Thus, both chemoinformatics and bioinformatics techniques can be applied directly to the data-mining of the integrated data set.

At present, the data warehouse contains 4.8-million nonredundant chemical structures, over 275,000 of which are classified as biologically active. Over 600,000 SARs of molecular binding (e.g., IC₅₀; inhibitor concentration required for 50% inhibition of the normal reaction) data from Pfizer's internal screening files are integrated with commercial screening data, competitive intelligence on approved and investigational drugs and key components of the past 25 years of published medicinal chemistry data.

Pharmacological target space

Large-scale data integration of proprietary and published screening data enables the identification of the number of unique molecular targets, as represented by protein sequences, for which chemical tools, leads or drugs have been discovered. Because of the lack of integrated knowledge bases in pharmaceutical research, the list of molecular targets for which small-molecule chemical matter has been discovered has been difficult to ascertain^11,12,13,14. We have assigned 2,876 targets to protein sequences from 55 organisms, with biologically active chemical tools for 1,306 proteins. However, because of orthologs among species, many of the mammalian genes are redundant.

In total, we can unambiguously identify 836 genes in the human genome for which small-molecule chemical tools have been discovered (the threshold of biological activity is defined throughout as a binding affinity <10 μM). When Lipinski's rule-of-five criteria for oral drug absorption¹⁵ are applied, 727 human targets have at least one compound with a binding affinity <10 μM and 529 human targets have at least one compound with a binding affinity <100 nM that satisfies the rule-of-five (Table 1). Of the pharmacological targets selected, 158 human proteins have been identified as the primary mode-of-action targets of approved small-molecule drugs, with oral small-molecule drugs primarily targeting only 141 human proteins.
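
The target counts above combine an affinity threshold with a rule-of-five filter. A minimal sketch of that kind of filter follows, assuming precomputed descriptors and a strict reading of the rule (all four limits must hold). The `Measurement` record, its field names and the function names are hypothetical and are not taken from the authors' system.

```python
from dataclasses import dataclass

ACTIVITY_THRESHOLD_NM = 10_000.0  # 10 uM, the activity threshold used in the text


@dataclass(frozen=True)
class Measurement:
    """One compound-target affinity with precomputed descriptors (hypothetical record)."""
    target: str
    affinity_nm: float
    mol_weight: float
    clogp: float
    h_donors: int
    h_acceptors: int


def passes_rule_of_five(m):
    """Strict reading (an assumption): every one of Lipinski's four limits must hold."""
    return (m.mol_weight <= 500 and m.clogp <= 5
            and m.h_donors <= 5 and m.h_acceptors <= 10)


def targets_with_compliant_ligand(measurements, max_affinity_nm=ACTIVITY_THRESHOLD_NM):
    """Targets having at least one compound that is both active and rule-of-five compliant."""
    return {m.target for m in measurements
            if m.affinity_nm < max_affinity_nm and passes_rule_of_five(m)}


# Made-up example records, for illustration only.
records = [
    Measurement("ADRB2", 12.0, 310.4, 2.1, 2, 4),     # potent and compliant
    Measurement("ADRB2", 5000.0, 620.0, 6.2, 3, 9),   # active, not compliant
    Measurement("F2", 800.0, 540.0, 3.0, 6, 11),      # active, not compliant
    Measurement("KCNH2", 50000.0, 300.0, 3.0, 1, 3),  # compliant, not active
]
print(targets_with_compliant_ligand(records))         # 10 uM threshold
print(targets_with_compliant_ligand(records, 100.0))  # 100 nM threshold
```

Tightening `max_affinity_nm` from 10 μM to 100 nM mirrors the two rows of counts quoted above; some analyses allow one rule-of-five violation, which would change `passes_rule_of_five` only.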

Table 1 Pharmacological target space^a

Polypharmacology

A key question in global pharmacological space is how extensive is promiscuity, which is defined as the specific binding of a chemical to more than one target. Considering each pair of targets in turn, if two proteins both bind to the same ligand, they can be considered as interacting in chemical space, even if they have no other interaction in physical space or similarity in sequence space. The concept of 'target-hopping,' where chemical matter for one target can be considered as the basis for leads or tools for another target has historically been an extremely fruitful method of drug discovery^16,17,18. The entire database was analyzed to ensure that nonspecific aggregation inhibitors¹⁹ did not bias the results. Of all the 276,122 active compounds found in our database, 65% have recorded activity for one target, whereas 35% are observed to hit more than one target.

We have mapped the observed polypharmacology interaction network for human proteins (Fig. 1) to navigate polypharmacology relationships between targets. Each node of the network is a human target for which we found active lead matter. Two nodes are connected if they share active matter. The strength of this connection (P_ij) is defined in the Methods section. Calculation of the polypharmacology network enables the visualization of the interactions between proteins in chemical space. The entire protein interaction network for human proteins, calculated from our database, consists of 700 proteins (nodes) connected by 12,119 interactions (edges) for all compounds below the affinity threshold of 10 μM and with a difference in affinity of up to three orders of magnitude between two targets. Interestingly, the structure of the network is robust to changes in the window of fold-differences in affinity; 696 proteins (nodes) are connected by 11,591 interactions (edges) for all compounds with an affinity threshold <10 μM that have a difference in affinity of up to two orders of magnitude between two targets, and 675 proteins (nodes) are connected by 10,016 interactions (edges) for all compounds with an affinity threshold <10 μM that have a difference in affinity of up to one order of magnitude between two targets. We should stress, however, that the SAR matrix is far from complete, and new data becoming available could alter the appearance of the network, as noted by Vieth et al.²⁰.
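
The network construction described above can be sketched as follows, assuming each compound-target pair has been reduced to a single best affinity in nM. The edge weight used here, a count of shared compounds, is only a stand-in: the actual connection strength P_ij is defined in the Methods section and is not reproduced in this sketch. The function and parameter names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_network(affinities_nm, threshold_nm=10_000.0, max_log_gap=3.0):
    """Connect two targets when a compound is active on both within an affinity window.

    affinities_nm maps (compound, target) -> best affinity in nM (must be positive).
    Returns a dict mapping (target_a, target_b), with target_a < target_b, to the
    number of shared compounds. The count is a placeholder weight, not P_ij.
    """
    actives_by_compound = defaultdict(dict)
    for (compound, target), value in affinities_nm.items():
        if value < threshold_nm:
            actives_by_compound[compound][target] = value

    edges = defaultdict(int)
    for actives in actives_by_compound.values():
        for (t1, a1), (t2, a2) in combinations(sorted(actives.items()), 2):
            # Keep the pair only if the affinities differ by at most
            # max_log_gap orders of magnitude.
            if abs(math.log10(a1) - math.log10(a2)) <= max_log_gap:
                edges[(t1, t2)] += 1
    return dict(edges)


# Made-up affinities, for illustration only.
example = {
    ("c1", "DRD2"): 5.0, ("c1", "HTR2A"): 40.0, ("c1", "ADRA1A"): 9000.0,
    ("c2", "DRD2"): 100.0, ("c2", "HTR2A"): 300.0,
    ("c3", "DRD2"): 20000.0, ("c3", "CDK2"): 10.0,
}
for window in (3.0, 2.0, 1.0):
    network = build_network(example, max_log_gap=window)
    nodes = {t for pair in network for t in pair}
    print(window, len(nodes), "nodes,", len(network), "edges")
```

Rerunning the same data with windows of three, two and one orders of magnitude is the kind of robustness check reported above for the full network.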

**Figure 1: Human polypharmacology interaction network representing relationships between proteins in chemical space.**

Promiscuity can be considered from the perspective of both the compound and the pharmacological target, to measure compound selectivity and target overlap^20,21,22. We evaluated the degree of promiscuity of each target in three different ways (see Methods section for definitions). Table 2 shows the top ten promiscuous targets obtained using the different methods. Method one (P₁) consists of calculating a target's promiscuity as the proportion of ligands shared with other targets, multiplied by the average number of targets that each of the target's ligands is active against. This definition promotes targets whose ligands are predominantly promiscuous, with a high number of other targets. The second method (P₂) uses the polypharmacology network. This promiscuity index is calculated by counting the number of connections of each target (edges connected to each node in the network). This definition promotes targets that are connected to a large number of other targets, regardless of the strength of the interaction. The third definition (P₃) again uses the polypharmacology network, but this time the strength of the connections (P_ij) is used in the summation. It is apparent that the different definitions of promiscuity highlight different effects, although the same target classes (aminergic G protein–coupled receptors (GPCRs), cytochrome P450s and protein kinases) appear at the top positions (Table 2). By comparing the rankings of targets resulting from using P₁, P₂ and P₃, we find that P₁ is correlated with neither P₂ nor P₃ (R < 0.5) whereas P₂ and P₃ are strongly correlated (R = 0.9). This is consistent with the fact that P₂ and P₃ are calculated using the same network, but also supports the view that connectivity, regardless of the relative strengths of the connections, is the important ingredient in the structure of the polypharmacology interaction network.
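
The three indices can be sketched directly from the verbal definitions above. This is a reading of the prose, not the formal Methods definitions: in particular, counting the target itself in the "average number of targets" for P₁ is an assumption, and the edge weights passed to the second function merely stand in for P_ij. All names are hypothetical.

```python
from collections import defaultdict


def p1_index(ligand_targets, target):
    """P1 as read from the prose: (fraction of the target's ligands that also hit
    another target) x (mean number of targets hit by the target's ligands).

    ligand_targets maps compound -> set of targets it is active against.
    Assumption: the target itself is included in each ligand's target count.
    """
    target_sets = [ts for ts in ligand_targets.values() if target in ts]
    if not target_sets:
        return 0.0
    shared_fraction = sum(1 for ts in target_sets if len(ts) > 1) / len(target_sets)
    mean_targets = sum(len(ts) for ts in target_sets) / len(target_sets)
    return shared_fraction * mean_targets


def p2_p3_indices(edges):
    """P2 = number of edges at each node; P3 = sum of edge strengths at each node.

    edges maps (target_a, target_b) -> connection strength (a stand-in for P_ij).
    """
    degree = defaultdict(int)
    strength = defaultdict(float)
    for (a, b), weight in edges.items():
        for node in (a, b):
            degree[node] += 1
            strength[node] += weight
    return dict(degree), dict(strength)


# Made-up data, for illustration only.
ligands = {
    "c1": {"DRD2", "HTR2A", "ADRA1A"},
    "c2": {"DRD2", "HTR2A"},
    "c3": {"DRD2"},
    "c4": {"CDK2"},
}
weights = {("DRD2", "HTR2A"): 2.0, ("ADRA1A", "DRD2"): 1.0, ("ADRA1A", "HTR2A"): 1.0}
p2, p3 = p2_p3_indices(weights)
print(p1_index(ligands, "DRD2"), p2["DRD2"], p3["DRD2"])
```

Because P₂ and P₃ are read off the same edge list, they tend to rank targets similarly, whereas P₁ depends only on the ligand sets.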

Table 2 Most promiscuous human proteins calculated using P₁, P₂ and P₃ promiscuity indexes^a

The majority of compounds are active against targets within the same gene family. However, as we observed from the structure of the polypharmacology interaction network, there is significant interaction between gene families. A quarter of all the promiscuous compounds have been observed to be active across different gene families. To visualize the polypharmacology interactions at gene-family level, we have summarized the target-target interaction network by summing all the P_ij values by gene family (G_kl; see Methods section). The resulting matrix is shown in Figure 2 (see data supplied in Supplementary Table 1 online). Using this matrix, we can illustrate the cumulative strength of intra- as well as inter-gene family connections, the latter represented by the off-diagonal cells. Aminergic GPCRs and protein kinases exhibit the greatest intra- as well as inter-gene family promiscuity.
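
The gene-family summary amounts to summing target-target strengths into family-family cells. A minimal sketch follows, assuming a plain unnormalized sum; whether the Methods definition of G_kl applies any further normalization is not assumed here. The function name and the example weights are hypothetical.

```python
from collections import defaultdict


def family_matrix(edges, family_of):
    """Sum target-target strengths into gene-family cells.

    edges maps (target_a, target_b) -> strength; family_of maps target -> family.
    Cells are keyed by a sorted family pair, so (k, l) and (l, k) share one cell;
    pairs with the same family twice are the diagonal (intra-family) cells.
    """
    cells = defaultdict(float)
    for (a, b), p_ij in edges.items():
        key = tuple(sorted((family_of[a], family_of[b])))
        cells[key] += p_ij
    return dict(cells)


# Made-up strengths, for illustration only.
strengths = {
    ("DRD2", "HTR2A"): 2.0,
    ("ADRA1A", "DRD2"): 1.0,
    ("ADRA1A", "HTR2A"): 1.0,
    ("CDK2", "DRD2"): 0.5,
}
families = {
    "DRD2": "aminergic GPCR", "HTR2A": "aminergic GPCR",
    "ADRA1A": "aminergic GPCR", "CDK2": "protein kinase",
}
print(family_matrix(strengths, families))
```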

**Figure 2: Degree of intra- and inter-gene family promiscuity illustrated as a polypharmacology interaction matrix.**

Bayesian predictions of pharmacology

We decided to investigate the construction of a virtual array of predictive pharmacology models derived from the analysis of the large-scale integrated SAR data. Using a Laplacian-modified Bayesian classifier approach^23,24, 698 target-specific predictive models were built. All the compounds classified as biologically active in the database were filtered by chemical quality criteria. Of the remaining compounds, 10% were removed for the test set (23,792 compounds with 55,781 measurements) and 90% of the data (214,128 compounds with 561,913 measurements) were used to build the predictive models. The Bayesian model for each target was built using the training set where all compounds are classified as either active (endpoint < 10 μM for that target) or inactive (the rest). A Bayesian model prediction is a number describing confidence of activity: the larger the score, the more confidence the compound is active, but no quantitative prediction of affinity is made. Similarly, a large negative score indicates high confidence of inactivity, and finally, a score close to zero is a neutral prediction. Bayesian prediction scores for all test set compounds were calculated across the bank of 698 models.
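
To make the scoring behavior concrete, here is a minimal sketch of a Laplacian-corrected naive Bayes scorer in its commonly published form: each structural feature contributes log((A_f + 1) / (T_f · p + 1)), where T_f is the number of training compounds containing the feature, A_f the number of those that are active, and p the overall active fraction. Whether this matches the authors' exact implementation, descriptors or score scaling is an assumption; features are represented here as arbitrary integers, and all names are hypothetical.

```python
import math
from collections import Counter


def train_laplacian_bayes(samples):
    """Return per-feature log weights from (feature_set, is_active) training pairs."""
    total = len(samples)
    base_rate = sum(1 for _, active in samples if active) / total
    seen = Counter()         # T_f: compounds containing feature f
    seen_active = Counter()  # A_f: active compounds containing feature f
    for features, active in samples:
        for f in features:
            seen[f] += 1
            if active:
                seen_active[f] += 1
    return {f: math.log((seen_active[f] + 1.0) / (seen[f] * base_rate + 1.0))
            for f in seen}


def score(weights, features):
    """Sum the weights of known features; unseen features contribute nothing."""
    return sum(weights.get(f, 0.0) for f in features)


# Made-up training data: integers stand in for fingerprint features.
training = [
    ({1, 2}, True), ({1, 3}, True),
    ({2, 4}, False), ({4, 5}, False), ({3, 4}, False), ({5, 6}, False),
]
model = train_laplacian_bayes(training)
print(score(model, {1, 2}))   # positive: resembles the actives
print(score(model, {4, 5}))   # negative: resembles the inactives
print(score(model, {99}))     # zero: nothing known about this feature
```

The three printed cases correspond to the three readings given above: a large positive score, a large negative score, and a neutral score near zero. One such model would be trained per target to obtain a bank of models.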

The success rates of the combined predictive models above the random baseline prediction are shown in Figure 3 (see data in Supplementary Table 2 online). All Bayesian scores greater than or equal to the cutoff are interpreted as predictions of activity. For example, at the confidence score cutoff of 50, 72% of compounds in the test set have at least one prediction, and 64% have at least one correctly predicted target in common with an experimental target, whereas only 4% were incorrectly predicted. At the Bayesian score of 50, a total of 58,428 biological activities are predicted, 56.7% of which are correct, representing a 153-fold enrichment over random. The predicted false-negative rate is 13%, whereas 26,828 false positives are apparently predicted. As the measured ligand-target matrix is only 0.4% full, many of the false-positive predictions may indeed still be true.

**Figure 3: Bayesian predictions of pharmacology.**

In addition to predicting primary pharmacology, we wanted to ascertain whether the models could be used to predict polypharmacology. To explore this problem, we have done a preliminary investigation with Cerep's (Paris) 'BioPrint' data set, which is a nearly complete matrix of measured activities of 997 compounds against 316 targets. Results of these studies can be found in the Supplementary Figures 1 and 2 online and Supplementary Tables 3 and 4 online. These initial studies indicate that probabilistic models built from integrated medicinal chemistry SAR data are a promising approach for predicting primary pharmacology across a large number of protein targets. In terms of polypharmacology, intra-gene family promiscuity is predicted with the highest confidence. Inter-gene family interactions are a much harder problem because of the sparse nature of the ligand-target matrix.

Relationship between molecular properties and target class

We calculated a set of physicochemical descriptors for all compounds in the database to investigate the relationship between target class and the physicochemical properties of ligands^13,25. The protein sequences assigned to each of the pharmacological targets were classified into gene families. Distinct differences in the distribution of molecular properties between sets of compounds active against different gene families are observed (Table 3, Fig. 4 and Supplementary Fig. 3 online). For example, the mean molecular weight (MW) of ligands binding to aminergic GPCRs is 378 Da (s.d. = 93 Da), whereas the mean MW of peptide GPCR ligands is greater at 514 Da, but with a wider spread (s.d. = 202 Da). Ligands for the nuclear hormone receptors are the most lipophilic, as measured by calculated octanol/water partition coefficient (clogP), mirroring the properties of steroids. Overall, the properties of the synthetic ligands reflect the differences in the properties of the endogenous ligands for each target class.

Table 3 Molecular properties of gene family ligands

**Figure 4: Molecular weight (MW) distribution of compounds by gene family.**

The distribution patterns illustrate that, although there are distinctions in the physical properties of the ligands, using a single property to discern separate gene families is too crude. We wanted to investigate whether ligands for specific gene families may be selected within a range of property parameters. Using a 184,687-compound subset of the data as a training set, linear discriminant analysis was used to classify 41,823 compounds by target class using only the calculated physicochemical molecular properties. The data set used for the linear discriminant analysis consisted of the subset of compounds that bind to members of exactly one target class. Overall, this simplistic method successfully classified 34% of ligands to their respective target classes, with an overall enrichment ratio over random of 6.9 (Supplementary Table 5 online). The results are interesting as they suggest that simple calculated molecular properties can be used as a crude classifier of a compound's biological activity, by gene family.

Industrial trends of compounds, targets and attrition

We have witnessed a remarkable growth in the number of reported targets and compounds disclosed in the medicinal chemistry literature, mirroring the rise in investment in pharmaceutical research. In recent years, the number of targets screened, including selectivity counter-screens, published in the medicinal chemistry literature has been growing rapidly. Screening data on nearly 900 proteins are currently published each year, of which >500 molecular targets are reported with potent chemical matter (that is, IC₅₀ < 100 nM). Currently, potent novel chemical tools and leads are first disclosed for ∼80–100 new molecular targets each year (Fig. 5a). No doubt, this is a conservative estimate as many new compounds and targets are only disclosed in patents, which are not included in this initial literature analysis. The rate of discovery of chemical tools for new targets doubled from an average of 30 new targets with leads disclosed per year in the 1980s to an average of 60 new targets per year in the 1990s. In comparison, an average of four new targets for first-in-class drugs reached the market each year during the 1990s¹³.

**Figure 5: Trends in medicinal chemistry of compounds in the database.**

That said, we have yet to see the increase in new targets with leads translating into a proportionate increase in the number of approved first-in-class drugs. An analysis of the targets of published compounds reveals some significant trends in the changing character of the industry's portfolio of targets and target classes (Fig. 5b), such as a relative decline in proportion of aminergic GPCRs in the industry's target portfolio and an increase in protein kinases.

Over the past 25 years, there has been a steady, inexorable rise in the median MW of reported medicinal chemistry compounds (Fig. 5c). Comparing 5-year averages from 1986–1990 to those of 1999–2003, the median MW of all reported medicinal chemistry compounds in the literature rose 68 Da (∼20%) from 354 Da to 422 Da, respectively. Interestingly, this growth is also reflected in the increase of the median MW of disclosed ligands for several gene families. For example, compounds binding to aminergic GPCRs have increased in MW by around 56 Da, from 337 Da to 393 Da between the two 5-year periods. No significant increase in mean or median potency is observed in the data to explain the increase in MW. Even so, this rise in MW contrasts with the steady state of the mean MW of approved drugs²⁶ and the steady decline in MW through each subsequent stage of clinical development and increase in the proportion of compounds that are rule-of-five compliant^27,28 (Fig. 5d).

Of course, these calculations combine all target classes together; in contrast, the industry's target portfolio is unlikely to be in a steady state, with some target classes emerging and others declining in popularity. The relative difference in molecular properties among the gene families is also reflected in compounds in clinical development; however, again we notice that, even within a gene family, the median MW of compounds surviving subsequent clinical phases is declining slightly (Fig. 5d).

Degrees of druggability

A key objective of our global analysis of pharmacological space is to build the foundation of probabilistic approaches to drug discovery. Trends from marketed and investigational drugs indicate that oral drug space is limited by the biophysical barriers to absorption and permeability in the human body^15,26,29,30,31,32,33,34,35,36,37,38,39,40. Because we have observed that the molecular properties of ligands are correlated with their target class, it follows that we should be able to identify those targets with a higher probability to produce drug-like chemical matter. Rather than considering target druggability as a binary state, it can be thought of as a probabilistic continuum, where two targets may both be classified as druggable¹³ but may exhibit considerable differences in their probabilities of success.

Lipinski introduced the concept of upper physicochemical property limits, above which drug permeability and absorption are less likely¹⁵. Like Lipinski, we use the simple molecular properties of clogP, number of hydrogen-bond acceptors (H-acc) and number of hydrogen-bond donors (H-don) as the dimensions of a reduced chemical space. Oral drugs are still the primary focus of pharmaceutical research; therefore, we calculated the properties of 617 approved oral drugs in the reduced chemical space (for which we calculated a centroid at MW = 316, clogP = 2.3, H-acc = 4 and H-don = 2). Figure 6a illustrates the population distribution of oral drugs in two-dimensional molecular property space as an interpolated contour map.

**Figure 6: Chemical space of drugs and leads.**

In terms of drug targets, Figure 6b shows the distribution of median molecular properties for all compounds for each of the human oral-drug targets. For each target, the molecular properties are averaged over all its potent active compounds (<100 nM), including oral drugs and leads. Figure 6c shows the same quantities, this time for all human targets with potent active compounds. Comparison of these two figures shows that a significant number of targets are outside the rule-of-five boundaries.

Given the set of active compounds observed for a target, could the ligand properties in reduced chemical space provide a guide to quantifying the likelihood of the target to produce an oral drug? As a first approximation, the degree of druggability of the target can be described as the distance D_T between the target T and the oral drugs, in reduced chemical space. This distance is expressed as a function of the deviation of the centroid of each target from the ideal value of the oral-drugs distribution (see Methods section). The resulting distance ranges from 0 to 1, with ideal value being 0. If we compare the results for all human targets (excluding known drug targets) versus human oral drug targets, we observe an enrichment in the degree of druggability of drug targets versus all the remaining human targets. We find that 87% of human oral-drug targets have D_T ≤ 0.6, and 65% have D_T ≤ 0.4. Of the remaining human targets, 68% have D_T ≤ 0.6, and 39% have D_T ≤ 0.4. This means that ∼200 of the remaining targets have a relatively high degree of druggability (D_T ≤ 0.4), but have yet to realize this potential.

Discussion

The large-scale integration of medicinal chemistry and pharmacological data enables for the first time the global surveying and navigation of the biologically active chemical space (pharmacological space). Our initial investigations illustrate how the pharmacological target space of potential drug targets is a function of the physicochemical property filters applied to the ligands^13,25. The number of proteins for which chemical tools have been identified is significantly higher than previous estimates^11,12,13. The compilation and dissemination of chemical tools identified in a global survey, such as this, could be the basis of a rich chemical toolbox for chemogenomics^7,25,41,42, providing that the proper legal safeguards and respect for intellectual property are observed.

The comprehensive cataloging of biologically active chemicals also fosters the development of systematic ontologies for pharmacology and medicinal chemistry^43,44. The concept of relating proteins in chemical space by polypharmacology interactions provides the foundation for a ligand-based protein classification and valuable resource for understanding the molecular basis for compound promiscuity^5,45,46. Our initial focus has been on in vitro binding and selectivity data. Although molecular data provide us with invaluable insights into molecular recognition, ultimately they need to be integrated with gene expression and phenotypic end-points from in vivo and clinical observations if we are to capture the relationships between molecular binding across the proteome with efficacy or toxicity.

In our opinion two interesting potential applications of this work are polypharmacology and probabilistic modeling. The mapping of polypharmacology networks enables us to start considering the rational design of selectively promiscuous agents, thereby expanding the opportunity space for new medicines. Approaching drug discovery as a probabilistic enterprise based on a priori knowledge with an understanding of the varying degrees of druggability, promiscuity and attrition risks may be a significant advance in attempting to increase research productivity. As the vast majority of all drug discovery projects and clinical candidates fail the exacting criteria for safe human medicines, what we are left with are the learning and data that can contribute to the refinement of predictive models, for the benefit of all. Realization of the importance of the integration of our accumulated data can provide the basis for a significant improvement in our knowledge of success factors in the drug discovery enterprise.

Methods

Database and data model.

Our physical database consists of a single central Oracle 9.2 data warehouse. We store chemical structures as Simplified Molecular Input Line Entry Specification (SMILES) strings (http://www.daylight.com/), and we use the Daylight DayCart Oracle Cartridge (http://www.daylight.com/) for structure indexing and manipulation. We chose SMILES as a database-friendly representation, as it is a compact, simple character syntax, encoding a self-contained language with its own controlled vocabulary and enabling unique canonical representations of structures, in which stereochemical descriptions can easily be defined or relaxed when querying structural data. Our data model is fully normalized to avoid bias toward specific data queries. It is chemo-centric, in that we use chemical structures as the key to information storage and retrieval. This means that all the different entity types are ultimately connected to chemical structures. We are aware that methods for chemical representation are not fully mature and sometimes subjective. Consider for instance the perception of tautomeric equivalence (for example, the two unsubstituted nitrogens in an imidazole ring, one of which nominally needs to have a hydrogen atom attached) and tautomeric relations (where one tautomer may be considered more stable and thus the preferred drawing form or indeed where the tautomeric forms are considered to require chemical transformation). As a consequence, related software and rules are likely to change, expand and improve with time. For this reason we have designed the data model to handle multiple concurrent representations (that is, multiple SMILES strings) for any given compound. This way a fully flexible view of chemical structures and their connections can be achieved.

We produced a single unified data warehouse integrating, by chemical structure, protein sequence and indication, Pfizer's structure-activity data (e.g., IC₅₀, EC₅₀ (concentration of a compound where 50% of its effect is observed), K_i, K_d, excluding high-throughput screening percentage inhibition data), which contains data from legacy Pfizer (New York), Warner-Lambert (formerly of Morris Plains, NJ, USA, now part of Pfizer) and Pharmacia (formerly of Kalamazoo, MI, USA, now part of Pfizer); the Inpharmatica (London) StARLITe database, which contains data extracted from Journal of Medicinal Chemistry (issues January 1980–September 2004) and Bioorganic & Medicinal Chemistry Letters (issues January 1991–September 2004); the Cerep BioPrint database⁵ and summary data from the Thomson (New York) Current Drugs Investigational Drugs Database (IDDB) (http://scientific.thomson.com/products/iddb/). The current database contains 4.8-million unique chemical structures with protein identifiers and sequences assigned to 2,876 targets with assay measurements; 526,548 assay measurements are related to 276,122 active chemical structures.
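
The idea of holding multiple concurrent SMILES representations per compound can be illustrated with a small in-memory registry. This is only a sketch of the concept, not the Oracle schema described above: the class, the method names and the scheme labels (`as_drawn`, `canonical_tautomer`) are hypothetical, and the SMILES strings are hand-written illustrations of the two tautomers of methylimidazole rather than the output of any canonicalization tool.

```python
class CompoundRegistry:
    """Map several (scheme, SMILES) representations onto one compound identifier."""

    def __init__(self):
        self._id_by_repr = {}   # (scheme, smiles) -> compound id
        self._reprs_by_id = {}  # compound id -> set of (scheme, smiles)

    def register(self, scheme, smiles):
        """Return the id for this representation, creating a compound if it is new."""
        key = (scheme, smiles)
        if key in self._id_by_repr:
            return self._id_by_repr[key]
        compound_id = len(self._reprs_by_id) + 1
        self._id_by_repr[key] = compound_id
        self._reprs_by_id[compound_id] = {key}
        return compound_id

    def add_representation(self, compound_id, scheme, smiles):
        """Attach a further representation to an existing compound."""
        if compound_id not in self._reprs_by_id:
            raise KeyError(compound_id)
        key = (scheme, smiles)
        owner = self._id_by_repr.get(key, compound_id)
        if owner != compound_id:
            raise ValueError("representation already assigned to another compound")
        self._id_by_repr[key] = compound_id
        self._reprs_by_id[compound_id].add(key)

    def lookup(self, scheme, smiles):
        """Return the compound id for a representation, or None if unknown."""
        return self._id_by_repr.get((scheme, smiles))

    def representations(self, compound_id):
        return sorted(self._reprs_by_id[compound_id])


registry = CompoundRegistry()
cid = registry.register("as_drawn", "Cc1c[nH]cn1")               # one tautomer as drawn
registry.add_representation(cid, "as_drawn", "Cc1cnc[nH]1")      # the other tautomer
registry.add_representation(cid, "canonical_tautomer", "Cc1c[nH]cn1")
print(registry.lookup("as_drawn", "Cc1cnc[nH]1") == cid)
print(registry.representations(cid))
```

Keeping the scheme alongside each string means a new canonicalization rule can be added as another scheme without rewriting existing keys.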

Extraction, transformation and loading (ETL).

Before being fully integrated into our database, the original data sources were first loaded into Oracle staging tables. This was achieved using a combination of tools (Servefile's Java-based data loader, Oracle, Pipeline Pilot). The staging tables were processed to perform data selection, cleaning, mapping and standardization. This ETL procedure is the most critical and time-consuming part of knowledge discovery in databases involving a blend of disciplines, namely scientific-domain expertise, logic and informatics. Data fields from the different data sources were identified and selected. Metadata tables were created to map together different conceptualizations of the same entities (that is, different ontologies). Data quality issues, ranging from spelling mistakes to entity misassignment, were addressed and contained or flagged. Chemical structures were standardized at different levels depending on the chosen representation. Wherever practical, all entities that could be enumerated (e.g., units of measure, country codes) were mapped to controlled vocabularies. At the end of this process, data were fully integrated at a scientific level for data mining. Diseases were mapped onto a disease taxonomy derived from the Medical Dictionary for Regulatory Activities (MedDRA). Protein sequences were directly mapped to assays in all cases where the protein could be unambiguously identified.

Data access.

To be of practical use, a data repository needs to be easily accessible. This requirement is at odds with the principles of data normalization⁴⁷ and flexibility of representation. It is therefore customary to separate the data warehouse (and data-loading activities) from access layers (data retrieval). The latter are usually data marts, sets of database tables where data are regrouped in a different way, optimized to answer specific questions. The advantage of having data marts is that queries are prepackaged and therefore faster. The drawback is that data must be copied from the data store to the data marts. This causes additional issues such as disk space shortage and scheduling of data updates and downtime. We believe that data marts are the right solution where the most common queries are already known and routinely performed. Because our database system was still highly experimental, and the number of questions we wanted to ask very high, we designed an alternative approach. We built a set of components (using Scitegic's (San Diego, CA, USA) Pipeline Pilot 4.5; http://www.scitegic.com/) to query, manipulate and filter the data. The lower-level components could be combined, and results from a query could be refined and/or fed into subsequent queries, generating sets of hit lists. This approach offered two advantages. The first was to perform an experimental benchmarking of the database, to find where data marts would be mostly needed to improve performance, and how the most commonly asked questions could be identified, grouped together and packaged. The second was to offer a great flexibility in interrogating the database, allowing us to cross-link the different entities in every possible way. We found that the performance drawback was acceptable for a system at this stage of maturity, mostly used for statistical analysis and post-processing, rather than for fast online data retrieval.

Preparation and analysis of chemical structures.

In the study described here, all chemical structures were standardized using DayCart 4.82. A further processing step, to remove inconsistencies and identify salts and mixtures, was performed using a Pipeline Pilot protocol written in house. For the purpose of this study, all salts were stripped off the structures and the canonical tautomer of each resulting structure was identified using a standard Pipeline Pilot component. The resulting desalted canonical tautomers were loaded onto the database and used for structure matching. Molecular properties were either stored or calculated on the fly using standard Pipeline Pilot components.

Analysis of biological activity results.

N-point results (e.g., IC₅₀, EC₅₀, K_i and K_d) were collected for all the molecular targets that we mapped to gene sequences. Biological assays related to more than one gene (where the particular target could not be identified or where more than one target was involved) were kept separate. The analysis here refers to the cases where a given assay was related to a single gene. The active compounds were selected among the compounds where the best resulting activity (combining all the N-point measurement types) was found to be <10 μM. This is our definition of active compounds throughout the paper. Outliers in the biological activity results were identified with a simple automated protocol based on calculating the average distance:

of each result in a set from all other results and flagging the ones where D_i − D_min > 1. Here the symbols i and j refer to results from different assay experiments for the same compound and target. D_min is the minimum distance among all pairs of these results. The flagged sets were then manually checked and the outliers removed from the analysis.
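The activity cutoff and the outlier flag can be sketched as follows. The sketch assumes that the distance between two results is the absolute difference of their log10 values, so that a threshold of 1 corresponds to one log unit; this choice of distance, the nanomolar units and the function names are assumptions made for illustration, not details taken from the protocol itself.

```python
import math
from itertools import combinations

def is_active(values_nM, cutoff_nM=10_000.0):
    """Active if the best (lowest) measured value is below 10 uM."""
    return min(values_nM) < cutoff_nM

def flag_outliers(values_nM, threshold=1.0):
    """Flag results whose average distance D_i exceeds D_min by more than threshold.

    Assumed distance: |log10(x_i) - log10(x_j)| between two results for the
    same compound and target.
    """
    logs = [math.log10(v) for v in values_nM]
    n = len(logs)
    if n < 2:
        return [False] * n
    # D_min: minimum distance among all pairs of results.
    d_min = min(abs(a - b) for a, b in combinations(logs, 2))
    flags = []
    for i, li in enumerate(logs):
        # D_i: average distance of result i from all other results.
        d_i = sum(abs(li - lj) for j, lj in enumerate(logs) if j != i) / (n - 1)
        flags.append(d_i - d_min > threshold)
    return flags

# Hypothetical repeat measurements (nM) for one compound-target pair.
measurements = [10, 12, 15, 11, 14, 5000]
flags = flag_outliers(measurements)
```

Flagged results are only candidates: in the procedure described above they were checked manually before being removed.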

Polypharmacology interaction network.

The strength of polypharmacology interactions (P_ij) between two targets i and j was calculated, for all active compounds in the database, as follows:

where N_ij^tested is the number of compounds tested against both targets i and j. N_ij is the number of compounds observed to bind to both targets i and j below the compound promiscuity threshold; a compound is considered shared between targets i and j if there is less than an n log difference in potency (where n = 1 is a tenfold difference in potency, n = 2 is a 100-fold difference in potency, n = 3 is a 1,000-fold difference in potency).
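The two counts defined above can be illustrated with a short sketch. It assumes potencies are stored as negative log10 molar values (so that 5.0 corresponds to 10 μM) in a hypothetical dictionary keyed by compound and target. The ratio N_ij / N_ij^tested returned as the third value is only an assumed stand-in for a normalized interaction strength and should not be read as the definition of P_ij.

```python
def shared_counts(activities, ti, tj, n=1.0, active_p=5.0):
    """Count compounds tested on both targets and those shared between them.

    activities: {compound: {target: potency as -log10(molar)}} (hypothetical layout)
    n: allowed potency difference in log units
    active_p: activity cutoff in -log10(molar); 5.0 corresponds to 10 uM
    """
    n_tested = 0
    n_shared = 0
    for profile in activities.values():
        if ti in profile and tj in profile:
            n_tested += 1
            p_i, p_j = profile[ti], profile[tj]
            both_active = p_i > active_p and p_j > active_p
            if both_active and abs(p_i - p_j) < n:
                n_shared += 1
    # Assumed normalization, for illustration only.
    ratio = n_shared / n_tested if n_tested else 0.0
    return n_shared, n_tested, ratio

# Hypothetical potency profiles.
activities = {
    "C1": {"A": 8.0, "B": 7.5},   # active on both, 0.5 log units apart
    "C2": {"A": 8.0, "B": 5.5},   # active on both, 2.5 log units apart
    "C3": {"A": 6.0},             # tested on A only
    "C4": {"A": 4.0, "B": 4.2},   # tested on both, inactive on both
}
narrow = shared_counts(activities, "A", "B", n=1.0)
wide = shared_counts(activities, "A", "B", n=3.0)
```

Widening the window from n = 1 to n = 3 admits compound C2 as shared, which shows how the choice of n changes the density of the resulting network.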

Each log order difference in potency represents a binding energy difference of ΔΔG = −1.4 kcal/mol. We used Cytoscape⁴⁸ (http://www.cytoscape.org/) to display the interaction network in Figure 1 for n = 3. The cumulative effect of polypharmacology interactions between different targets of the same or different gene families is represented by the elements of the summarized matrix in Figure 2, calculated as

Only the cells for which enough statistics were available (N_ij^tested > 10) were included in the summation. A potency-difference window of n = 1 was used.

Figure 2 was produced using Spotfire (Somerville, MA, USA) Decision Site 7.2 (http://www.spotfire.com/).
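The binding-energy equivalent of one log unit of potency quoted in this section follows from the standard relation between free energy and the equilibrium constant, ΔΔG = −RT ln 10 per tenfold gain in affinity. A minimal check, assuming a temperature of 298.15 K (the temperature is an assumption; the function name is illustrative):

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol K)

def ddg_kcal_per_mol(n_log_units, temperature_K=298.15):
    """Free-energy difference for an n-log-unit gain in binding affinity."""
    return -R_KCAL_PER_MOL_K * temperature_K * math.log(10) * n_log_units

one_log = ddg_kcal_per_mol(1)    # about -1.36 kcal/mol at 298.15 K
three_log = ddg_kcal_per_mol(3)  # window used for the network display (n = 3)
```

At body temperature (310.15 K) the same expression gives about −1.42 kcal/mol, so the rounded value of 1.4 kcal/mol per log unit holds at either temperature.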

Calculation of promiscuity indices.

We evaluated the promiscuity of a target T in three different ways (P₁, P₂ and P₃).

The first index was defined as follows:

with

where N_actives(T) is the number of active compounds of target T and N_totalshared(T) is the number of active compounds of target T for which the compound promiscuity index (P_C) > 1. (P_C of a compound (C) is defined as the total number of targets that the compound is active against.)
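The quantities that enter the first index can be computed directly from a list of active compound-target pairs. The sketch below derives P_C for each compound and then N_actives(T) and N_totalshared(T) for each target; it stops short of P₁ itself. The input layout and function name are hypothetical.

```python
from collections import defaultdict

def promiscuity_counts(active_pairs):
    """Return (P_C per compound, N_actives per target, N_totalshared per target).

    active_pairs: iterable of (compound, target) pairs judged active.
    """
    targets_of = defaultdict(set)
    for compound, target in active_pairs:
        targets_of[compound].add(target)

    # P_C: total number of targets the compound is active against.
    p_c = {c: len(ts) for c, ts in targets_of.items()}

    n_actives = defaultdict(int)
    n_totalshared = defaultdict(int)
    for compound, targets in targets_of.items():
        for target in targets:
            n_actives[target] += 1
            if p_c[compound] > 1:
                # Compound is also active against at least one other target.
                n_totalshared[target] += 1
    return p_c, dict(n_actives), dict(n_totalshared)

# Hypothetical active pairs.
pairs = [("C1", "A"), ("C1", "B"), ("C2", "A"),
         ("C3", "B"), ("C3", "C"), ("C3", "A")]
p_c, n_actives, n_totalshared = promiscuity_counts(pairs)
```

In this toy data set, target A has three actives, of which two (C1 and C3) are also active elsewhere.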

The second index is

where I_Tj is a matrix identical to the polypharmacology matrix, with all the values where P_ij is nonzero substituted by ones.

The third index was calculated by summing along rows of the matrix itself as
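A minimal sketch of the two matrix-based indices, assuming both are plain row sums over the off-diagonal elements: the second over the binarized matrix I and the third over the polypharmacology matrix itself. The exclusion of the diagonal and the absence of any normalization are assumptions made for illustration.

```python
def row_sum_indices(matrix, targets):
    """Compute row sums of the binarized matrix and of the matrix itself.

    matrix: square list of lists holding interaction strengths P_ij
    targets: target labels in row/column order
    Returns (p2, p3) as dictionaries keyed by target.
    """
    p2, p3 = {}, {}
    for i, target in enumerate(targets):
        off_diagonal = [v for j, v in enumerate(matrix[i]) if j != i]
        # Binarized row: every nonzero interaction counts as one.
        p2[target] = sum(1 for v in off_diagonal if v != 0)
        # Weighted row: interaction strengths are summed as they are.
        p3[target] = sum(off_diagonal)
    return p2, p3

# Hypothetical symmetric interaction matrix for three targets.
targets = ["A", "B", "C"]
matrix = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.2],
    [0.0, 0.2, 0.0],
]
p2, p3 = row_sum_indices(matrix, targets)
```

Under these assumptions the second index counts how many other targets a target is linked to, whereas the third also reflects how strong those links are.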

Bayesian model building.

Compounds were filtered to remove structures with MW >1,000 Da and those that failed structural quality filters (e.g., toxicophores, aggregation inhibitors, reactive groups). After the filtering, there were 617,694 experimental activities from 238,655 compounds covering 698 targets. Protein targets with fewer than ten biologically active compounds after filtering were also removed from the data set. Compound structures were transformed into FCFP_6 functional-class fingerprints. Data preparation, Bayesian analysis and model building were implemented using the Scitegic Pipeline Pilot Laplacian-corrected Bayesian classifier²³,²⁴ algorithm. This implementation of Bayesian statistics uses information from both the active and inactive compounds in the training set and removes from the model those features that are deemed unimportant.
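The principle of a Laplacian-corrected Bayesian classifier can be sketched in a few lines. The sketch assumes the commonly described form of the estimator, in which a feature seen in N_F training compounds, A_F of them active, receives the weight log((A_F + 1) / (N_F · P_base + 1)), with P_base the overall fraction of actives. It is not the Pipeline Pilot implementation: plain strings stand in for FCFP_6 features, and no feature pruning is performed.

```python
import math
from collections import Counter

def train_laplacian_bayes(samples):
    """Learn one weight per feature from (feature_set, is_active) samples.

    The training set must be non-empty.
    """
    n_total = len(samples)
    n_active = sum(1 for _, active in samples if active)
    p_base = n_active / n_total  # baseline probability of activity

    n_f = Counter()  # compounds containing the feature
    a_f = Counter()  # active compounds containing the feature
    for features, active in samples:
        for feature in set(features):
            n_f[feature] += 1
            if active:
                a_f[feature] += 1

    # Laplacian-corrected relative estimate, on a log scale.
    return {f: math.log((a_f[f] + 1) / (n_f[f] * p_base + 1)) for f in n_f}

def score(weights, features):
    """Sum the weights of the known features; unseen features contribute zero."""
    return sum(weights.get(f, 0.0) for f in set(features))

# Hypothetical training set with placeholder feature names.
training = [
    ({"amine", "aryl"}, True),
    ({"amine", "acid"}, True),
    ({"acid", "ester"}, False),
    ({"ester", "aryl"}, False),
]
weights = train_laplacian_bayes(training)
likely = score(weights, {"amine", "aryl"})
unlikely = score(weights, {"ester"})
```

The correction pulls the estimate for rarely seen features toward the baseline, so a feature observed once in a single active compound is not given undue weight.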

Linear discriminant analysis.

The biologically active compounds were filtered by chemical quality criteria to remove aggregation inhibitors and compounds with potentially reactive groups. Compounds active against more than one gene family were also removed. Of the remaining compounds, 184,687 were selected as a training set for the linear discriminant analysis (as implemented in R⁴⁹) to classify the gene family activity on a test set of 41,823 compounds. The classification was based on the following calculated molecular properties: MW, number of hydrogen-bond acceptors, number of hydrogen-bond donors, number of rotatable bonds, molecular surface area, molecular polar surface area, number of ionizable centers, clogP, Andrews' binding energy⁵⁰ and predicted molecular solubility.

Distances in reduced chemical space.

We prepared a set of 617 US Food and Drug Administration–approved oral drugs and calculated their MW, clogP, number of hydrogen-bond acceptors (H-acc) and number of hydrogen-bond donors (H-don), using standard and in-house Pipeline Pilot components. In these components the H-acc atoms are defined as heteroatoms (oxygen, nitrogen, sulfur or phosphorus) with one or more lone pairs, excluding atoms with positive formal charges, amide and pyrrole-type nitrogens and aromatic oxygen and sulfur atoms in heterocyclic rings. H-don atoms are defined as heteroatoms (oxygen, nitrogen, sulfur or phosphorus) with one or more attached hydrogen atoms. These four properties are used to characterize the set in reduced chemical space. We collected all the targets that either have potent active compounds (below 100 nM) or are drug targets, and calculated the centroid {MW_T, clogP_T, H-acc_T, H-don_T} for each of these targets. The distance in reduced chemical space is defined for each target T as

where the function f_k, k = {MW, clogP, H-don, H-acc}, represents the distribution of values of the molecular properties of oral drugs, normalized so that f_k ∈ [0,1].
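The two ingredients of this calculation, the per-target centroid and a property distribution normalized to the interval [0,1], can be sketched as follows. The sketch assumes that the centroid is the arithmetic mean of each property over a target's compounds and that f_k is a histogram scaled so that its most populated bin equals 1; the bin width, the function names and the data are hypothetical, and the distance itself is not computed.

```python
from statistics import mean

PROPERTIES = ("MW", "clogP", "H-acc", "H-don")

def centroid(compounds):
    """Mean value of each property over the compounds of one target."""
    return {k: mean(c[k] for c in compounds) for k in PROPERTIES}

def normalized_histogram(values, bin_width):
    """Return f(x): relative population of the bin containing x, scaled to [0, 1]."""
    counts = {}
    for v in values:
        b = int(v // bin_width)
        counts[b] = counts.get(b, 0) + 1
    peak = max(counts.values())

    def f(x):
        return counts.get(int(x // bin_width), 0) / peak
    return f

# Hypothetical potent compounds for one target.
actives = [
    {"MW": 300, "clogP": 2.0, "H-acc": 4, "H-don": 1},
    {"MW": 400, "clogP": 4.0, "H-acc": 6, "H-don": 3},
]
target_centroid = centroid(actives)

# Hypothetical MW values standing in for the oral drug set.
f_mw = normalized_histogram([310, 320, 330, 450, 120], bin_width=100)
```

A centroid that falls in a well-populated region of the oral drug distribution gives a value of f_k close to 1, and one that falls outside the observed range gives 0.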

Note: Supplementary information is available on the Nature Biotechnology website.

References

1. Schuffenhauer, A. & Jacoby, E. Annotating and mining the ligand-target chemogenomics knowledge space. Drug Discov. Today: BIOSILICO 2, 190–200 (2004).
2. Strausberg, R.L. & Schreiber, S.L. From knowing to controlling: a path from genomics to drugs using small molecule probes. Science 300, 294–295 (2003).
3. Weinstein, J.N. et al. An information intensive approach to the molecular pharmacology of cancer. Science 275, 343–349 (1997).
4. Roth, B.L., Kroeze, W.K., Patel, S. & Lopez, E. The multiplicity of serotonin receptors: uselessly diverse molecules or an embarrassment of riches? Neuroscientist 6, 252–262 (2000).
5. Krejsa, C.M. et al. Predicting ADME properties and side effects: the BioPrint approach. Curr. Opin. Drug Discov. Devel. 6, 470–480 (2003).
6. Horvath, D. & Jeandenans, C. Neighborhood behavior of in silico structural spaces with respect to in vitro activity spaces-a novel understanding of the molecular similarity principle in the context of multiple receptor binding profiles. J. Chem. Inf. Comput. Sci. 43, 680–690 (2003).
7. Root, D.E., Flaherty, S.P., Kelley, B.P. & Stockwell, B. Biological mechanism profiling using an annotated compound library. Chem. Biol. 10, 881–892 (2003).
8. Wallqvist, A. et al. Mining the NCI screening database: explorations of agents involved in cell cycle regulation. Prog. Cell Cycle Res. 5, 173–179 (2003).
9. Piatetski-Shapiro, G. & Frawley, W. Knowledge Discovery in Databases (MIT Press, Cambridge, 1992).
10. Klösgen, W. & Zytkow, J.M. (eds.). Handbook of Data Mining and Knowledge Discovery (Oxford University Press, Oxford, 2002).
11. Drews, J. Genomic sciences and the medicine of tomorrow. Nat. Biotechnol. 14, 1516–1518 (1996).
12. Drews, J. & Ryser, S. Classic drug targets. Nat. Biotechnol. 15, 1318–1319 (1997).
13. Hopkins, A.L. & Groom, C.R. The druggable genome. Nat. Rev. Drug Discov. 1, 727–730 (2002).
14. Golden, J.B. Prioritizing the human genome: knowledge management for drug discovery. Curr. Opin. Drug Discov. Devel. 6, 310–316 (2003).
15. Lipinski, C.A., Lombardo, F., Dominy, B.W. & Feeney, P.J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 23, 3–25 (1997).
16. Van Gestel, S. & Schuermans, V. Thirty-three years of drug discovery and research with Dr. Paul Janssen. Drug Dev. Res. 8, 1–13 (1986).
17. Sneader, W. Drug Prototypes and Their Exploitation (Wiley, London, 1996).
18. Wermuth, C.G. Selective optimization of side activities: another way for drug discovery. J. Med. Chem. 47, 1303–1314 (2004).
19. McGovern, S.L., Helfand, B.T., Feng, B. & Shoichet, B.K. A specific mechanism of nonspecific inhibition. J. Med. Chem. 46, 4265–4272 (2003).
20. Vieth, M. et al. Kinomics—structural biology and chemogenomics of kinase inhibitors and targets. Biochim. Biophys. Acta 1697, 243–257 (2004).
21. Vieth, M., Sutherland, J.J., Robertson, D.H. & Campbell, R.M. Kinomics: characterizing the therapeutically validated kinase space. Drug Discov. Today 10, 839–846 (2005).
22. Frye, S.V. Structure-activity relationship homology (SARAH): a conceptual framework for drug discovery in the genomic era. Chem. Biol. 6, R3–R7 (1999).
23. Xia, X., Maliski, E.G., Gallant, P. & Rogers, D. Classification of kinase inhibitors using a Bayesian model. J. Med. Chem. 47, 4463–4470 (2004).
24. Rogers, D., Brown, R.D. & Hahn, M. Using extended-connectivity fingerprints with Laplacian-modified Bayesian analysis in high-throughput screening follow-up. J. Biomol. Screen. 10, 682–686 (2005).
25. Lipinski, C. & Hopkins, A. Navigating chemical space for biology and medicine. Nature 432, 855–861 (2004).
26. Vieth, M. et al. Characteristic physical properties and structural fragments of marketed oral drugs. J. Med. Chem. 47, 224–232 (2004).
27. Wenlock, M.C., Austin, R.P., Barton, P., Davis, A.M. & Leeson, P.D. A comparison of physiochemical property profiles of development and marketed oral drugs. J. Med. Chem. 46, 1250–1256 (2003).
28. Blake, J.F. Examination of the computed molecular properties of compounds selected for clinical development. Biotechniques (June) Suppl., 16–20 (2003).
29. Ajay, A., Walters, W.P. & Murcko, M.A. Can we learn to distinguish between “drug-like” and “nondrug-like” molecules? J. Med. Chem. 41, 3314–3324 (1998).
30. Lipinski, C.A. Drug-like properties and the causes of poor solubility and poor permeability. J. Pharmacol. Toxicol. Methods 44, 235–249 (2000).
31. Wang, J. & Ramnarayan, K. Towards designing drug-like libraries: a novel computational approach for prediction of drug feasibility of compounds. J. Comb. Chem. 1, 524–533 (1999).
32. Walters, W.P., Ajay & Murcko, M.A. Recognizing molecules with drug-like properties. Curr. Opin. Chem. Biol. 3, 384–387 (1999).
33. Podlogar, B.L., Muegge, I. & Brice, L.J. Computational methods to estimate drug development parameters. Curr. Opin. Drug Discov. Devel. 4, 102–109 (2001).
34. Muegge, I., Heald, S.L. & Brittelli, D. Simple selection criteria for drug-like chemical matter. J. Med. Chem. 44, 1841–1846 (2001).
35. Veber, D.F. et al. Molecular properties that influence the oral bioavailability of drug candidates. J. Med. Chem. 45, 2615–2623 (2002).
36. Proudfoot, J.R. Drugs, leads, and drug-likeness: an analysis of some recently launched drugs. Bioorg. Med. Chem. Lett. 12, 1647–1650 (2002).
37. Egan, W.J., Walters, W.P. & Murcko, M.A. Guiding molecules towards drug-likeness. Curr. Opin. Drug Discov. Devel. 5, 540–549 (2002).
38. Walters, W.P. & Murcko, M.A. Prediction of 'drug-likeness'. Adv. Drug Deliv. Rev. 54, 255–271 (2002).
39. Muegge, I. Selection criteria for drug-like compounds. Med. Res. Rev. 23, 302–321 (2003).
40. Lajiness, M.S., Vieth, M. & Erickson, J. Molecular properties that influence oral drug-like behavior. Curr. Opin. Drug Discov. Devel. 7, 470–477 (2004).
41. Stockwell, B.R. Chemical genetics: ligand-based discovery of gene function. Nat. Rev. Genet. 1, 116–125 (2000).
42. Austin, C.P., Brady, L.S., Insel, T.R. & Collins, F.S. NIH Molecular Libraries Initiative. Science 306, 1138–1139 (2004).
43. Schuffenhauer, A. et al. An ontology for pharmaceutical ligands and its applications for in silico screening and library design. J. Chem. Inf. Comput. Sci. 42, 947–955 (2002).
44. Feldman, H.J., Dumontier, M., Ling, S., Haider, N. & Hogue, C.W. CO: A chemical ontology for identification of functional groups and semantic comparison of small molecules. FEBS Lett. 579, 4685–4691 (2005).
45. Roth, B.L., Sheffler, D.J. & Kroeze, W.K. Magic shotguns versus magic bullets: selectively non-selective drugs for mood disorders and schizophrenia. Nat. Rev. Drug Discov. 3, 353–359 (2004).
46. Frantz, S. Drug discovery: playing dirty. Nature 437, 942–943 (2005).
47. Connolly, T. & Begg, C. Database Systems: A Practical Approach to Design, Implementation and Management, edn. 3 (Addison Wesley, Reading, MA, 2002).
48. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
49. R Core Development Team. A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, 2005).
50. Andrews, P.R., Craik, D.J. & Martin, J.L. Functional group contributions to drug-receptor interactions. J. Med. Chem. 27, 1648–1657 (1984).
51. Hopkins, A.L., Groom, C.R. & Alex, A. Ligand efficiency: a useful metric for lead selection. Drug Discov. Today 9, 430–431 (2004).
52. Kuntz, I.D., Chen, K., Sharp, K.A. & Kollman, P.A. The maximal affinity of ligands. Proc. Natl. Acad. Sci. USA 96, 9997–10002 (1999).

Acknowledgements

We want to thank an unknown referee for very helpful comments and suggestions. Thanks to Federica Massagrande, Emma Williamson, Sid Martin, Phil Brain, Bryn Williams-Jones, Jens Loesel, Mark Gardner, Nigel Wilkinson, Steve Pimblett, Giles Ratcliffe, Jerry Lanfear, Carolyn Barker, Tony Wood, Frank Burslem and Colin Groom. In particular, we would like to thank John Overington, Bissan Al-Lazikani, John Bradshaw and Yosi Taitz. Thanks to Alan Newton and the PGRDi Innovation Fund for financial support.

Author information

Gaia V Paolini and Andrew L Hopkins: These authors contributed equally to this work.

Authors and Affiliations

The Department of Knowledge Discovery, Pfizer Global Research and Development, Sandwich, CT13 9NJ, Kent, UK
Gaia V Paolini, Richard H B Shapland & Andrew L Hopkins
The Department of Computational Chemistry, Pfizer Global Research and Development, Sandwich, CT13 9NJ, Kent, UK
Willem P van Hoorn
The Department of Medicinal Informatics, Structure and Design, Pfizer Global Research and Development, Sandwich, CT13 9NJ, Kent, UK
Gaia V Paolini, Willem P van Hoorn, Jonathan S Mason & Andrew L Hopkins
The Department of Research Informatics, Pfizer Global Research and Development, Sandwich, CT13 9NJ, Kent, UK
Richard H B Shapland
Servefile Software Ltd., Nailsea, Bristol, BS48 4SG, North Somerset, UK
Richard H B Shapland
Lundbeck Research, Ottiliavej 9, DK-2500 Valby, Copenhagen, Denmark
Jonathan S Mason

Contributions

G.V.P., database design and production and knowledge discovery; R.H.B.S., database design and production; W.P.v.H., predictive modeling; J.S.M., chemical representation; A.L.H., database design and knowledge discovery.

Corresponding author

Correspondence to Andrew L Hopkins.

Ethics declarations

Competing interests

All of the authors were or are employees (or contract employees) of Pfizer.

About this article

Cite this article

Paolini, G., Shapland, R., van Hoorn, W. et al. Global mapping of pharmacological space. Nat Biotechnol 24, 805–815 (2006). https://doi.org/10.1038/nbt1228

Published: 13 July 2006
Issue Date: 01 July 2006
DOI: https://doi.org/10.1038/nbt1228
