Main

The foundation for developing drug discovery into a knowledge-based predictive science lies, in part, in the assembly and integration of all medicinal chemistry structure-activity information1. Although access to protein sequence data is widely available through global genome repositories, no such integrated databanks exist for medicinal chemistry structure-activity data. Public initiatives, such as the Harvard University (Cambridge, MA, USA) ChemBank Initiative2, the US National Cancer Institute (Bethesda, MD, USA) Screening Database3 and the US National Institute of Mental Health's (Bethesda, MD, USA) Psychoactive Drug Screening Program Ki Database4, are important developments toward disseminating SAR data. However, most pharmacological data exist in proprietary screening databases, published documents, such as journal articles and patents, and a growing variety of commercial databases. The lack of accepted data standards and data integration thus prevents knowledge discovery and data-mining efforts from learning from the output of the significant annual private and public investment in pharmaceutical research.

To navigate chemogenomic knowledge space, we have created a comprehensive assembly of annotated pharmacological data3,4,5,6,7,8. We have also designed a unified data model to enable the global mapping and measurement of pharmacological space (that is, biologically active chemical space) by the integration of diverse data sources into a single data warehouse. Although a possible alternative to this would be a federated approach, we found that a single database model better fitted with our data-integration vision as well as with our practical, architectural and technical constraints. We applied the principle of knowledge discovery in databases to the design9,10, including data conversion, cleaning and transformation. We found that having all the data in one place offers greater control for entity indexing and data retrieval and management, enabling us to perform global mapping. Ultimately, we believe that the implementation, although important, is a separate issue and it is the integration concept and the data model, however physically realized, that matter. The data are integrated by chemical structure, using unique canonical representations, including the often-neglected issue of tautomers. Assay data are assigned to targets by protein sequence, and indications indexed by a disease code. Thus, both chemoinformatics and bioinformatics techniques can be applied directly to the data-mining of the integrated data set.

At present, the data warehouse contains 4.8-million nonredundant chemical structures, over 275,000 of which are classified as biologically active. Over 600,000 SARs of molecular binding (e.g., IC50; inhibitor concentration required for 50% inhibition of the normal reaction) data from Pfizer's internal screening files are integrated with commercial screening data, competitive intelligence on approved and investigational drugs and key components of the past 25 years of published medicinal chemistry data.

Pharmacological target space

Large-scale data integration of proprietary and published screening data enables the identification of the number of unique molecular targets, as represented by protein sequences, for which chemical tools, leads or drugs have been discovered. Because of the lack of integrated knowledge bases in pharmaceutical research, the list of molecular targets for which small-molecule chemical matter has been discovered has been difficult to ascertain11,12,13,14. We have assigned 2,876 targets to protein sequences from 55 organisms, with biologically active chemical tools for 1,306 proteins. However, because of orthologs among species, many of the mammalian genes are redundant.

In total, we can unambiguously identify 836 genes in the human genome for which small-molecule chemical tools have been discovered (the threshold of biological activity is defined throughout as a binding affinity <10 μM). When Lipinski's rule-of-five criteria for oral drug absorption15 are applied, 727 human targets have at least one compound with a binding affinity <10 μM and 529 human targets have at least one compound with a binding affinity <100 nM that satisfy the rule-of-five (Table 1). Of the pharmacological targets selected, 158 human proteins have been identified as the primary modes-of-action of approved small-molecule drugs, with oral small-molecule drugs primarily targeting only 141 human proteins.
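The target counts above combine an affinity threshold with a rule-of-five filter. The following is a minimal sketch of that kind of filter, not the authors' implementation. It assumes precomputed descriptors, uses the commonly quoted Lipinski limits (MW ≤ 500 Da, clogP ≤ 5, at most 5 hydrogen-bond donors and 10 acceptors), and treats the number of tolerated violations as a parameter because the text does not state it. The `Ligand` record and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ligand:
    """Precomputed descriptors for one compound (hypothetical record layout)."""
    mol_weight: float   # Da
    clogp: float
    h_donors: int
    h_acceptors: int


def ro5_violations(lig: Ligand) -> int:
    """Count how many of the four commonly quoted Lipinski limits are exceeded."""
    return sum([
        lig.mol_weight > 500,
        lig.clogp > 5,
        lig.h_donors > 5,
        lig.h_acceptors > 10,
    ])


def targets_with_ro5_ligand(activities, threshold_nm, max_violations=0):
    """Return targets with at least one sufficiently potent, rule-of-five ligand.

    activities: iterable of (target_id, Ligand, affinity_nM) tuples.
    max_violations is an assumption; the source does not say how many
    violations were tolerated.
    """
    hits = set()
    for target, lig, affinity_nm in activities:
        if affinity_nm < threshold_nm and ro5_violations(lig) <= max_violations:
            hits.add(target)
    return hits
```

Calling `targets_with_ro5_ligand(activities, 10_000.0)` and `targets_with_ro5_ligand(activities, 100.0)` would correspond to the <10 μM and <100 nM rows of such a table.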

Table 1 Pharmacological target space

Polypharmacology

A key question in global pharmacological space is how extensive promiscuity is, where promiscuity is defined as the specific binding of a chemical to more than one target. Considering each pair of targets in turn, if two proteins both bind to the same ligand, they can be considered as interacting in chemical space, even if they have no other interaction in physical space or similarity in sequence space. The concept of 'target-hopping,' where chemical matter for one target can be considered as the basis for leads or tools for another target, has historically been an extremely fruitful method of drug discovery16,17,18. The entire database was analyzed to ensure that nonspecific aggregation inhibitors19 did not bias the results. Of all the 276,122 active compounds found in our database, 65% have recorded activity for one target, whereas 35% are observed to hit more than one target.

We have mapped the observed polypharmacology interaction network for human proteins (Fig. 1) to navigate polypharmacology relationships between targets. Each node of the network is a human target for which we found active lead matter. Two nodes are connected if they share active matter. The strength of this connection (Pij) is defined in the Methods section. Calculation of the polypharmacology network enables the visualization of the interactions between proteins in chemical space. The entire protein interaction network for human proteins, calculated from our database, consists of 700 proteins (nodes) connected by 12,119 interactions (edges) for all compounds below the affinity threshold of 10 μM and with a difference in affinity of up to three orders of magnitude between two targets. Interestingly, the structure of the network is robust to changes in the window of fold-differences in affinity; 696 proteins (nodes) are connected by 11,591 interactions (edges) for all compounds with an affinity threshold <10 μM that have a difference in affinity of up to two orders of magnitude between two targets, and 675 proteins (nodes) are connected by 10,016 interactions (edges) for all compounds with an affinity threshold <10 μM that have a difference in affinity of up to one order of magnitude between two targets. We should stress, however, that the SAR matrix is far from complete, and new data becoming available could alter the appearance of the network, as noted by Vieth et al.20.
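The edge rule described above (two targets are connected when a compound is active against both and the two affinities differ by no more than a chosen number of orders of magnitude) can be sketched as follows. This is an illustrative reading of the text only. It counts shared compounds per target pair and does not compute the connection strength Pij, whose definition is given in the Methods section. The input layout and the function name are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def shared_compound_edges(affinities_nm, threshold_nm=10_000.0, max_log_diff=3.0):
    """Count compounds shared by each pair of targets.

    affinities_nm: {compound_id: {target_id: best affinity in nM}}
    (hypothetical layout). A compound links two targets when it is below the
    activity threshold on both and the affinities differ by at most
    max_log_diff orders of magnitude.
    """
    edges = Counter()
    for by_target in affinities_nm.values():
        active = {t: a for t, a in by_target.items() if a < threshold_nm}
        # Sorting gives each pair a stable (smaller_id, larger_id) key.
        for (t1, a1), (t2, a2) in combinations(sorted(active.items()), 2):
            if abs(math.log10(a1) - math.log10(a2)) <= max_log_diff:
                edges[(t1, t2)] += 1
    return edges
```

Re-running the function with `max_log_diff` set to 3, 2 and 1 mirrors the robustness check described in the text, in which the window of fold-differences in affinity is narrowed.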

Figure 1: Human polypharmacology interaction network representing relationships between proteins in chemical space.

Two proteins are deemed to be interacting in chemical space (joined by an edge) if both bind one or more compounds within a defined difference in binding energy threshold (n = 3 in this plot, see Methods section). The number of proteins in this network is 486 (nodes), with 3,636 polypharmacology relationships (edges), where Pij > 0.1 (Pij is defined in the Methods section), the number of shared compounds between two proteins is >1 and the number of cotested compounds for the two targets, Nij(tested), is >10. Nodes are colored by gene family.

Promiscuity can be considered from the perspective of both the compound and the pharmacological target, to measure compound selectivity and target overlap20,21,22. We evaluated the degree of promiscuity of each target in three different ways (see Methods section for definitions). Table 2 shows the top ten promiscuous targets obtained using the different methods. Method one (P1) consists of calculating a target's promiscuity as the proportion of ligands shared with other targets, multiplied by the average number of targets that each of the target's ligands is active against. This definition promotes targets whose ligands are predominantly promiscuous, with a high number of other targets. The second method (P2) uses the polypharmacology network. This promiscuity index is calculated by counting the number of connections of each target (edges connected to each node in the network). This definition promotes targets that are connected to a large number of other targets, regardless of the strength of the interaction. The third definition (P3) again uses the polypharmacology network, but this time the strength of the connections (Pij) is used in the summation. It is apparent that the different definitions of promiscuity highlight different effects, although the same target classes (aminergic G protein–coupled receptors (GPCRs), cytochrome P450s and protein kinases) appear at the top positions (Table 2). By comparing the rankings of targets resulting from using P1, P2 and P3, we find that P1 is correlated with neither P2 nor P3 (R < 0.5) whereas P2 and P3 are strongly correlated (R = 0.9). This is consistent with the fact that P2 and P3 are calculated using the same network, but also supports the view that connectivity, regardless of the relative strengths of the connections, is the important ingredient in the structure of the polypharmacology interaction network.
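The three indexes can be sketched directly from the prose descriptions above. The formal definitions are in the Methods section, so the code below is an interpretation, not the authors' formulas. In particular, it assumes that the "average number of targets" in P1 counts the target itself, and it takes the connection strengths Pij as given input. All function names and input layouts are hypothetical.

```python
from collections import defaultdict


def p1(target, targets_by_compound):
    """P1 sketch: fraction of the target's ligands shared with other targets,
    times the mean number of targets each of those ligands is active against.

    targets_by_compound: {compound_id: set of target_ids it is active against}.
    Assumption: the mean counts the target itself.
    """
    ligands = [ts for ts in targets_by_compound.values() if target in ts]
    if not ligands:
        return 0.0
    shared_fraction = sum(len(ts) > 1 for ts in ligands) / len(ligands)
    mean_targets = sum(len(ts) for ts in ligands) / len(ligands)
    return shared_fraction * mean_targets


def p2_p3(weighted_edges):
    """P2 = number of edges per node; P3 = sum of Pij over those edges.

    weighted_edges: {(target_i, target_j): Pij}, one entry per unordered pair.
    """
    degree = defaultdict(int)
    strength = defaultdict(float)
    for (i, j), pij in weighted_edges.items():
        for node in (i, j):
            degree[node] += 1
            strength[node] += pij
    return dict(degree), dict(strength)
```

Because P2 and P3 are both accumulated over the same edge list, a strong rank correlation between them is expected whenever the Pij values do not vary widely, which is consistent with the comparison reported above.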

Table 2 Most promiscuous human proteins calculated using P1, P2 and P3 promiscuity indexes

The majority of compounds are active against targets within the same gene family. However, as we observed from the structure of the polypharmacology interaction network, there is significant interaction between gene families. A quarter of all the promiscuous compounds have been observed to be active across different gene families. To visualize the polypharmacology interactions at gene-family level, we have summarized the target-target interaction network by summing all the Pij values by gene family (Gkl, see Methods section). The resulting matrix is shown in Figure 2 (see data supplied in Supplementary Table 1 online). Using this matrix, we can illustrate the cumulative strength of intra- as well as inter-gene family connections, the latter represented by the off-diagonal cells. Aminergic GPCRs and protein kinases exhibit the greatest intra- as well as inter-gene family promiscuity.
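The gene-family summary described above amounts to grouping the target-target connection strengths by the families of the two targets. A minimal sketch is given below. It assumes a plain unnormalized sum, since any normalization used for Gkl is specified in the Methods section and not here. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def family_matrix(weighted_edges, family_of):
    """Sum Pij by the gene families of the two connected targets.

    weighted_edges: {(target_i, target_j): Pij}
    family_of: {target_id: gene family name}
    Returns {(family_k, family_l): summed Pij}, with the family pair sorted
    so that each unordered pair has one key. Diagonal keys (k == l) hold
    intra-family totals; the others hold inter-family totals.
    """
    g = defaultdict(float)
    for (i, j), pij in weighted_edges.items():
        k, l = sorted((family_of[i], family_of[j]))
        g[(k, l)] += pij
    return dict(g)
```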

Figure 2: Degree of intra- and inter-gene family promiscuity illustrated as a polypharmacology interaction matrix.

The degree of promiscuity, as measured by Gkl, is color-coded. White cells represent lack of information. The number of target pairs used in the summation and the level of compound statistics are shown in Supplementary Table 1 online.

Bayesian predictions of pharmacology

We decided to investigate the construction of a virtual array of predictive pharmacology models derived from the analysis of the large-scale integrated SAR data. Using a Laplacian-modified Bayesian classifier approach23,24, 698 target-specific predictive models were built. All the compounds classified as biologically active in the database were filtered by chemical quality criteria. Of the remaining compounds, 10% were removed for the test set (23,792 compounds with 55,781 measurements) and 90% of the data (214,128 compounds with 561,913 measurements) were used to build the predictive models. The Bayesian model for each target was built using the training set where all compounds are classified as either active (endpoint < 10 μM for that target) or inactive (the rest). A Bayesian model prediction is a number describing confidence of activity: the larger the score, the more confidence the compound is active, but no quantitative prediction of affinity is made. Similarly, a large negative score indicates high confidence of inactivity, and finally, a score close to zero is a neutral prediction. Bayesian prediction scores for all test set compounds were calculated across the bank of 698 models.
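For readers unfamiliar with the Laplacian-modified Bayesian approach, the sketch below shows the commonly described form of the estimator: each structural feature receives a weight log((A + 1) / (T·P + 1)), where T is the number of training compounds containing the feature, A is the number of those that are active, and P is the overall fraction of actives. A compound's score is the sum of the weights of its features. This is a generic illustration, not the authors' code. The features here are arbitrary hashable labels standing in for structural fingerprints, and the score scale is not comparable to the cutoffs quoted in this paper.

```python
import math
from collections import Counter


def train_laplacian_bayes(samples):
    """Fit per-feature log weights for one target.

    samples: list of (feature_set, is_active) pairs. The feature sets are
    placeholders for whatever fingerprint is used (an assumption here).
    """
    n_total = len(samples)
    base_rate = sum(1 for _, active in samples if active) / n_total
    seen = Counter()         # T: compounds containing each feature
    seen_active = Counter()  # A: active compounds containing each feature
    for features, active in samples:
        for f in set(features):
            seen[f] += 1
            if active:
                seen_active[f] += 1
    # Laplacian correction: rarely seen features are pulled toward weight 0.
    return {
        f: math.log((seen_active[f] + 1) / (seen[f] * base_rate + 1))
        for f in seen
    }


def score(weights, features):
    """Sum the weights of known features; unseen features contribute 0."""
    return sum(weights.get(f, 0.0) for f in set(features))
```

As in the text, a large positive score indicates confidence in activity, a large negative score indicates confidence in inactivity, and a score near zero is a neutral prediction.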

The success rates of the combined predictive models above the random baseline prediction are shown in Figure 3 (see data in Supplementary Table 2 online). All Bayesian scores greater than or equal to the cutoff are interpreted as predictions of activity. For example, at the confidence score cutoff of 50, 72% of compounds in the test set have at least one prediction, and 64% have at least one correctly predicted target in common with an experimental target, whereas only 4% were incorrectly predicted. At the Bayesian score of 50, a total of 58,428 biological activities are predicted, 56.7% of which are correct, representing a 153-fold enrichment over random. The predicted false-negative rate is 13%, whereas 26,828 false positives are apparently predicted. As the measured ligand-target matrix is only 0.4% full, many of the false-positive predictions may indeed still be true.

Figure 3: Bayesian predictions of pharmacology.

Relationship between Bayesian confidence levels and number of predictions from 23,792 compounds with 55,781 measured biological activities across 698 activity models.

In addition to predicting primary pharmacology, we wanted to ascertain whether the models could be used to predict polypharmacology. To explore this problem, we have done a preliminary investigation with Cerep's (Paris) 'BioPrint' data set, which is a nearly complete matrix of measured activities of 997 compounds against 316 targets. Results of these studies can be found in the Supplementary Figures 1 and 2 online and Supplementary Tables 3 and 4 online. These initial studies indicate that probabilistic models built from integrated medicinal chemistry SAR data are a promising approach for predicting primary pharmacology across a large number of protein targets. In terms of polypharmacology, intra-gene family promiscuity is predicted with the highest confidence. Inter-gene family interactions are a much harder problem because of the sparse nature of the ligand-target matrix.

Relationship between molecular properties and target class

We calculated a set of physicochemical descriptors for all compounds in the database to investigate the relationship between target class and the physicochemical properties of ligands13,25. The protein sequences assigned to each of the pharmacological targets were classified into gene families. Distinct differences in the distribution of molecular properties between sets of compounds active against different gene families are observed (Table 3, Fig. 4 and Supplementary Fig. 3 online). For example, the mean molecular weight (MW) of ligands binding to aminergic GPCRs is 378 Da (s.d. = 93 Da), whereas the mean MW of peptide GPCR ligands is greater at 514 Da, but with a wider spread (s.d. = 202 Da). Ligands for the nuclear hormone receptors are the most lipophilic, as measured by calculated octanol/water partition coefficient (clogP), mirroring the properties of steroids. Overall, the properties of the synthetic ligands reflect the differences in the properties of the endogenous ligands for each target class.

Table 3 Molecular properties of gene family ligands

Figure 4: Molecular weight (MW) distribution of compounds by gene family.

See Supplementary Figure 3 for the distributions of MW, clogP, number of rotatable bonds and number of hydrogen-bond acceptors, by gene family.

The distribution patterns illustrate that, although there are distinctions in the physical properties of the ligands, using a single property to discern separate gene families is too crude. We wanted to investigate whether ligands for specific gene families may be selected within a range of property parameters. Using a 184,687-compound subset of the data as a training set, linear discriminant analysis was used to classify 41,823 compounds by target class using only the calculated physicochemical molecular properties. The data set used for the linear discriminant analysis consisted of the subset of compounds that bind to members of exactly one target class. Overall, this simplistic method successfully classified 34% of ligands to their respective target classes, with an overall enrichment ratio over random of 6.9 (Supplementary Table 5 online). The results are interesting as they suggest that simple calculated molecular properties can be used as a crude classifier of a compound's biological activity, by gene family.

Industrial trends of compounds, targets and attrition

We have witnessed a remarkable growth in the number of reported targets and compounds disclosed in the medicinal chemistry literature, mirroring the rise in investment in pharmaceutical research. In recent years, the number of targets screened, including selectivity counter-screens, published in the medicinal chemistry literature, has been growing drastically. Screening data on nearly 900 proteins are currently published each year, of which >500 molecular targets are reported with potent chemical matter (that is, IC50 < 100 nM). Currently, potent novel chemical tools and leads are first disclosed for 80–100 new molecular targets each year (Fig. 5a). No doubt, this is a conservative estimate as many new compounds and targets are only disclosed in patents, which are not included in this initial literature analysis. The rate of discovery of chemical tools for new targets doubled from an average of 30 new targets with leads being disclosed per year in the 1980s to an average of 60 new targets per year in the 1990s. In comparison, an average of four new targets, for first-in-class drugs, reached the market each year during the 1990s13.

Figure 5: Trends in medicinal chemistry of compounds in the database.

(a) Targets and new targets disclosed in the literature per year. (b) Changes in the pharmaceutical industry's target portfolio over time, as derived from the published literature. (c) Increase in molecular weight over time of published compounds. (d) Changes in relative median MW among aminergic GPCRs, peptide GPCRs and all compounds through subsequent stages of clinical development. The number of launched drugs is the worldwide approved count, irrespective of route of administration. The numbers of compounds in each clinical stage are as follows. Total number of drugs in Phase I, 930; Phase II, 1,248; Phase III, 389; Approved, 1,631. Number of aminergic GPCR drugs in Phase I, 83; Phase II, 136; Phase III, 41; Approved, 185. Number of peptide GPCR drugs in Phase I, 53; Phase II, 100; Phase III, 17; Approved, 35. t-tests between aminergic GPCRs and peptide GPCRs indicate that the inter-gene family differences in MW are statistically significant with probability that the difference is due to chance P < 0.0001, as is the overall decline in MW of aminergic GPCRs and all compounds between preclinical and approved phases. The decline of peptide GPCRs between these two phases is significant with P 0.009.

That said, we have yet to see the increase in new targets with leads translating into a proportionate increase in the number of approved first-in-class drugs. An analysis of the targets of published compounds reveals some significant trends in the changing character of the industry's portfolio of targets and target classes (Fig. 5b), such as a relative decline in proportion of aminergic GPCRs in the industry's target portfolio and an increase in protein kinases.

Over the past 25 years, there has been a steady, inexorable rise in the median MW of reported medicinal chemistry compounds (Fig. 5c). Comparing 5-year averages from 1986–1990 to those of 1999–2003, the median MW of all reported medicinal chemistry compounds in the literature rose 68 Da (20%) from 354 Da to 422 Da, respectively. Interestingly, this growth is also reflected in the increase of the median MW of disclosed ligands for several gene families. For example, compounds binding to aminergic GPCRs have increased in MW by around 56 Da, from 337 Da to 393 Da between the two 5-year periods. No significant increase in mean or median potency is observed in the data to explain the increase in MW. Even so, this rise in MW contrasts with the steady state of the mean MW of approved drugs26 and the steady decline in MW through each subsequent stage of clinical development and increase in the proportion of compounds that are rule-of-five compliant27,28 (Fig. 5d).

Of course, these calculations combine all target classes together; in contrast, the industry's target portfolio is unlikely to be in a steady state, with some target classes emerging and others declining in popularity. The relative difference in molecular properties among the gene families is also reflected in compounds in clinical development; however, again we notice that, even within a gene family, the median MW of compounds surviving subsequent clinical phases is declining slightly (Fig. 5d).

Degrees of druggability

A key objective of our global analysis of pharmacological space is to build the foundation of probabilistic approaches to drug discovery. Trends from marketed and investigational drugs indicate that oral drug space is limited by the biophysical barriers to absorption and permeability in the human body15,26,29,30,31,32,33,34,35,36,37,38,39,40. Because we have observed that the molecular properties of ligands are correlated with their target class, it follows that we should be able to identify those targets with a higher probability to produce drug-like chemical matter. Rather than considering target druggability as a binary state, it can be thought of as a probabilistic continuum, where two targets may both be classified as druggable13 but may exhibit considerable differences in their probabilities of success.

Lipinski introduced the concept of upper physicochemical property limits, above which drug permeability and absorption are less likely15. Like Lipinski, we use the simple molecular properties of clogP, number of hydrogen-bond acceptors (H-acc) and number of hydrogen-bond donors (H-don) as the dimensions of a reduced chemical space. Oral drugs are still the primary focus of pharmaceutical research; therefore, we calculated the properties of 617 approved oral drugs in the reduced chemical space (for which we calculated a centroid at MW = 316, clogP = 2.3, H-acc = 4 and H-don = 2). Figure 6a illustrates the population distribution of oral drugs in two-dimensional molecular property space as an interpolated contour map.

Figure 6: Chemical space of drugs and leads.

(a) Molecular property space of oral drugs represented as an interpolated contour matrix map of MW and clogP molecular property space occupied by 615 oral drugs using a 1,000 Da cutoff in MW and (−10,10) cutoff in clogP. The map is color-coded by the number of compounds occupying each cell (MW = 25 Da × clogP = 0.5). The matrix contour interpolation was drawn using R49 (Supplementary Data on oral drugs online). (b) Scatter plot of the median MW and median clogP for all compounds potently active (<100 nM) against oral human drug targets, for which SAR data was identified. (c) Scatter plot of the median MW and median clogP for all compounds potently active (<100 nM measured activity) against human drug targets with chemical tools or drugs.

In terms of drug targets, Figure 6b shows the distribution of median molecular properties for all compounds for each of the human oral-drug targets. For each target, the molecular properties are averaged over all its potent active compounds (<100 nM), including oral drugs and leads. Figure 6c shows the same quantities, this time for all human targets with potent active compounds. Comparison of these two figures shows that a significant number of targets are outside the rule-of-five boundaries.

Given the set of active compounds observed for a target, could the ligand properties in reduced chemical space provide a guide to quantifying the likelihood of the target to produce an oral drug? As a first approximation, the degree of druggability of the target can be described as the distance DT between the target T and the oral drugs, in reduced chemical space. This distance is expressed as a function of the deviation of the centroid of each target from the ideal value of the oral-drugs distribution (see Methods section). The resulting distance ranges from 0 to 1, with ideal value being 0. If we compare the results for all human targets (excluding known drug targets) versus human oral drug targets, we observe an enrichment in the degree of druggability of drug targets versus all the remaining human targets. We find that 87% of human oral-drug targets have DT ≤ 0.6, and 65% have DT ≤ 0.4. Of the remaining human targets, 68% have DT ≤ 0.6, and 39% have DT ≤ 0.4. This means that 200 of the remaining targets have a relatively high degree of druggability (DT ≤ 0.4), but have yet to realize this potential.

Discussion

The large-scale integration of medicinal chemistry and pharmacological data enables, for the first time, the global surveying and navigation of the biologically active chemical space (pharmacological space). Our initial investigations illustrate how the pharmacological target space of potential drug targets is a function of the physicochemical property filters applied to the ligands13,25. The number of proteins for which chemical tools have been identified is significantly higher than previous estimates11,12,13. The compilation and dissemination of chemical tools identified in a global survey, such as this, could be the basis of a rich chemical toolbox for chemogenomics7,25,41,42, provided that the proper legal safeguards and respect for intellectual property are observed.

The comprehensive cataloging of biologically active chemicals also fosters the development of systematic ontologies for pharmacology and medicinal chemistry43,44. The concept of relating proteins in chemical space by polypharmacology interactions provides the foundation for a ligand-based protein classification and valuable resource for understanding the molecular basis for compound promiscuity5,45,46. Our initial focus has been on in vitro binding and selectivity data. Although molecular data provide us with invaluable insights into molecular recognition, ultimately they need to be integrated with gene expression and phenotypic end-points from in vivo and clinical observations if we are to capture the relationships between molecular binding across the proteome with efficacy or toxicity.

In our opinion, two interesting potential applications of this work are polypharmacology and probabilistic modeling. The mapping of polypharmacology networks enables us to start considering the rational design of selectively promiscuous agents, thereby expanding the opportunity space for new medicines. Approaching drug discovery as a probabilistic enterprise based on a priori knowledge with an understanding of the varying degrees of druggability, promiscuity and attrition risks may be a significant advance in attempting to increase research productivity. As the vast majority of all drug discovery projects and clinical candidates fail the exacting criteria for safe human medicines, what we are left with are the learning and data that can contribute to the refinement of predictive models, for the benefit of all. Realization of the importance of the integration of our accumulated data can provide the basis for a significant improvement in our knowledge of success factors in the drug discovery enterprise.

Methods

Database and data model.

Our physical database consists of a single central Oracle 9.2 data warehouse. We store chemical structures as Simplified Molecular Input Line Entry Specification (SMILES) strings (http://www.daylight.com) and we use the Daylight DayCart Oracle Cartridge (http://www.daylight.com/) for structure indexing and manipulation. We chose SMILES as a database-friendly representation, as it is a compact, simple character syntax, encoding a self-contained language with its own controlled vocabulary and enabling unique canonical representations of structures, in which stereochemical descriptions can easily be defined or relaxed when querying structural data. Our data model is fully normalized to avoid bias toward specific data queries. It is chemo-centric, in that we use chemical structures as the key to information storage and retrieval. This means that all the different entity types are ultimately connected to chemical structures. We are aware that methods for chemical representation are not fully mature and sometimes subjective. Consider for instance the perception of tautomeric equivalence (for example, the two unsubstituted nitrogens in an imidazole ring, one of which nominally needs to have a hydrogen atom attached) and tautomeric relations (where one tautomer may be considered more stable and thus the preferred drawing form or indeed where the tautomeric forms are considered to require chemical transformation). As a consequence, related software and rules are likely to change, expand and improve with time. For this reason we have designed the data model to handle multiple concurrent representations (that is, multiple SMILES strings) for any given compound. This way a fully flexible view of chemical structures and their connections can be achieved.
We produced a single unified data warehouse integrating, by chemical structure, protein sequence and indication, Pfizer's structure-activity data (e.g., IC50, EC50 (concentration of a compound where 50% of its effect is observed), Ki, Kd, excluding high-throughput screening percentage inhibition data), which contains data from legacy Pfizer (New York), Warner-Lambert (formerly of Morris Plains, NJ, USA, now part of Pfizer) and Pharmacia (formerly of Kalamazoo, MI, USA, now part of Pfizer); the Inpharmatica (London) StARLITe database, which contains data extracted from Journal of Medicinal Chemistry (issues January 1980–September 2004) and Bioorganic & Medicinal Chemistry Letters (issues January 1991–September 2004); the Cerep BioPrint database5 and summary data from the Thomson (New York) Current Drugs Investigational Drugs Database (IDDB) (http://scientific.thomson.com/products/iddb/). The current database contains 4.8-million unique chemical structures with protein identifiers and sequences assigned to 2,876 targets with assay measurements; 526,548 assay measurements are related to 276,122 active chemical structures.
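The idea of a chemo-centric model that holds multiple concurrent representations per compound can be illustrated with a small in-memory sketch. The actual system is a normalized Oracle schema, which is not described in enough detail to reproduce, so everything below (class names, the notion of a named representation "scheme", the lookup key) is a hypothetical illustration of the design principle only. SMILES strings are treated as opaque keys and are not parsed.

```python
from dataclasses import dataclass, field


@dataclass
class Compound:
    compound_id: int
    # Representation scheme -> SMILES string; scheme names are hypothetical,
    # e.g. "as_registered" or "canonical_tautomer".
    smiles_by_scheme: dict = field(default_factory=dict)


class StructureIndex:
    """Maps (scheme, SMILES) pairs to compounds, several schemes per compound."""

    def __init__(self):
        self._compounds = {}
        self._by_smiles = {}

    def register(self, compound_id, scheme, smiles):
        """Add or replace one representation of a compound."""
        compound = self._compounds.setdefault(compound_id, Compound(compound_id))
        old = compound.smiles_by_scheme.get(scheme)
        if old is not None:
            # Drop the stale key when a scheme's rules (and output) change.
            self._by_smiles.pop((scheme, old), None)
        compound.smiles_by_scheme[scheme] = smiles
        self._by_smiles[(scheme, smiles)] = compound_id
        return compound

    def lookup(self, scheme, smiles):
        """Return the compound id for a representation, or None if unknown."""
        return self._by_smiles.get((scheme, smiles))
```

Keeping the scheme as part of the key means that a change in standardization rules only requires re-registering one scheme, while other representations of the same compound remain valid.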

Extraction, transformation and loading (ETL).

Before being fully integrated into our database, the original data sources were first loaded into Oracle staging tables. This was achieved using a combination of tools (Servefile's Java-based data loader, Oracle, Pipeline Pilot). The staging tables were processed to perform data selection, cleaning, mapping and standardization. This ETL procedure is the most critical and time-consuming part of knowledge discovery in databases involving a blend of disciplines, namely scientific-domain expertise, logic and informatics. Data fields from the different data sources were identified and selected. Metadata tables were created to map together different conceptualizations of the same entities (that is, different ontologies). Data quality issues, ranging from spelling mistakes to entity misassignment, were addressed and contained or flagged. Chemical structures were standardized at different levels depending on the chosen representation. Wherever practical, all entities that could be enumerated (e.g., units of measure, country codes) were mapped to controlled vocabularies. At the end of this process, data were fully integrated at a scientific level for data mining. Diseases were mapped onto a disease taxonomy derived from the Medical Dictionary for Regulatory Activities (MedDRA). Protein sequences were directly mapped to assays in all cases where the protein could be unambiguously identified.
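One of the standardization steps described above, mapping units of measure to a controlled vocabulary, can be sketched as follows. The alias table and function names are hypothetical and far smaller than anything a real ETL pipeline would need. The point is only that unrecognized values are rejected and flagged, not silently guessed.

```python
import math

# Controlled vocabulary of concentration units and their factor to molar.
_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

# Spelling variants seen in source data (illustrative; note that case matters,
# because "mM" and "M" must stay distinct).
_UNIT_ALIASES = {"µM": "uM", "μM": "uM", "microM": "uM", "nanoM": "nM"}


def to_molar(value, unit):
    """Convert a concentration to molar, or raise if the unit is unmapped."""
    key = unit.strip()
    key = _UNIT_ALIASES.get(key, key)
    if key not in _TO_MOLAR:
        raise ValueError(f"unmapped unit: {unit!r}")
    return value * _TO_MOLAR[key]


def p_activity(molar):
    """Negative log10 of a molar concentration (for example pIC50 or pKi)."""
    return -math.log10(molar)
```

On this scale, the 10 μM activity threshold used throughout the paper corresponds to a value of 5, and 100 nM corresponds to 7.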

Data access.

To be of practical use, a data repository needs to be easily accessible. This requirement is at odds with the principles of data normalization47 and flexibility of representation. It is therefore customary to separate the data warehouse (and data-loading activities) from access layers (data retrieval). The latter are usually data marts, sets of database tables where data are regrouped in a different way, optimized to answer specific questions. The advantage of having data marts is that queries are prepackaged and therefore faster. The drawback is that data must be copied from the data store to the data marts. This causes additional issues such as disk space shortage and scheduling of data updates and downtime. We believe that data marts are the right solution where the most common queries are already known and routinely performed. Because our database system was still highly experimental, and the number of questions we wanted to ask very high, we designed an alternative approach. We built a set of components (using Scitegic's (San Diego, CA, USA) Pipeline Pilot 4.5; http://www.scitegic.com/) to query, manipulate and filter the data. The lower-level components could be combined, and results from a query could be refined and/or fed into subsequent queries, generating sets of hit lists. This approach offered two advantages. The first was to perform an experimental benchmarking of the database, to find where data marts would be most needed to improve performance, and how the most commonly asked questions could be identified, grouped together and packaged. The second was to offer great flexibility in interrogating the database, allowing us to cross-link the different entities in every possible way. We found that the performance drawback was acceptable for a system at this stage of maturity, mostly used for statistical analysis and post-processing, rather than for fast online data retrieval.

Preparation and analysis of chemical structures.

In the study described here, all chemical structures were standardized using DayCart 4.82. A further processing step, to remove inconsistencies and identify salts and mixtures, was performed using a Pipeline Pilot protocol written in house. For the purpose of this study, all salts were stripped off the structures and the canonical tautomer of each resulting structure was identified using a standard Pipeline Pilot component. The resulting desalted canonical tautomers were loaded onto the database and used for structure matching. Molecular properties were either stored or calculated on the fly using standard Pipeline Pilot components.
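As a rough illustration of salt stripping, the sketch below keeps the largest fragment of a dot-separated SMILES string. It is a crude stand-in under stated assumptions, not the in-house Pipeline Pilot protocol: it assumes SMILES input, counts atoms with a simple regular expression (bracketed hydrogens would be counted too), and does not neutralize charges or canonicalize tautomers.

```python
import re

# Matches one atom token: a bracket atom, a two-letter halogen,
# or a single-letter organic-subset atom (aliphatic or aromatic).
ATOM_TOKEN = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFIbcnops]")


def atom_count(fragment: str) -> int:
    """Approximate atom count of one SMILES fragment."""
    return len(ATOM_TOKEN.findall(fragment))


def strip_salts(smiles: str) -> str:
    """Keep the fragment with the most atoms (first one wins on ties)."""
    fragments = smiles.split(".")
    return max(fragments, key=atom_count)


parent = strip_salts("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]")
```

A production workflow would use a cheminformatics toolkit and a curated list of counterions; the largest-fragment rule misfires for mixtures whose components are of similar size.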

Analysis of biological activity results.

N-point results (e.g., IC50, EC50, Ki and Kd) were collected for all the molecular targets that we mapped to gene sequences. Biological assays related to more than one gene (where the particular target could not be identified or where more than one target was involved) were kept separate. The analysis here refers to the cases where a given assay was related to a single gene. Compounds were classed as active when their best measured activity, taken across all N-point measurement types, was <10 μM; this definition of active compounds is used throughout the paper. Outliers in the biological activity results were identified with a simple automated protocol based on calculating the average distance:

of each result in a set from all other results and flagging the ones where D_i − D_min > 1. Here the symbols i and j refer to results from different assay experiments for the same compound and target. D_min is the minimum distance among all pairs of these results. The flagged sets were then manually checked and the outliers removed from the analysis.
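The activity cutoff and the outlier flag can be sketched as follows. The distance measure is an assumption: the sketch takes the distance between two results to be the absolute difference of their log10 values, so that a flag threshold of 1 corresponds to one log unit, and takes the average distance of a result as the mean of its distances to all other results for the same compound and target. The function names and the use of nM are illustrative choices.

```python
import math
from itertools import combinations
from typing import List

ACTIVE_THRESHOLD_NM = 10_000.0  # <10 uM, the definition of "active" in the text


def is_active(results_nM: List[float]) -> bool:
    """A compound is active if its best (lowest) result is below 10 uM."""
    return min(results_nM) < ACTIVE_THRESHOLD_NM


def flag_outliers(results_nM: List[float], cutoff: float = 1.0) -> List[int]:
    """Return indices i for which D_i - D_min exceeds the cutoff.

    Assumed distance: |log10(a_i) - log10(a_j)| between two results.
    """
    logs = [math.log10(r) for r in results_nM]
    if len(logs) < 2:
        return []
    # Minimum distance among all pairs of results.
    d_min = min(abs(a - b) for a, b in combinations(logs, 2))
    flagged = []
    for i, a in enumerate(logs):
        others = [abs(a - b) for j, b in enumerate(logs) if j != i]
        d_i = sum(others) / len(others)  # average distance of result i
        if d_i - d_min > cutoff:
            flagged.append(i)
    return flagged


suspect = flag_outliers([10.0, 12.0, 15.0, 5000.0])
```

In this example only the 5,000 nM result is flagged; as in the text, flagged sets would still be checked manually before any result is removed.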

Polypharmacology interaction network.

The strength of polypharmacology interactions (P_ij) between two targets i and j was calculated, for all active compounds in the database, as follows:

where N_ij^tested is the number of compounds tested against both targets i and j, and N_ij is the number of compounds observed to bind to both targets i and j below the compound promiscuity threshold. A compound is considered shared between targets i and j if its potencies against the two targets differ by less than n log units (n = 1 corresponds to a tenfold difference in potency, n = 2 to a 100-fold difference and n = 3 to a 1,000-fold difference).
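The counting behind the two quantities can be sketched as below. Two points are assumptions rather than statements of the method: the interaction strength is taken to be the ratio of shared to commonly tested compounds, and the 10 μM activity cutoff is used as the threshold for binding. The data layout (best activity in nM keyed by compound and target) and the function name are hypothetical.

```python
import math
from typing import Dict, Tuple

ACTIVE_NM = 10_000.0  # assumed binding threshold (10 uM)


def polypharmacology(
    best_nM: Dict[Tuple[str, str], float], ti: str, tj: str, n: float = 1.0
) -> Tuple[int, int, float]:
    """Return (N_ij, N_ij_tested, assumed P_ij = N_ij / N_ij_tested)."""
    tested_i = {c for (c, t) in best_nM if t == ti}
    tested_j = {c for (c, t) in best_nM if t == tj}
    tested = tested_i & tested_j  # compounds tested against both targets
    shared = 0
    for c in tested:
        a_i, a_j = best_nM[(c, ti)], best_nM[(c, tj)]
        both_active = a_i < ACTIVE_NM and a_j < ACTIVE_NM
        within_window = abs(math.log10(a_i) - math.log10(a_j)) < n
        if both_active and within_window:
            shared += 1
    ratio = shared / len(tested) if tested else 0.0
    return shared, len(tested), ratio


data = {
    ("C1", "T1"): 10.0, ("C1", "T2"): 50.0,       # within one log unit
    ("C2", "T1"): 10.0, ("C2", "T2"): 5000.0,     # within three log units
    ("C3", "T1"): 100.0, ("C3", "T2"): 50000.0,   # inactive against T2
    ("C4", "T1"): 30.0,                           # never tested against T2
}
narrow = polypharmacology(data, "T1", "T2", n=1.0)
wide = polypharmacology(data, "T1", "T2", n=3.0)
```

Widening the potency window from n = 1 to n = 3 admits compound C2 as shared, which shows how the choice of n controls the density of the resulting network.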

Each log order difference in potency represents a binding energy difference of ΔΔG = −1.4 kcal/mol. We used Cytoscape48 (http://www.cytoscape.org/) to display the interaction network in Figure 1 for n = 3. The cumulative effect of polypharmacology interactions between different targets of the same or different gene families is represented by the elements of the summarized matrix in Figure 2, calculated as

Only the cells for which enough statistics were available (N_ij^tested > 10) were included in the summation. A potency-difference window of n = 1 was used.

Figure 2 was produced using Spotfire (Somerville, MA, USA) Decision Site 7.2 (http://www.spotfire.com/).
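The figure of about 1.4 kcal/mol per log unit of potency quoted above follows from the standard relation ΔG = RT ln K: a tenfold decrease in a dissociation or inhibition constant changes the binding free energy by −RT ln 10. The short calculation below is a sketch of that arithmetic; the temperatures are assumptions, because the text does not state one.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ddg_per_log_unit(temperature_K: float = 298.15) -> float:
    """Free-energy change (kcal/mol) for a tenfold gain in potency."""
    return -R_KCAL * temperature_K * math.log(10.0)


at_room_temperature = ddg_per_log_unit()      # about -1.36 kcal/mol
at_body_temperature = ddg_per_log_unit(310.15)  # about -1.42 kcal/mol
```

Both values round to the −1.4 kcal/mol used in the text, so a potency window of n = 3 spans roughly 4 kcal/mol of binding energy.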

Calculation of promiscuity indices.

We evaluated the promiscuity of a target T in three different ways (P1, P2 and P3).

The first index was defined as follows:

with

where N_actives(T) is the number of active compounds of target T and N_totalshared(T) is the number of active compounds of target T for which the compound promiscuity index (P_C) > 1. (P_C of a compound C is defined as the total number of targets that the compound is active against.)
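The three counts defined above can be computed directly from a list of active compound-target pairs. The sketch below covers only those counts, not the first index itself; the input layout and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def promiscuity_counts(
    active_pairs: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, int], Dict[str, int], Dict[str, int]]:
    """Return (compound promiscuity, N_actives per target, N_totalshared per target)."""
    targets_of = defaultdict(set)
    for compound, target in active_pairs:
        targets_of[compound].add(target)
    # Compound promiscuity: number of targets the compound is active against.
    pc = {c: len(ts) for c, ts in targets_of.items()}
    n_actives: Dict[str, int] = defaultdict(int)
    n_shared: Dict[str, int] = defaultdict(int)
    for c, ts in targets_of.items():
        for t in ts:
            n_actives[t] += 1
            n_shared[t] += int(pc[c] > 1)  # active compound also hits another target
    return pc, dict(n_actives), dict(n_shared)


pairs = [("C1", "T1"), ("C1", "T2"), ("C2", "T1"), ("C3", "T3")]
pc, n_actives, n_shared = promiscuity_counts(pairs)
```

Here C1 is the only compound with a promiscuity index above 1, so it contributes to the shared count of both T1 and T2, while T3 has none.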

The second index is

where I_Tj is a matrix identical to the polypharmacology matrix, with all nonzero values of P_ij replaced by ones.
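A matrix-based index of this kind can be sketched as a row operation. The sketch assumes that the second index is the row sum of the binarized matrix for target T, and it also shows the row sum of the unbinarized matrix, which the text introduces next as the third index. Storing the matrix as a nested dictionary of off-diagonal entries is an illustrative choice, not the authors' data structure.

```python
from typing import Dict

Matrix = Dict[str, Dict[str, float]]


def binarized_row_sum(P: Matrix, target: str) -> int:
    """Count the targets with a nonzero interaction with `target`."""
    return sum(1 for value in P[target].values() if value != 0)


def weighted_row_sum(P: Matrix, target: str) -> float:
    """Sum the interaction strengths along the row of `target`."""
    return sum(P[target].values())


# Hypothetical symmetric polypharmacology matrix (off-diagonal entries only).
P = {
    "T1": {"T2": 0.25, "T3": 0.0},
    "T2": {"T1": 0.25, "T3": 0.5},
    "T3": {"T1": 0.0, "T2": 0.5},
}
links_t2 = binarized_row_sum(P, "T2")
strength_t2 = weighted_row_sum(P, "T2")
```

The binarized sum counts how many other targets share compounds with T, whereas the weighted sum also reflects how strongly they are linked.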

The third index was calculated by summing along rows of the matrix itself as

Bayesian model building.

Compounds were filtered to remove structures with MW >1,000 Da and those that failed structural quality filters (e.g., toxicophores, aggregation inhibitors, reactive groups). After the filtering, there were 617,694 experimental activities from 238,655 compounds covering 698 targets. Protein targets with fewer than ten biologically active compounds after filtering were also removed from the data set. Compound structures were transformed into FCFP_6 functional-class fingerprints. Data preparation, Bayesian analysis and model building were implemented using the Scitegic Pipeline Pilot Laplacian-corrected Bayesian classifier23,24 algorithm. This implementation of Bayesian statistics uses information from both the active and the inactive compounds in the training set, and removes from the model those features that are deemed unimportant.
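A Laplacian-corrected Bayesian score can be sketched in a few lines. The weight formula used here, log((A_F + 1) / (N_F × P_base + 1)) for a feature F seen in N_F training samples of which A_F are active, is the form commonly described in the literature for this type of classifier; that it matches the Pipeline Pilot implementation exactly is an assumption. Fingerprint features are represented as arbitrary integers, and the removal of unimportant features mentioned above is omitted.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def train(samples: List[Tuple[Set[int], bool]]) -> Dict[int, float]:
    """Return a Laplacian-corrected log weight for every training feature."""
    n_total = len(samples)
    n_active = sum(1 for _, active in samples if active)
    base_rate = n_active / n_total  # P_base: overall fraction of actives
    seen: Counter = Counter()
    seen_active: Counter = Counter()
    for features, active in samples:
        for f in features:
            seen[f] += 1
            if active:
                seen_active[f] += 1
    return {
        f: math.log((seen_active[f] + 1) / (seen[f] * base_rate + 1))
        for f in seen
    }


def score(weights: Dict[int, float], features: Iterable[int]) -> float:
    """Sum the weights of the features present; unseen features add nothing."""
    return sum(weights.get(f, 0.0) for f in features)


training = [({1, 2}, True), ({1, 3}, True), ({4, 3}, False), ({4, 5}, False)]
weights = train(training)
active_like = score(weights, {1, 2})
inactive_like = score(weights, {4, 5})
```

The correction pulls the weight of rarely seen features toward zero: feature 2, seen once in an active compound, receives a smaller weight than feature 1, seen twice, and feature 3, seen equally in both classes, receives a weight of exactly zero.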

Linear discriminant analysis.

The biologically active compounds were filtered by chemical quality criteria to remove aggregation inhibitors and compounds with potentially reactive groups. Compounds active against more than one gene family were also removed. Of the remaining compounds, 184,687 were selected as a training set for the linear discriminant analysis (as implemented in R49) to classify the gene family activity on a test set of 41,823 compounds. The classification was based on the following calculated molecular properties: MW, number of hydrogen-bond acceptors, number of hydrogen-bond donors, number of rotatable bonds, molecular surface area, molecular polar surface area, number of ionizable centers, clogP, Andrews' binding energy50 and predicted molecular solubility.

Distances in reduced chemical space.

We prepared a set of 617 US Food and Drug Administration–approved oral drugs and calculated their MW, clogP, number of hydrogen-bond acceptors (H-acc) and number of hydrogen-bond donors (H-don) using standard and in-house Pipeline Pilot components. In these components, the H-acc atoms are defined as heteroatoms (oxygen, nitrogen, sulfur or phosphorus) with one or more lone pairs, excluding atoms with positive formal charges, amide and pyrrole-type nitrogens, and aromatic oxygen and sulfur atoms in heterocyclic rings. H-don atoms are defined as heteroatoms (oxygen, nitrogen, sulfur or phosphorus) with one or more attached hydrogen atoms. These four properties are used to characterize the set in reduced chemical space. We collected all the targets that either have potent active compounds (below 100 nM) or are drug targets, and calculated the centroid {MW_T, clogP_T, H-acc_T, H-don_T} for each of these targets. The distance in reduced chemical space is defined for each target T as

where the function f_k, k ∈ {MW, clogP, H-don, H-acc}, represents the distribution of values of the molecular properties of oral drugs, normalized so that f_k ∈ [0, 1].
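The per-target centroid in reduced chemical space can be sketched as the arithmetic mean of each of the four properties over a target's compounds. Taking the centroid to be the arithmetic mean is an assumption, the property values are invented for illustration, and the distance function itself is not reproduced here.

```python
from typing import Dict, List

PROPERTIES = ("MW", "clogP", "H-acc", "H-don")


def centroid(compounds: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean of each reduced-space property over a target's compounds."""
    if not compounds:
        raise ValueError("a centroid needs at least one compound")
    return {
        k: sum(c[k] for c in compounds) / len(compounds) for k in PROPERTIES
    }


# Hypothetical property values for two compounds of one target.
target_compounds = [
    {"MW": 300.0, "clogP": 2.0, "H-acc": 4.0, "H-don": 1.0},
    {"MW": 400.0, "clogP": 4.0, "H-acc": 6.0, "H-don": 3.0},
]
center = centroid(target_compounds)
```

Each centroid coordinate would then be compared with the corresponding oral-drug distribution f_k to obtain the distance for that target.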

Note: Supplementary information is available on the Nature Biotechnology website.