Abstract
Target identification (identifying the correct drug targets for each disease) and target validation (demonstrating the effect of target perturbation on disease biomarkers and disease end-points) are essential steps in drug development. We showed previously that biomarker and disease endpoint associations of single nucleotide polymorphisms (SNPs) in a gene encoding a drug target accurately depict the effect of modifying the same target with a pharmacological agent; others have shown that genomic support for a target is associated with a higher rate of drug development success. To delineate drug development (including repurposing) opportunities arising from this paradigm, we connected complex disease- and biomarker-associated loci from genome wide association studies (GWAS) to an updated set of genes encoding druggable human proteins, to compounds with bioactivity against these targets and, where these were licensed drugs, to clinical indications. We used this set of genes to inform the design of a new genotyping array, to enable druggable genome-wide association studies for drug target selection and validation in human disease.
Introduction
Only 4% of drug development programmes yield licensed drugs (1, 2), largely because of two unresolved systemic flaws: (1) preclinical experiments in cells, tissues and animal models and early phase clinical testing to support drug target identification and validation are poorly predictive of eventual therapeutic efficacy; and (2) definitive evidence on the validity of a new drug target for a disease is delayed until late phase development (in phase II or III randomised controlled trials; RCTs). Reasons for poor reliability of preclinical studies include suboptimal experimental design with infrequent use of randomisation and blinding (3); species differences; inaccuracy of animal models of human disease (4, 5); and over-interpretation of nominally significant experimental results (6–8). Human observational studies can mislead for reasons of confounding and reverse causation. Evidence on target validity from phase I clinical studies can also be inadequate (since phase I studies primarily investigate pharmacokinetics and tolerability, are typically small in size, of short duration and measure a narrow range of surrogate outcomes, often of uncertain relevance to perturbation of the target of interest) (9). Since the target hypothesis advanced by preclinical and early phase clinical studies is all too frequently false, expensive late-stage failure in RCTs from lack of efficacy is a common problem affecting many therapeutic areas (10), posing a threat to the economic sustainability of the current model of drug development.
Genetic studies in human populations imitate the design of an RCT without requiring a drug intervention (11–13). This is because genotype is determined by a random allocation at conception according to Mendel’s second law (Mendelian randomisation - MR) (12, 14). Single nucleotide polymorphisms (SNPs) acting in cis (i.e. variants in or near a gene that associate with the activity or level of the encoded protein) can therefore be used as a tool to deduce the effect of pharmacological action on the same protein in an RCT. Numerous proof of concept examples have now been reported (15, 16, 11, 17, 13, 18, 19), including the striking correlation between the association of 80 circulating metabolites with a SNP in the HMGCR gene that encodes the target for statin drugs, and the effect of statin treatment on the same set of metabolites (20). SNPs acting in cis are a general feature of the human genome (21); and population and patient datasets with stored DNA and genotypes linked to biological phenotypes and disease outcome measures are now widely available for this type of study.
By extension, disease-associated SNPs identified by GWAS could be re-interpreted as an under-utilised source of randomised human evidence to aid drug target identification and validation. For example, loci for type-2 diabetes identified by GWAS include genes encoding targets for the glitazone and sulphonylurea drug classes already used to treat diabetes (22, 23). Apparently sporadic observations such as this suggest that numerous, currently unexploited disease-specific drug targets should exist among the thousands of other loci identified by GWAS and similar high quality genetic association studies. Recent studies of advanced or completed drug development programmes (mostly based on established approaches to target identification) have also indicated that those with incidental genomic support had a higher rate of developmental success (24–27).
Fulfilling the potential of GWAS (and studies using disease-focused genotyping arrays) for drug development requires mapping disease- or biomarker-associated SNPs to genes encoding druggable proteins and to any allied drugs and drug-like compounds. The set of proteins with potential to be modulated by a drug-like small molecule has been predicted on the basis of sequence and structural similarity to the targets of existing drugs, the set of encoding genes being referred to as the druggable genome. Hopkins and Groom identified 130 protein families and domains found in targets of drug-like small molecules known at the time, and over 3000 potentially druggable proteins containing these domains (28). A similar estimate was made by Russ and Lampel, using a later human genome build (29). Kumar et al. utilized these privileged protein families (plus other families of particular relevance to cancer) to manually curate lists of druggable proteins for inclusion in the dGene data set (30). More recently, the Drug-Gene Interaction database (DGIdb) has been developed (31), which integrates data from each of the previous efforts together with a recently compiled list of drug candidates and targets in clinical development (32) as well as information from the PharmGKB (33), Therapeutic Target Database (TTD) (34) and DrugBank (35) databases.
However, earlier estimates of the druggable genome predated contemporary genome builds and gene annotations, and also did not explicitly include the targets of bio-therapeutics, which formed more than a quarter of the 45 new drugs approved by the FDA’s Center for Drug Evaluation and Research in 2015 (36), reflecting their increasing importance in pharmaceutical development. We therefore updated the set of genes comprising the druggable genome. We then linked GWAS findings curated by the National Human Genome Research Institute (NHGRI) and European Molecular Biology Laboratory–European Bioinformatics Institute (EMBL-EBI) GWAS catalog (37) to this updated gene set, and also to encoded proteins and associated drugs or drug-like compounds curated in the ChEMBL (38) and First Databank (39) databases. We used the linkage to explore the potential for genetic associations with complex diseases and traits to inform drug target identification and validation, as well as to repurpose drugs effective in one indication for another. Additionally, to better support future genetic studies for disease-specific drug target identification and validation, we assembled the marker content of a new genotyping array designed for high-density coverage of the druggable genome and compared this focussed array with genotyping arrays previously used in GWAS.
Results
Re-defining the druggable genome
We estimated 4,479 (22%) of the 20,300 protein coding genes annotated in Ensembl v.73, to be drugged or druggable. This adds 2,402 genes to previous estimates made by Hopkins and Groom or Russ and Lampel by inclusion of novel targets of first-in-class drugs licensed since 2005; the targets of drugs currently in late phase clinical development; information on the growing number of pre-clinical phase small molecules with protein binding measurements reported in the ChEMBL database; as well as genes encoding secreted or plasma membrane proteins that form potential targets of monoclonal antibodies and other bio-therapeutics. A set of 680 genes that was included in earlier estimates but not our data set consists mainly of olfactory receptors and phosphatases; both protein families have significant limitations for future exploitation as drug targets (40, 41) (see Figure 1 and Methods section). We stratified the druggable gene set into 3 tiers corresponding to position in the drug-development pipeline. Tier 1 (1,427 genes) included efficacy targets of approved small molecules and biotherapeutic drugs as well as clinical-phase drug candidates. Tier 2 comprised 682 genes encoding targets with known bioactive drug-like small molecule binding partners as well as those with significant sequence similarity to approved drug targets. Tier 3 contained 2,370 genes encoding secreted or extracellular proteins, proteins with more distant similarity to approved drug targets, and members of key druggable gene families not already included in Tiers 1 or 2 (GPCRs, nuclear hormone receptors, ion channels, kinases and phosphodiesterases). A full list of genes is provided in Supplementary File S1. An overview of the 15 most frequently occurring protein domain types for each tier can be found in Supplementary Table 1, based on the Pfam-A database of protein families (see Methods section Pfam-A domain content).
Connecting loci identified by GWAS to the druggable genome
We retrieved 21,406 associations from 2,155 GWAS, of which 9,178 surpassed the significance threshold of p ≤ 5×10⁻⁸ (see Methods section). The retrieved associations spanned 315 Medical Subject Heading (MeSH) disease terms, which can be stratified into twenty-four MeSH root disease areas and three MeSH Psychiatry and Psychology areas (Table 1). Variants associated with common diseases and biomarkers had a median minor allele frequency of 0.29 (interquartile range, IQR 0.21) based on 7,387 GWAS-significant records with risk allele frequency data, reflecting the preponderance of common variants on widely used genotyping arrays. The median odds ratio (OR) for GWAS-significant studies of disease end-points was 1.24 (IQR 0.31), based on 3,367 GWAS-significant results with effect size data. We examined sequence ontology consequence types (42) of disease- and biomarker-associated variants and found most to be non-coding, mainly intronic, presumably altering or marking variants that alter mRNA expression or availability, or marking variants that alter structure or activity of encoded proteins (Supplementary Figure S1C).
Of the 9,178 GWAS-significant associations, 8,879 mapped to 5,084 unique intervals, each defined as containing all SNPs in linkage disequilibrium (LD; r² ≥ 0.5) with the SNP exhibiting the most significant association, applying an upper physical bound of 1 Mb either side of this variant (see Methods section). The remaining 299 associations were either not in LD with any other variants, or not present in the 1000 Genomes phase 3 panel. Such associations were assigned a nominal interval of 2.5 kbp either side of the most significantly associated SNP. The frequency distributions of unique genes (and druggable genes) in LD intervals corresponding to unique significant associations were both right skewed (Figure 2), and there was a correlation between LD interval size and the number of resident genes (Supplementary Figure 3).
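The interval definition above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Proxy` record and the `ld_interval` function are hypothetical names, and the sketch assumes that variants lying beyond the 1 Mb bound are simply ignored (the text does not say whether such variants are dropped or the interval is clipped). Only the three thresholds (r² ≥ 0.5, 1 Mb, 2.5 kbp) come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

R2_THRESHOLD = 0.5          # minimum r^2 with the lead SNP (from the text)
MAX_FLANK_BP = 1_000_000    # upper physical bound either side of the lead SNP
NOMINAL_FLANK_BP = 2_500    # nominal flank when no LD partners are found


@dataclass(frozen=True)
class Proxy:
    """Hypothetical record: a variant on the same chromosome as the lead SNP."""
    position: int  # base-pair coordinate
    r2: float      # LD (r^2) with the lead SNP


def ld_interval(lead_pos: int, proxies: List[Proxy]) -> Tuple[int, int]:
    """Return (start, end) of the LD interval around a lead SNP.

    Assumption: variants further than MAX_FLANK_BP from the lead SNP are
    ignored rather than used to clip the interval.
    """
    in_ld = [
        p.position
        for p in proxies
        if p.r2 >= R2_THRESHOLD and abs(p.position - lead_pos) <= MAX_FLANK_BP
    ]
    if not in_ld:
        # No LD partners (or SNP absent from the reference panel):
        # fall back to the nominal 2.5 kbp flank on each side.
        return max(0, lead_pos - NOMINAL_FLANK_BP), lead_pos + NOMINAL_FLANK_BP
    positions = in_ld + [lead_pos]
    return min(positions), max(positions)
```

For a lead SNP at position 5,000,000 with partners at 4,990,000 (r² 0.8) and 5,020,000 (r² 0.6), the sketch returns the interval 4,990,000 to 5,020,000; a partner with r² 0.3, or one 2 Mb away, does not extend it.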
Of the 5,084 unique LD intervals, 1,533 (30.2%) contained a single gene. Of these, 532 also contained a single gene from the druggable set: 233 from Tier 1, 76 from Tier 2 and 223 from Tier 3. Among the other intervals, 880 (17.3%) contained two genes, 511 (10.1%) contained three genes, 343 (6.7%) contained four genes and 1,281 (25.2%) contained five or more genes. Additionally, 536 (10.5%) intervals contained no gene. For the 1,624 LD intervals containing two or more genes, at least one of which was druggable, the median distance of the closest druggable gene to the reported GWAS variant was 4.98 kbp (IQR 37.7 kbp), where the distance was set to 0 bp for GWAS variants lying within a gene; a druggable gene was among the two most proximal genes in 67.1% (1,089) of these LD intervals (Figure 3). We identified a total of 3,052 genes in the druggable set that were not represented in any of the LD intervals corresponding to a GWAS association: 62.7%, 69.2% and 71.6% of Tier 1, 2 and 3 genes, respectively.
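The two proximity measures used above (distance from the GWAS variant to the closest druggable gene, set to 0 bp when the variant lies within a gene, and whether a druggable gene is among the two most proximal genes) can be sketched as follows. The `Gene` record, the function names and the example coordinates are hypothetical. Measuring distance to the nearest gene boundary is an assumption, since the text does not state which gene coordinate was used.

```python
from typing import List, NamedTuple, Optional


class Gene(NamedTuple):
    """Hypothetical gene record on the same chromosome as the variant."""
    name: str
    start: int
    end: int
    druggable: bool


def distance_to_gene(variant_pos: int, gene: Gene) -> int:
    """Distance in bp from variant to gene; 0 if the variant lies within it."""
    if gene.start <= variant_pos <= gene.end:
        return 0
    # Assumption: distance is measured to the nearest gene boundary.
    return min(abs(variant_pos - gene.start), abs(variant_pos - gene.end))


def closest_druggable_distance(variant_pos: int, genes: List[Gene]) -> Optional[int]:
    """Smallest distance to any druggable gene, or None if there is none."""
    distances = [distance_to_gene(variant_pos, g) for g in genes if g.druggable]
    return min(distances) if distances else None


def druggable_among_two_nearest(variant_pos: int, genes: List[Gene]) -> bool:
    """True if a druggable gene is the first or second most proximal gene."""
    ranked = sorted(genes, key=lambda g: distance_to_gene(variant_pos, g))
    return any(g.druggable for g in ranked[:2])


# Illustrative interval with three made-up genes.
example_genes = [
    Gene("GENE_A", 100, 200, False),
    Gene("GENE_B", 400, 500, True),
    Gene("GENE_C", 900, 1000, False),
]
```

With these made-up genes, a variant at position 150 lies inside `GENE_A` (distance 0), the closest druggable gene `GENE_B` is 250 bp away, and it ranks second, so the interval would count towards the "two most proximal genes" tally.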
Linking GWAS associations to licensed drug targets
We found that 1,291 GWAS associations defined 1,072 LD intervals containing 532 druggable genes from Tier 1, which includes the targets of licensed drugs. 479 of the intervals contained a single drug target and 593 contained two or more targets. For the set of LD intervals containing genes encoding the targets of licensed drugs, two clinically qualified curators blinded to the identity of the genes, independently evaluated the correspondence between the disease association from the GWAS and the treatment indication(s) for drug(s) acting on the target(s) encoded by a druggable gene in the interval. The curation process is described in the Methods section. Our curators identified 56 unique associations (30 unique drug targets) where the treatment indication and genetic association were precisely concordant and 13 associations (9 targets) where the indication and association came from the same disease area (e.g. a GWAS in one form of epilepsy identifying a drug target for a different form of epilepsy). 97 associations (mapping to 37 licensed drug targets) corresponded to a biomarker known to be altered by treatment with the corresponding drug (e.g. an LD interval containing the gene encoding the interleukin-6 receptor was identified in a GWAS of C-reactive protein, a biomarker known to be altered by the action of the interleukin-6 receptor blocker, tocilizumab). A further 76 associations (27 licensed drug targets) were identified through a genetic association with a mechanism-based adverse effect, e.g. in a GWAS of heart rate, the SNP rs3143709 defined an LD interval containing the gene ACHE (acetylcholinesterase) encoding the target of cholinesterase inhibitors used in the treatment of myasthenia gravis, which have the side effect of lowering heart rate (43). 
A further 32 genetic associations (corresponding to 8 targets) were with a quantitative trait that could be either a marker of therapeutic efficacy or a mechanism-based side effect, as in the case of QT interval in the context of anti-arrhythmic drug therapy. In all, GWAS ‘rediscovered’ 74 licensed drug targets through disease indications, mechanism of action or mechanism-based adverse effects (the numbers for the categories above are non-additive because some targets overlap categories). Illustrative examples of the curation are shown in Table 3.
Manual curation identified 1,523 discordant pairings of drug indications and disease associations, corresponding to 144 drug targets that were interpreted as plausible repurposing opportunities (Figure 4). After manual curation, uncertainty remained for 108 associations (52 targets) as to whether discordance represented a repurposing opportunity, or an unrecognised mechanism-based side effect. The remaining targets of licensed drugs mapped to LD intervals corresponding to GWAS traits unlikely to be of therapeutic interest (e.g. hair colour); or to a genetic association with a novel biomarker of uncertain biological function (e.g. a novel metabolite measured by a new metabolomics platform). Curators disagreed on the coding for GWAS associations corresponding to 4 licensed targets. For LD intervals corresponding to GWAS rediscoveries, the interval length was smaller, contained fewer genes, and the druggable gene was closer to the lead SNP than for those LD intervals where the indication and genetic association were discordant (Supplementary Table S2).
Translational opportunities unveiled by the data linkage
Figure 5 and Supplementary Figures S6 and S7 illustrate the result of mapping disease associations in the GWAS catalogue to the full set of druggable genes, the encoded proteins and allied compounds exhibiting binding affinity to these targets, regardless of development phase. For example, 84 studies in the GWAS catalogue reported findings pertaining to cardiovascular system diseases (39 disease sub-categories), reporting 388 GWAS associations, mapping to 228 unique LD intervals containing 670 genes, of which 135 were in the druggable set. Of these, 29 genes were either the solitary occupant or one of only a pair of genes in the LD interval. We linked all 135 druggable genes identified in the cardiovascular category to 19,844 compounds with measured activities in ChEMBL (see Methods section Linking GWAS and drug target data), of which 512 had a United States Adopted Name (USAN) or International Non-Proprietary Name (INN), or were in late phase development, and 168 of which were previously licensed drugs. Based on comparisons between GWAS phenotype terms and treatment indications in the cardiovascular category, 8 drug target indications and genetic associations were concordant (target ‘rediscovery’) and 19 were discordant. Figure 6 illustrates the results of a similar mapping exercise for seven specific diseases (type 2 diabetes, hypertension, inflammatory bowel disease, asthma, coronary heart disease, schizophrenia, and Alzheimer’s disease).
The proportion of druggable genes in LD intervals defined by GWAS SNPs for digestive system diseases (0.20, 95% CI: 0.12-0.27), neoplasms (0.15, 95%CI: 0.10-0.20), nervous system diseases (0.17, 95%CI: 0.10-0.24), cardiovascular diseases (0.20, 95%CI: 0.12-0.29), respiratory diseases (0.19, 95%CI: 0.08-0.31), skin and connective tissue diseases (0.17, 95%CI: 0.10-0.24), immune system diseases (0.19, 95%CI: 0.12-0.26) and mental health (0.16, 95%CI: 0.08-0.24) was similar to the proportion of druggable genes in the genome overall (4479/20,300 = 0.22).
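The genome-wide reference proportion quoted above (4,479/20,300) can be checked with a few lines of Python. The sketch also attaches a normal-approximation (Wald) 95% confidence interval; this is purely an assumption for illustration, because this section does not state how the reported disease-area intervals were computed, and the sketch is not expected to reproduce them. The function name `proportion_with_wald_ci` is hypothetical.

```python
import math
from typing import Tuple


def proportion_with_wald_ci(successes: int, total: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (proportion, lower, upper) using a Wald interval.

    Assumption: the Wald interval is used for illustration only; the
    source does not specify its interval method.
    """
    p = successes / total
    se = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Druggable genes as a fraction of all protein-coding genes considered.
genome_p, genome_lo, genome_hi = proportion_with_wald_ci(4479, 20300)
print(f"{genome_p:.2f} (95% CI {genome_lo:.3f}-{genome_hi:.3f})")
```

The same function could be applied to any disease area once the unit of analysis (genes, LD intervals or associations) is fixed.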
Coverage of the druggable genome by Illumina DrugDev and other widely used genotyping arrays
Capture of variation in druggable genes by the widely used genotyping arrays is illustrated in Figure 7, with reference to the 1000 genome European super population ancestry panels (44). Disease-focused genotyping arrays and whole genome arrays with fewer than 600,000 SNPs used for many of the discoveries curated in the GWAS catalogue provided less comprehensive capture of variation in the druggable genome than the more recently developed arrays with several million SNPs (e.g. the Illumina Human Omni 2.5 Exome 8 and Illumina Omni 5). However, since no array to date has been designed specifically to ensure capture of variation in genes encoding druggable targets, we designed the content for an array (the Illumina DrugDev array) utilising the Illumina Infinium platform, that combines genome-wide tag SNP content of the Illumina Human Core array with 182,375 bespoke markers in 4479 druggable genes (see Methods). The median number of variants captured per kb of the druggable genome was very similar to that of the Illumina Human Omni 2.5 Exome 8 and Illumina Omni 5 (Figure 7 and Supplementary Figures S8 and S9) with an average of around 2.5 SNPs per kbp of the druggable genome, at an average of nearly 50 variants per gene array wide, with even denser coverage of Tier 1 and 2 genes.
All available genotyping arrays captured druggable genome variation most efficiently among European descent populations and most poorly among African descent populations (Figure 7 and Supplementary Figures S8 and S9). Outside of the European populations the high density Illumina Omni arrays gave superior coverage (for both directly genotyped variants and tagged variants) to all other genotyping arrays. The Affymetrix UK Biobank array displayed similar coverage to the Illumina DrugDev array in EUR populations but less complete coverage in non-European populations. A heat map summarising the coverage for each druggable gene, stratified by tier and 1000 genomes population groups, is shown in Figure 8. Results for tagged and directly typed variants in 1000 genomes sub-populations are shown in Supplementary Figure S10.
Discussion
By first re-estimating the boundaries of the druggable genome, and then mapping biomarker and disease associated loci from GWAS to genes encoding druggable targets, we demonstrate the extent to which GWAS have already rediscovered target-disease indications or mechanism-based adverse effects of licensed drugs. These findings indicate the potential of genetic association studies to systematically and accurately identify disease-specific drug targets across the spectrum of human diseases, addressing one of the key productivity limiting steps in drug development.
For example, we found substantial potential for repurposing of drugs with licensed indications from one disease area to another (Figure 4), in keeping with previous analyses from the GWAS catalog that indicated that 17% of genes exhibit associations with more than one phenotype (45). We also identified potential to progress or reposition compounds at earlier developmental stages, by mapping drug target loci implicated by GWAS to the ChEMBL drug target annotations (Figure 5).
Despite the many novel therapeutic opportunities already arising from the mapping of existing genetic association findings to drug targets and compounds, there are strong reasons to suspect that the potential of this approach has yet to be maximised. Our analysis identified target-disease indication pairings (defined as a gene encoding a druggable target mapping to an LD interval containing a lead SNP from a GWAS) for 1,427 of the 4,479 druggable genes and 240 of the 652 genes encoding targets of licensed drugs. We might not have discovered associations for all genes in our druggable set because targets of drugs in development may truly play no role in any disease. However, alternative explanations are that only a fraction of diseases have been subjected to GWAS (451 out of 3022 conditions (the denominator is based on the number of bottom level MeSH disease areas)); that for many of the diseases that have been investigated by GWAS the sample sizes have been too small to detect all the responsible genes; or that there may have been incomplete coverage of certain druggable genes by the arrays most widely deployed in GWAS.
Genome-wide association analyses continue to be published in new disease areas, and in new ethnic groups. Additional genetic discoveries are also being made through increases in sample size and with other types of array that are currently not systematically captured by the GWAS catalog, such as dense, locus-centric SNP arrays following up on GWAS findings, e.g. Cardiochip (46), CardioMetabochip (47) and Immunochip (48). Exome-array analyses are also unveiling rare, disease-associated variants under-represented in whole-genome arrays. Therefore, we anticipate that the current gap between druggable genes and GWAS findings will be reduced over time, particularly if such studies are extended to electronic health record datasets, which form rich repositories of phenotypic traits and diagnostic codes.
Genetic profiling of a promising target against a range of outcomes can help evaluate the efficacy and safety of a target for the primary indication as well as the identification of additional disease indications to help plan drug development priorities. In order to stimulate the wider use of genetic association studies in drug development, and to ensure that such studies have comprehensive coverage of the druggable genome, we designed the content of a new array that combines focused coverage of the druggable genome within a whole genome scaffold. This array could be deployed to boost sample size and power in diseases already studied by GWAS to identify additional susceptibility loci and druggable targets. It could also help stimulate new druggable GWAS prioritised according to unmet therapeutic need. This would automatically lead to an abundance of target profiling information encompassing both efficacy and safety outcomes. This will need to be captured systematically, and curated consistently to help develop a repository of human drug targets linked to the predicted consequences of their pharmacological modification.
Some limitations of our analysis are noteworthy. The identification of repurposing opportunities in the current dataset relied on detecting discordance between a gene-disease association and the corresponding target-disease indication for a licensed drug, and excluding instances where this was likely to be due to a mechanism-based adverse effect. However, the lack of standardised vocabulary in licensing agency approval documents and the scientific literature currently hampers this effort. We therefore used a combination of EFO and MeSH terms to harmonise nomenclature. Two qualified physicians then compared the annotations using a pre-specified classification system developed in a pilot study involving one fifth of the dataset. Greater efforts to harmonise terms from the different ontologies (e.g. EFO, MeSH terms, the Disease Ontology (DO) and the Human Phenotype Ontology (HPO)) (49–51), as well as from vocabularies for drug indications from the Anatomical Therapeutic Chemical (ATC) classification, electronic BNF and eMC+ terms, would help generate standardised terminology to improve the efficiency of similar efforts in the future.
Where several genes occupy the same LD interval as a GWAS SNP, it may be difficult to determine which is causative. We took a pragmatic approach to this problem by classifying LD intervals containing druggable genes according to the total number of genes in the interval and the number and proximity of any druggable gene to the associated SNP. Approximately 529 unique LD intervals containing a variant with a significant association from a GWAS contained a single druggable gene. Such genes are strong positional candidates for the association. For the remainder, the LD interval included 2-146 genes (median 4 genes; excluding the 536 regions containing 0 genes, Figure 3), but a druggable gene was the first or next most proximal gene to the association signal in 36.1% of these cases. The rediscovery of 183 target-indication or mechanism-based adverse effect pairings for licensed drugs using this approach supports its validity. Previous Mendelian randomisation studies also provide reassurance that associations of SNPs in proximity to genes encoding druggable targets recapitulate the effects of drugs modifying the encoded protein pharmacologically (13, 52, 18).
Nevertheless, we recognize that some misclassification is possible, for example when a causal signal arising from a gene encoding a non-druggable protein occupies the same LD interval as a gene encoding a druggable target (confounding by linkage disequilibrium). Integrating information from feature annotation databases such as ENCODE (53), NIH Roadmap (54) and the Single Amino Acid Polymorphism Database (SAAP) (55) could help reduce misclassification. Localisation of causal genes could also be aided by evidence on the effect of genetic variants on RNA transcription and on the activity or concentration of proteins and metabolites, combining new proteomic and metabolomic technologies that are scalable to large population studies (56, 57) with statistical approaches to assess whether association signals from the same region are consistent with the same causal variant (58).
The Mendelian randomisation paradigm that underpins this strategy validates targets (within a defined disease context) and not compounds, although comparing the profile of effects of a genetic variant with those of a drug or developmental compound can help distinguish on- from off-target effects (13, 18). For this reason RCTs will not be superseded by the approach we describe because any new molecule developed for a target of interest could have off-target actions that cannot be modelled genetically. Additionally, the effect of altering the level or function of a target may only be seen beyond some threshold, so that a weak genetic effect may not adequately model the effect of modifying the target pharmacologically (26). Genetic evidence of a causal mechanism also does not guarantee its reversibility through pharmacological modification. For example, immune system related genetic variants associate with the risk of developing type I diabetes, but useful therapies arising from this knowledge may be difficult to realise because by the time the disease is diagnosed, immune mediated damage to the pancreatic beta-cells may be too advanced (26). Despite these theoretical limitations, evidence is emerging that Mendelian randomisation studies have wide-ranging potential to improve the efficiency of drug development and reduce the risk of expensive late-stage failure.
In summary, we have described an approach to focus and catalyse the use of genomic information to support drug target validation, which can be used to accurately match targets to disease indications and to identify rational repurposing opportunities for licensed drugs. The approach aligns well with proposals to ‘re-engineer’ translational science (59). It could help address the efficiency and innovation problem and could serve as a basis for reinvigorating drug development through new academic-industry partnerships.
Materials and Methods
Assembly of a druggable gene set
The reference set of genes used to redefine the druggable genome comprised gene annotations from Ensembl v.73 with a biotype of ‘protein coding’. To this were added T-cell receptor and immunoglobulin genes, polymorphic pseudogenes, plus a number of additional genes that were annotated in Ensembl v.73 as non-protein coding but which were nevertheless believed to encode important proteins (e.g., SRD5A2, CYP4F8). Data were extracted via Biomart (http://www.ensembl.org/biomart). The content was assembled in three tiers:
Tier 1 - This tier incorporated the targets of approved drugs and drugs in clinical development. Proteins that are targets of approved small molecule and biotherapeutic drugs were identified using manually curated efficacy target information from release 17 of the ChEMBL database (60). An efficacy target was defined as the intended target for the drug as opposed to any other potential targets for which the drug shows high affinity binding. Where binding site information was available in ChEMBL, non-drug-binding subunits of a protein complex were assigned to Tier 3, whereas the drug-binding subunit was included in Tier 1. Drugs in clinical development were identified from a number of sources: investor pipeline information from a number of large pharmaceutical companies (including Pfizer, Roche, GlaxoSmithKline, Novartis (oncology only), AstraZeneca, Sanofi, Lilly, Merck, Bayer and Johnson & Johnson; accessed June-August 2013), monoclonal antibody candidates and USAN applications from the ChEMBL database (release 17), and drugs in active clinical trials from clinicaltrials.gov (61). Targets for these drug candidates were assigned from company pipeline information and scientific literature, where available. Where no reported target information could be found, a potential target was assigned through analysis of bioactivity data in ChEMBL, with the target having the highest dose-response measurement ≤100 nM for the compound being assigned. All other human targets having an IC50/EC50/GI50/XC50/AC50/Kd/Ki/potency ≤100 nM for an approved drug or USAN compound were also included in Tier 1. Genes involved in ADME/drug disposition (phase I and II metabolic enzymes, transporters and modifiers) were identified from the PharmaADME.org extended set (62).
Tier 2 - This tier incorporated proteins closely related to drug targets or with associated drug-like compounds. Proteins closely related to targets of approved drugs were identified through a BLAST search (blastp) of Ensembl peptide sequences against the set of approved drug efficacy targets identified from ChEMBL previously (38). Any genes where one or more Ensembl peptide sequences shared ≥50% identity (over ≥75% of the sequence) with an approved drug target were included. Putative targets with drug-like (Lipinski rule-of-five compliant) compounds having an IC50/EC50/GI50/XC50/AC50/Kd/Ki/potency ≤1µM were identified from ChEMBL and were also included in Tier 2.
Tier 3 - This tier incorporated extracellular proteins and members of key drug-target families. Proteins distantly related to drug targets were identified through a BLAST search against the set of approved drug targets (as above), with any proteins sharing ≥25% identity over ≥75% of the sequence and with E-value ≤0.001 being included in the set. Members of five major ‘druggable’ protein families (GPCRs, kinases, ion channels, nuclear hormone receptors and phosphodiesterases) were extracted from KinaseSarfari (63), GPCRSarfari (64) and IUPHARdb (65) and included in Tier 3. Extracellular proteins were identified using annotation in UniProt (66) and Gene Ontology (GO) (67). Since the potential size of the secreted/extracellular portion of the proteome (i.e., potential targets for monoclonal antibodies) is large, and the available number of markers for inclusion on the array was limited, this dataset was restricted to those proteins for which higher confidence annotations of extracellular localisation were available (not solely prediction of a signal peptide). Proteins annotated in UniProt as having a ‘secreted’ subcellular location, those containing a signal peptide, or those annotated as ‘Extracellular’ (where these annotations were supported by the following evidence types: experimental, probable, by_similarity) were included in Tier 3. Proteins annotated in GO with the Cellular Component terms GO:0005576 (extracellular region), GO:0005615 (extracellular space), GO:0005578 (proteinaceous extracellular matrix), GO:0031233 (intrinsic to external side of plasma membrane), GO:0031232 (extrinsic to external side of plasma membrane), GO:0071575 (integral to external side of plasma membrane), GO:0031362 (anchored to external side of plasma membrane), GO:0009897 (external side of plasma membrane) or GO:0044214 (fully spanning plasma membrane), and supported by strong evidence (EXP, IDA, TAS), were also included in the tier.
Finally, proteins known to be cluster of differentiation antigens (CD antigens) according to UniProt were also added to Tier 3. Since the final set of genes included in Tier 3 was large (2370 genes), this tier was further subdivided to prioritise those genes that were in proximity (±50 kb) to a GWAS SNP and had an extracellular location (Tier 3A). The remainder of the genes were assigned to Tier 3B.
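The sequence-similarity cut-offs given for Tiers 2 and 3 can be sketched as a simple filter over BLAST hits. This is an illustrative sketch only: the `BlastHit` record and function names are hypothetical, coverage is assumed to be measured over the query peptide (the text says only "of the sequence"), and the other routes into each tier (bioactivity data, protein family membership, extracellular annotation) are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class BlastHit:
    """One blastp hit of an Ensembl peptide against an approved drug target (hypothetical record)."""
    gene_id: str
    pct_identity: float   # percent identity of the alignment
    pct_coverage: float   # percent of the sequence aligned (assumed: query coverage)
    evalue: float


def similarity_tier(hit: BlastHit) -> Optional[int]:
    """Return 2 or 3 according to the thresholds in the text, or None if the hit fails both."""
    if hit.pct_coverage < 75:
        return None
    if hit.pct_identity >= 50:
        return 2          # closely related to an approved drug target
    if hit.pct_identity >= 25 and hit.evalue <= 0.001:
        return 3          # distantly related to an approved drug target
    return None


def assign_similarity_tiers(hits: Iterable[BlastHit]) -> Dict[str, int]:
    """Keep the highest-priority (lowest-numbered) tier reached by any peptide of each gene."""
    tiers: Dict[str, int] = {}
    for hit in hits:
        tier = similarity_tier(hit)
        if tier is not None:
            tiers[hit.gene_id] = min(tier, tiers.get(hit.gene_id, tier))
    return tiers
```

A gene with one peptide passing the Tier 2 cut-off and another passing only the Tier 3 cut-off is placed in Tier 2 under this sketch, which matches the "one or more Ensembl peptide sequences" wording for Tier 2.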
Pfam-A domain content
To evaluate the Pfam-A domain content for druggable genes, gene identifiers were converted to UniProt accession keys using the UniProt web services (66). Only UniProt accessions matching the regular expression pattern ‘[OPQ][0-9][A-Z0-9]{3}[0-9]’ were retained for further analysis. Pfam-A domains were extracted using the Xfam API (68). For genes mapping to multiple UniProt accessions, we retained domain annotations for the UniProt accession mapping to the highest number of unique Pfam-A domains.
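The accession filter and the choice of one accession per gene might look like the following sketch. The function name and input shape are hypothetical; whole-string matching and alphabetical tie-breaking are assumptions, since the text does not say how ties or partial matches were handled.

```python
import re
from typing import Dict, Iterable, Optional, Set, Tuple

# Pattern quoted in the text; applied here to the whole accession (an assumption).
ACCESSION_PATTERN = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def best_accession(
    domains_by_accession: Dict[str, Iterable[str]]
) -> Optional[Tuple[str, Set[str]]]:
    """For one gene, return the accession with the most unique Pfam-A domains.

    domains_by_accession maps each UniProt accession of the gene to its
    Pfam-A domain identifiers (duplicates allowed).
    """
    valid = {
        acc: set(domains)
        for acc, domains in domains_by_accession.items()
        if ACCESSION_PATTERN.fullmatch(acc)
    }
    if not valid:
        return None
    # max() keeps the first maximum, so sorting makes ties resolve alphabetically.
    best = max(sorted(valid), key=lambda acc: len(valid[acc]))
    return best, valid[best]
```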
Comparison of druggable gene sets
For comparison with genes covered on the Illumina DrugDev array, sets of druggable genes defined by Hopkins and Groom in 2002 and Russ and Lampel in 2005 were obtained from DGIdb. Gene names were converted to Ensembl gene identifiers using the Ensembl REST API (69). The overlap between the three sets was determined and visualised using the Python module matplotlib_venn.
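Before plotting, the overlap between three gene sets reduces to seven disjoint regions. The sketch below computes them with plain Python sets; the function name and the three-character membership keys are illustrative choices rather than anything specified in the text.

```python
from typing import Dict, Iterable, Set


def venn_regions(a: Iterable[str], b: Iterable[str], c: Iterable[str]) -> Dict[str, Set[str]]:
    """Split three gene-identifier sets into the seven disjoint regions of a Venn diagram.

    Each key is a three-character membership code: position 1 is set a,
    position 2 is set b, position 3 is set c ("1" = member, "0" = not).
    """
    a, b, c = set(a), set(b), set(c)
    return {
        "100": a - b - c,
        "010": b - a - c,
        "110": (a & b) - c,
        "001": c - a - b,
        "101": (a & c) - b,
        "011": (b & c) - a,
        "111": a & b & c,
    }
```

The region sizes (`len` of each value) are the numbers a three-set Venn diagram would display.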
Compilation of GWAS results
The GWAS catalog was downloaded from http://www.ebi.ac.uk/gwas/api/search/downloads/alternative on 21/07/2015. Several quality control and post-processing steps were then taken. The identifiers of associated variants were validated against Ensembl (version 79, build 37) using the Perl API. This step returned the latest identifier and the build 37 coordinates; 707 associated variants could not be validated and were excluded. The GWAS catalog provides numerical effect estimates but does not specify the type of effect, e.g. odds ratio (OR) or beta coefficient. We attempted to resolve this using data in other fields (e.g. the presence of ‘case’ or ‘control’ in the discovery population fields) to classify the effect type as OR, beta or unknown. The discovery population field was also processed using a set of regular expressions to determine the sample size and populations used. The populations were then mapped to an appropriate 1000 genomes super population. Where no population name could be identified, EUR was used as a default, as the majority of studies in the GWAS catalog were performed on Europeans. The PubMed identifier field was used to search PubMed using the Biopython API. MeSH terms for the publications were mapped to the association to provide structured phenotype descriptions. However, these study-level descriptions may not apply to every association reported by the study; the MeSH terms were therefore manually curated for each association. These supplemented the Experimental Factor Ontology (EFO) terms that are already present in the GWAS catalog. Finally, the associations were filtered for those with P ≤ 5×10⁻⁸, so that all data used in this study met genome-wide significance.
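The kind of rule-based post-processing described above can be illustrated with a short sketch. The regular expressions, function names and the fallback to "beta" are hypothetical stand-ins: the text does not list the actual expressions or the full set of fields that were consulted.

```python
import re
from typing import Optional

CASE_CONTROL = re.compile(r"\b(cases?|controls?)\b", re.IGNORECASE)
NUMBER = re.compile(r"\d[\d,]*")
GENOME_WIDE_P = 5e-8


def classify_effect(effect: Optional[float], sample_description: str) -> str:
    """Guess the effect type from the discovery-sample description (illustrative rule)."""
    if effect is None or not sample_description.strip():
        return "unknown"
    if CASE_CONTROL.search(sample_description):
        return "OR"      # case/control wording suggests a binary trait
    return "beta"        # assumption: otherwise treat as a quantitative trait


def total_sample_size(sample_description: str) -> int:
    """Sum every count in the description, tolerating thousands separators."""
    return sum(int(n.replace(",", "")) for n in NUMBER.findall(sample_description))


def is_genome_wide(p_value: float) -> bool:
    """Apply the P <= 5e-8 filter used for the final association set."""
    return p_value <= GENOME_WIDE_P
```

In practice a single keyword rule will misclassify some records, which is presumably why an "unknown" class was retained.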
Assignment of LD intervals
The complete 1000 genomes phase 3 data (release 5) was downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502. BCFTools (v1.2 using HTSlib 1.2.1) was used to subset the VCF files into sub- and super-population files (70). For each population group, Plink v1.90b3d (71) was used to perform pairwise LD (r²) calculations between all variants in the processed GWAS catalog and bi-allelic 1000 genomes variants with a MAF ≥ 0.005 within a 1Mb flank either side of the GWAS variant. To reduce file size, only r² values ≥ 0.2 were output. The extremities of the LD region surrounding each GWAS SNP were defined by the positions of the variants furthest upstream and downstream of this SNP with an r² value ≥ 0.5. Associated variants that were not present in the 1000 genomes panel, or that were not in LD with any other variants, were given a nominal flank of 2.5Kb either side of the association.
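The interval definition can be sketched as follows, taking precomputed pairwise r² values as input. The function name and input shape are hypothetical, and including the GWAS variant's own position in the interval (so that a one-sided LD block still contains the lead SNP) is an assumption.

```python
from typing import Iterable, Tuple


def ld_interval(
    gwas_pos: int,
    ld_partners: Iterable[Tuple[int, float]],
    r2_min: float = 0.5,
    nominal_flank: int = 2500,
) -> Tuple[int, int]:
    """Return (start, end) of the LD region around a GWAS variant.

    ld_partners holds (position, r2) pairs for variants on the same chromosome.
    Variants with r2 >= r2_min define the extremities; when none qualify, a
    nominal flank of 2.5 kb either side is used instead.
    """
    positions = [pos for pos, r2 in ld_partners if r2 >= r2_min]
    if not positions:
        return gwas_pos - nominal_flank, gwas_pos + nominal_flank
    # Assumption: the lead variant always lies inside its own interval.
    return min(positions + [gwas_pos]), max(positions + [gwas_pos])
```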
Linking GWAS and drug target data
Gene annotations were extracted from Ensembl version 79. After filtering out pseudogenes, 38,352 genes remained. The set of genes was further reduced to those that overlapped an LD region surrounding an association. Within each associated LD region, the absolute base pair distance of the closest point of a gene from the associated variant was calculated. Variants located within a gene were given a distance of 0bp. Genes were given a distance rank value according to their base pair distance. In the event of a distance rank tie, the gene with the oldest annotation date was assigned the lower rank.
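A minimal sketch of the distance ranking, assuming each gene is supplied as a tuple of identifier, start, end and an ISO-format annotation date (a hypothetical input shape chosen so that dates sort correctly as strings):

```python
from typing import List, Tuple

Gene = Tuple[str, int, int, str]  # (gene_id, start, end, annotation_date as YYYY-MM-DD)


def gene_distance(variant_pos: int, start: int, end: int) -> int:
    """Distance in bp from the variant to the closest point of the gene (0 if inside)."""
    if start <= variant_pos <= end:
        return 0
    return min(abs(variant_pos - start), abs(variant_pos - end))


def rank_genes(variant_pos: int, genes: List[Gene]) -> List[Tuple[str, int, int]]:
    """Return (gene_id, distance, rank) tuples, rank 1 being the closest gene.

    Ties in distance are broken by annotation date, oldest first.
    """
    ordered = sorted(
        genes,
        key=lambda g: (gene_distance(variant_pos, g[1], g[2]), g[3]),
    )
    return [
        (g[0], gene_distance(variant_pos, g[1], g[2]), rank)
        for rank, g in enumerate(ordered, start=1)
    ]
```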
Drug targets in ChEMBL 20 are annotated with UniProt accessions. The accessions were converted to Ensembl gene identifiers using the UniProt ID mapper (http://www.uniprot.org/uploadlists/). Drug target Ensembl gene IDs were then intersected with the IDs of genes within LD regions to give a set of drug targets in the proximity of associated variants.
Evaluation of consistency between licensed drug indications and GWAS disease/biomarker traits
We evaluated the concordance between drug indication and disease association for those LD intervals defined by a GWAS SNP containing one or more genes encoding the target or targets of licensed drugs (Supplementary Figure S4). Two experienced clinicians used a pre-specified classification system developed in a pilot study of one-fifth of the total data set. Each physician was blinded to the identity of the gene encoding the druggable target within each LD interval. The outputs from the two physician-curators were then compared, any coding errors corrected, and inconsistencies between curators resolved by consensus, where agreement could be reached.
Category 0 referred to a situation where coding could not be completed because of missing data; 1 to a precise drug indication-target gene-disease association match; 2 to a drug indication-target gene-disease area association match; and 3 to a drug indication-target gene-mechanism-of-action association match. Categories 1 to 3 were defined as ‘concordant’. Category 4 referred to a drug mechanism-based adverse effect-target gene-disease association match; 5 to a drug indication-target gene-disease association mismatch with prior biological plausibility and 6 without prior biological plausibility; 7 to a trait unlikely to be of therapeutic interest (e.g. hair colour); and 8 to a genetic association with a novel biomarker of uncertain biological function (e.g. a metabolite measured by a metabolomics platform). For certain drug targets/genes, a 34 code was used to indicate that the genetic association finding could reflect rediscovery of both a mechanism of action and a mechanism-based adverse effect. For example, the modification of certain electrocardiographic parameters by variants in the targets of certain antiarrhythmic drugs could reflect both their mechanism of action and the mechanism by which such drugs produce their adverse effects. A 54 code was used when there was uncertainty about the direction of effect. A 9 code was assigned to the four cases where consensus could not be reached between the two curators. Categories 4, 5, 54, and 6 were referred to as ‘discordant’. Categories 1-4 and 34 were referred to collectively as ‘GWAS rediscoveries’ of known drug effects.
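The groupings of category codes can be written down as sets, which makes the overlap between them explicit (code 4 counts as both discordant and a rediscovery). The sketch below is illustrative; the function name and summary keys are not taken from the text.

```python
from collections import Counter
from typing import Dict, Iterable, Union

# Groupings as stated in the text; codes are held as strings so that the
# composite codes 34 and 54 are not confused with arithmetic values.
CONCORDANT = {"1", "2", "3"}
DISCORDANT = {"4", "5", "54", "6"}
REDISCOVERY = {"1", "2", "3", "4", "34"}


def summarise_codes(codes: Iterable[Union[int, str]]) -> Dict[str, int]:
    """Count curated LD-interval codes falling into each named grouping."""
    counts = Counter(str(code) for code in codes)

    def total(group):
        return sum(counts[code] for code in group)

    return {
        "concordant": total(CONCORDANT),
        "discordant": total(DISCORDANT),
        "rediscovery": total(REDISCOVERY),
        "all": sum(counts.values()),
    }
```

Because the groupings overlap, the three group counts need not sum to the total.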
Estimates of and confidence interval for the proportion of druggable genes in LD intervals
The proportion of druggable genes in LD intervals specified by GWAS associations in each MeSH disease or MeSH psychiatry category was calculated by dividing the number of druggable genes by the number of all genes, with 95% confidence intervals calculated assuming a binomial distribution, on the assumption that each study was independent.
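As a worked illustration, a binomial proportion and its 95% confidence interval can be computed as below. The text does not state which interval method was used; this sketch uses the normal approximation (Wald interval) purely as an assumption, and the function name is hypothetical.

```python
import math
from typing import Tuple


def proportion_with_ci(n_druggable: int, n_genes: int, z: float = 1.959964) -> Tuple[float, float, float]:
    """Return (proportion, lower, upper) using the normal approximation to the binomial.

    Assumption: a Wald interval; an exact or Wilson interval may be preferable
    when the proportion is near 0 or 1 or the gene count is small.
    """
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    p = n_druggable / n_genes
    se = math.sqrt(p * (1.0 - p) / n_genes)
    # Clamp to [0, 1] because the approximation can overshoot the valid range.
    return p, max(0.0, p - z * se), min(1.0, p + z * se)
```

For example, 50 druggable genes among 200 genes gives a proportion of 0.25 with an interval of roughly 0.19 to 0.31.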
Design of the Illumina DrugDev Array and comparative analysis of coverage of variation in the druggable genome
Selection of custom SNP content
The design was based on three tiers, corresponding to the level of evidence for druggability of the encoded proteins, with highest priority given to genes in Tiers 1 and 2. Tag SNPs were selected from the 1000 genomes European ancestry populations (CEU/GBR/FIN/TSI). Associations (tagging) between SNPs were identified based on linkage disequilibrium (r² > 0.8). SNPs already covered or tagged by the Human Core base content were not duplicated. Only SNPs with a minor allele frequency ≥1.5% were considered for inclusion. The tagging threshold was defined as the number of variants a SNP tags (including itself) and was varied according to the tier. For Tiers 1 and 2 a tagging threshold of 1 was applied, meaning that all SNPs were considered for inclusion, even if they only tagged themselves. For Tier 3A a tagging threshold of 3 was used, and for Tier 3B a threshold of 4. SNPs were selected only if they were positioned within +/-2.5Kb of the druggable genes selected in the three tiers (defined as a region of 2.5Kb upstream of the Ensembl gene start position to 2.5Kb downstream of the Ensembl gene end position). SNPs from the Illumina Exome array were also included in the custom content where these were found within genes in Tiers 1, 2 and 3A. Again, any redundancy with the Human Core and selected tag SNP content was eliminated. A collection of mitochondrial tag SNPs from the Broad Institute, designed to capture common variation within the mitochondrial genome, was also included in the custom content (http://www.broadinstitute.org/mpg/tagger/mito.html). This set comprises 64 SNPs; however, only 56 of these loci were designable and included in the array. Finally, remaining space was filled with lead SNPs for any disease or trait association from the GWAS catalog, prioritising SNPs located within 50kb of a druggable gene, or within the gene boundaries of any protein-coding gene.
For Tier 1 genes, 99,102 custom markers were selected, including tag SNPs and HumanExome content. A further 17,944 of the HumanCore markers also fell within Tier 1 gene regions, giving 117,046 markers in total. Tier 2 included 40,943 custom markers and an additional 6,270 markers from the HumanCore fell within Tier 2 gene regions, resulting in a total of 47,213 markers. Genes in Tier 3 were represented by 38,858 custom markers. A further 21,626 HumanCore markers fell within Tier 3 gene regions, yielding 60,484 markers in total. In addition to coverage of genes encoding druggable targets, 6,400 SNPs associated with complex diseases or traits identified from the GWAS catalog and from selected gene-centric studies were also incorporated in the array content. Of these SNPs, 2,996 were already covered in the Human Core, or previously included in the custom content leaving 3,410 variants to be added (of which 1,395 were within Tier 1-3 gene regions). Finally, 53 mitochondrial genome tag SNPs were also included, along with 9 mitochondrial genome exome SNPs. Considering all content, 226,138 markers were located in, or within +/-2.5 kb of, genes in the selected drugged, druggable and ADME sets. For the array as a whole, 78,175 markers were exonic, 286,577 intronic, and 27,393 located in 5’-, and 41,171 in 3’-untranslated regions respectively.
We used variants in the 1000 genomes phase 3 reference panel populations to compare coverage of the druggable genome by the new array and other commonly used genotyping arrays (see previous section). For this analysis, the variants on each array were first mapped to the 1000 genomes phase 3 reference panel and coverage then compared using two metrics: variant density (per kbp of the druggable gene) and the proportion of the variants in the druggable genome that were captured. We defined complete coverage of the druggable genome as capture of all the bi-allelic variants in a 1000 genomes phase 3 reference panel population with a minor allele frequency ≥ 0.005 (representing low frequency to common variants). Because of differences in variant content reported in successive genome builds, not all the content of the genotyping arrays could be mapped back to the 1000 genomes phase 3 reference set. However, the proportion of variants captured by each array that could be mapped to the 1000 genomes reference panel was very similar (Supplementary Figure S5).
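The per-SNP inclusion criteria for tag SNPs can be sketched as a single predicate. The function and parameter names are hypothetical, and reading the tagging threshold as a minimum (a SNP qualifies if it tags at least that many variants, itself included) is an assumption consistent with the Tier 1 and 2 description.

```python
# Tagging thresholds per tier, as given in the text.
TAGGING_THRESHOLD = {"1": 1, "2": 1, "3A": 3, "3B": 4}


def is_tag_snp_candidate(
    snp_pos: int,
    maf: float,
    n_tagged: int,
    tier: str,
    gene_start: int,
    gene_end: int,
    covered_by_core: bool = False,
    flank: int = 2500,
    min_maf: float = 0.015,
) -> bool:
    """Apply the window, frequency, tagging and redundancy criteria to one SNP.

    n_tagged counts the variants the SNP tags at r2 > 0.8, including itself.
    covered_by_core marks SNPs already present in, or tagged by, the base content.
    """
    in_window = gene_start - flank <= snp_pos <= gene_end + flank
    return (
        in_window
        and maf >= min_maf
        and n_tagged >= TAGGING_THRESHOLD[tier]  # assumption: threshold is a minimum
        and not covered_by_core
    )
```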
Evaluating genotyping array coverage of the DrugDev array
The build 37 genotyping array content for the Illumina arrays was downloaded from Will Rayner's array strand website (http://www.well.ox.ac.uk/~wrayner/strand). Where multiple versions of an array existed, the latest version was downloaded. The Affymetrix array annotations were downloaded as SQLite databases from the Affymetrix website. 1000 genomes data was processed as described in the method for creating LD regions. Variants present on the genotyping arrays were mapped to 1000 genomes phase 3 using the following sequence: variants with rs identifiers were searched against the 1000 genomes sites file; if no match was obtained, synonyms of the rs identifier (obtained from Ensembl version 79, build 37) were searched. Variants not mapping by rs identifier were then mapped by chromosome, position and alleles (flipping the strand of the alleles where appropriate). Allele frequencies and variant tagging for each sub-population group were calculated using Plink (v1.90b3d (72)); tagging was restricted to bi-allelic low-frequency and common variants (MAF ≥ 0.005) within 1Mb of the source SNP. Baseline 1000 genomes coverage of the druggable genome in the different sub-populations was ascertained using Bedtools (v2.22.1) to intersect 1000 genomes variants with a MAF ≥ 0.005 against the druggable gene list (including 2.5 kbp up/down stream). Proportional coverage of the druggable genome by the different genotyping arrays was then ascertained by intersecting the baseline coverage with the 1000 genomes mapped array content.
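The mapping cascade (rs identifier, then synonyms, then chromosome/position/alleles with an optional strand flip) can be sketched with dictionary lookups. All names and data structures here are hypothetical, and the sketch assumes single-nucleotide alleles so that a strand flip is a simple base complement.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")
Locus = Tuple[str, int, FrozenSet[str]]  # (chromosome, position, alleles)


def flip_strand(alleles: FrozenSet[str]) -> FrozenSet[str]:
    """Complement each single-base allele to represent the opposite strand."""
    return frozenset(a.translate(COMPLEMENT) for a in alleles)


def map_variant(
    variant: dict,
    sites_by_rsid: Dict[str, str],
    synonyms: Dict[str, List[str]],
    sites_by_locus: Dict[Locus, str],
) -> Tuple[Optional[str], str]:
    """Return (reference site, how it was matched) for one array variant.

    variant needs the keys 'rsid' (may be None), 'chrom', 'pos' and 'alleles'.
    """
    rsid = variant.get("rsid")
    if rsid in sites_by_rsid:
        return sites_by_rsid[rsid], "rsid"
    for alt in synonyms.get(rsid, ()):
        if alt in sites_by_rsid:
            return sites_by_rsid[alt], "synonym"
    alleles = frozenset(variant["alleles"])
    for candidate, how in ((alleles, "position"), (flip_strand(alleles), "position_flipped")):
        key = (variant["chrom"], variant["pos"], candidate)
        if key in sites_by_locus:
            return sites_by_locus[key], how
    return None, "unmapped"
```

Note that A/T and C/G variants are unchanged by complementing, so a position-based match for those cannot tell the two strands apart.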
Indication and adverse effects of licensed therapies
Drug indication data was obtained from several sources. The primary source was the First Databank database (FDB, http://www.fdbhealth.co.uk/). This is a commercial database used by University College London Hospitals (UCLH), and a one-off single release was kindly provided for research purposes by First Databank Europe Ltd. As FDB is used clinically, this was regarded as the “gold standard” indication set used for the manual categorization of concordant/discordant drug/GWAS links (see above). FDB drug indications are tagged with Unified Medical Language System (UMLS) concept unique identifiers (CUIs) and could be mapped into MeSH and other ontologies within the UMLS meta-thesaurus (49, 73). Drug indication data was also obtained from ChEMBL 21. This was obtained by manual curation and mapping of data from FDA drug labels (https://dailymed.nlm.nih.gov/dailymed/), WHO ATC classification (http://www.whocc.no/atc_ddd_index/) and ClinicalTrials.gov (https://clinicaltrials.gov). This was used to supplement the FDB data and fill in indication data for drugs that were not present in the FDB release.
Side effect data was obtained from the Side Effect Resource (SIDER) database (74). The drug identifiers used in SIDER were mapped back to ChEMBL identifiers using a mapping file provided by SIDER. The side effects are provided as MedDRA terms and UMLS CUIs and were mapped to MeSH terms using the UMLS.
Funding
Work in this paper was supported by awards from University College London Hospitals National Institute of Health Research (NIHR) Biomedical Research Centre, British Heart Foundation (BHF Project Grant PG12/71/29684), a Strategic Award from the Wellcome Trust (WT086151/Z/08/Z) and Member States of the European Molecular Biology Laboratory (EMBL). The work was also supported in part by the Rosetrees Trust. Aroon Hingorani is an NIHR Senior Investigator.
Author contributions
Chris Finan, Anna Gaulton, Felix Kruger, John Overington, Aroon Hingorani and Juan Pablo Casas developed the idea for the project and approaches to accurately connect genetic associations to drug targets and compounds. Anna Gaulton, Felix Kruger and John Overington updated estimates of the druggable genome. Luana Galver and Ryan Kelley worked with Anna Gaulton, Chris Finan, Tina Shah and Jorgen Engmann to develop SNP content for the Illumina DrugDev array. Anneli Karlsson curated target information for clinical stage drugs, Rita Santos curated target information for FDA approved drugs in the ChEMBL database, and Tom Lumbers and Aroon Hingorani compared indications and adverse effects of licensed drugs with disease associations from GWAS.
Competing interests
No conflicts of interest.
Data and materials availability
Additional materials may be made available through contact with the authors.
Acknowledgments
We thank Dr Cora Vacher and colleagues for helping to facilitate the design of the Illumina DrugDev Array. We would also like to thank Dr Reecha Sofat and Anita Jena-Smol for facilitating access to the First Databank database and for advice on designing database queries, and First Databank Europe Ltd for providing a single copy of the database for research purposes.
Footnotes
One Sentence Summary: Mapping genome-wide association study (GWAS) findings to an updated set of genes encoding drug (and druggable) targets revealed new development and repurposing opportunities; these could be extended by deployment of genotyping arrays that ensure comprehensive capture of variation in the druggable genome, in larger samples with a broader set of disease data.