Moving Targets: Monitoring Target Trends in Drug Discovery by Mapping Targets, GO Terms, and Diseases

Drug Discovery is a lengthy and costly process and has faced a period of declining productivity within the last two decades. As a consequence, integrative data-driven approaches are nowadays on the rise in pharmaceutical research, making use of an inter-connected (network) view on diseases. In addition, evidence-based decisions are alleviated by studying the time evolution of innovation trends in drug discovery. In this paper a new approach leveraging data mining and data integration for inspecting target innovation trends protein family-wise is presented. The study highlights protein families which are receiving emerging interest in the drug discovery community (mainly kinases and G protein coupled receptors) and those with areas of interest in target space that have just emerged in the scientific literature (mainly kinases and transporters) highlighting novel opportunities for drug intervention. In order to delineate the evolution of target-driven research interest from a biological perspective, trends in biological process annotations from Gene Ontology (GO) and disease annotations from DisGeNet for major target families are captured. The analysis reveals an increasing interest in targets related to immune system processes, and a recurrent trend for targets involved in circulatory system processes. At the level of disease annotations, targets associated to e.g., cancer-related pathologies as well as to intellectual disability and schizophrenia are increasingly investigated nowadays. Can this knowledge be used to study the “movement of targets” in a network view and unravel new links between diseases and biological processes? We tackled this question by creating dynamic network representations considering data from different time periods. The dynamic network for immune system process-associated targets suggest that e.g. breast cancer as well as schizophrenia are linked to the same targets (cannabinoid receptor CB2 and VEGFR2) thus suggesting similar treatment options which could be confirmed by literature search. The methodology has the potential to identify other drug repurposing candidates and enables researchers to capture trends in research attention in target space at an early stage. The KNIME workflows and R scripts used in this study are publicly available from https://github.com/BZdrazil/Moving_Targets. Author summary In this study we have investigated target innovation in drug discovery over a period of 22 years (1995-2016) by extracting time trends of research interest (as published in the scientific literature and stored in the ChEMBL database) in certain protein classes inspecting different measures (numbers of pharmacological measurements, targets, papers, and drugs). Focusing on the most relevant protein classes in drug discovery (G protein-coupled receptors, kinases, ion channels, nuclear receptors, proteases, and transporters), we further linked single targets to Gene Ontology (GO) biological process annotations and inspected steep increasing or decreasing trends of GO annotations within target families over time. We also tracked trends in disease annotations from DisGeNET by filtering out diseases linked to targets with emerging trends in research interest. Finally, targets, GO terms, and diseases are interconnected in network representations and shifts in research foci are investigated over time. This new methodology which utilizes data mapping and data analysis can be used to explore trends in research attention target family-wise, to uncover previously unknown links between diseases and biological processes and to identify potential candidates for drug repurposing.


Introduction
The organization of proteins (and genes) into families, is usually achieved through phylogeny (amino acid sequence similarity) and -to a lesser extent -by similarity of secondary structures.
The assumption is that proteins belonging to the same family share a common function which will also be encoded in the protein sequences and their folds.
Apart from the general desire to organize information, such indexing and grouping of proteins is especially relevant in drug discovery. If we understand how protein modulation, e.g. by small molecules works for a particular protein, we also move closer towards understanding how modulating a whole group of related proteins might work. Thus, structure-function relationships are often carried out at the protein family level [1,2].Somewhat contradictory to this observation is the phenomenon of polypharmacology, where a molecule interacts with multiple (often unrelated) targets (either wanted or unwanted) [3][4][5].Therefore, across target families, close similarities in binding patterns appear to occur for similar ligands -this has been explored systematically for the identification of novel ligand-target associations [6][7][8]. A straightforward application of exploring polypharmacology of drugs is for repurposing reasons [9], especially in the area of rare diseases [10].
Setting aside the goal of unraveling ligand-protein interactions, another emerging application of target (family) indexing and characterization in drug discovery is the desire to capture the past, current, and (potential) future degree of research interest in main drug target families.
Until now, exploratory studies of research attention across target families have been carried out mainly from a static point of view with only a few analyses including the time dimension.
In an analysis article from 2011 by Rask-Andersen and colleagues [11], the authors looked into such trends as the exploitation of new drug targets over a period of 30 years and have identified three innovation peaks. Interestingly, when exploring drug-target networks their analysis revealed that newer molecular entities tend to form isolated networks pointing towards a tendency to interact with new drug targets and potentially possessing new modalities for treatment.
In a more recent analysis by Santos et al. [12] the authors highlight the main drug efficacy target families, which together account for approximately 44% of all human drug targets: Gprotein coupled receptors (GPCRs), kinases, ion channels, and nuclear receptors. The authors also present a temporal view on innovation patterns in these privileged protein classes by analyzing the proportion of approved drugs for each target class comparing different time frames. In addition, patterns of innovation in different therapeutic areas are represented in a timewise manner, but independent of the target class.
Since 2014, a strategic effort by the US National Institutes of Health termed the Illuminating the Druggable Genome (IDG) initiative focuses on exploring currently understudied, but potentially druggable, proteins [13]. In order to be able to define the degree to which a target is comparatively well studied or understudied, the existing knowledge on current associations between compounds (or drugs), targets, Mechanism of Action (MoA), and disease were systematically collected and mapped by this initiative. These efforts resulted in defining socalled Target Developmental Level (TDL) categories by which targets can be classified according to the depth of investigations undertaken for a particular target (by uniting available chemical, biological and clinical information). The four categories in the TDL scheme, Tclin (clinic: drug annotation(s) with approved MoA), Tchem (chemistry: binding small molecules with high potency), Tbio (biology: disease annotation(s) available), and Tdark (dark genome: not falling into one of the other categories), are thereafter representing different (decreasing) stages of target innovation. A multimodal web interface called PHAROS enables intuitive access to the IDG knowledgebase [14]. A similar European initiative, called Open Targets, uses human genetics and genomics data for systematic drug target identification and prioritization [15].
Considering the compound space, investigating drug-target associations does provide only a limited picture of innovation trends in drug discovery. The stage at which NMEs are approved is the very final stage of a long series of investigations and decisions taken. Thus, including all data available at the compound level over time (including bioactivity data extracted from databases and from the scientific literature) is a crucial step forward in the domain of datadriven drug discovery strategies.
The immense pool of small compound data spans two main data repertoires where preclinical data is frequently reported: scientific literature and patents. A thorough analysis of trends in publishing behavior of bioactivity compounds for novel targets (rather literature or patents first) was delivered by Ashenden et al. [16]. Preclinical data can guide decision making at earlier stages in the drug discovery pipeline since it reflects trends in research attention within the community. It might help researchers to base their research endeavors on evidence-based criteria and direct their interest towards understudied proteins or core scaffolds [17].
In this study, major target classes are monitored over time to capture the different aspects of target innovation in drug discovery. First, target families of emerging interest in drug discovery are highlighted as reflected by an increasing number of targets, bioactivity measurements, annotated drugs, and number of publications. Second, trends in GO (Gene Ontology) biological process annotations enable to filter out biological processes of (potential) emerging attention per target family. Third, studying the evolution of disease annotations for major target families highlights popular targets of intervention for particular diseases. Finally, uniting the different views in the form of network representations by mapping targets, GO terms, and diseases, allows us to uncover new links between diseases and biological processes.

Target innovation trends
We investigated protein innovation trends by inspecting different measures reflecting research attention per target family: increase or decrease of the number of unique compounds, number of drug-efficacy target annotations (playing a role in the efficacy of the drug in the indication(s) for which it is approved), number of published papers (as expressed by PMIDs), and number of unique targets. In addition, we studied the evolution of the proportions of Target Developmental Levels (TDLs) through the years for each target family.
For all four measures describing protein family popularity in absolute numbers (Figure 1), the steepest upward trends can be observed for GPCRs and kinases over 21 years (1995-2015).
Interestingly, although the GPCR family was the most studied in terms of unique compounds and papers from the year 2000 onwards, kinases outpaced GPCRs towards the end of the , which is not reflected in a particularly large number of papers for these years (106 and 90 papers, respectively). Looking a bit closer at the respective numbers of targets published in the respective papers of these years, it becomes clear that these two target peaks can each be attributed to one large study, which were both dedicated to comprehensive kinase inhibitor selectivity screens [18,19]. As obvious from the steep rise in paper and compound counts from 2012/2013 onwards the two large scale screening studies (both designed by Zarrinkar et al.) might have sparked the increasing research interest in particularly this family of proteins since "the data set suggests compounds to use as tools to study kinases for which no dedicated inhibitors exist" [18]. In terms of numbers of drug-efficacy target annotations, a picture similar to that of target trends can be observed with also the highest peak in 2011 In comparison to kinases, all other target families are showing relatively smooth increases in the different popularity measures depicted. Interestingly, GPCRs, although of emerging interest until recently, were obviously not so much affected by large screening studies (which is revealed by their high paper counts over the years and relatively flat compound and target trends between 2005 and 2015). Membrane proteins like GPCRs are extremely challenging targets when it comes to protein purification and crystallization. Therefore -although 33% of all small molecule drugs are targeting GPCRs -they account for only 12% of human protein drug targets [12].
From Figure 1 it also becomes visible that ion channels have outperformed proteases in terms of paper numbers over the years. This is not unexpected since ion channels according to Santos et al. make up the largest proportion of drug targets inspecting the individual protein families (19%) [12]. However, to date this is not so well reflected in terms of literature-reported compounds, targets and drug-efficacy target annotations ( Figure 1). By relating the absolute numbers from target innovation trends (as represented in Figure 1) to the respective total numbers of compounds, papers, drug-efficacy target annotations, and targets per year, relative trend measures are observed. These enable temporal comparison across the target families more easily, particularly when evaluating the early years of rational drug discovery, which tend to be relatively data poor. Notably, nuclear receptors outperformed all other target families in terms of drug-efficacy target annotations from 1998-2004, but not in terms of other popularity measures ( Figure 2). This is also reflected by inspecting the proportions of small-molecule drugs and human drug targets, discussed in Santos et al. [12]: nuclear receptor drugs make up 16% of all drugs, but only 3% of drug targets. Inspecting the proportions of drug-efficacy target annotations over the whole observation period (Figure 2), it becomes clear that from 2005 onwards primarily GPCRs and later kinases display a relative enrichment in this popularity measure (in relation to drug-efficacy target annotations for all six target families). Notably, interest in kinases seems to be a more recent trend exemplified by the relative enrichment in all popularity measures under investigation by 2016 ( Figure 2). A multi-dimensional measure of (evidence-based) research attention in particular human protein targets is provided by the Target Developmental Level (TDL) categories that were assigned by the IDG initiative [13,14]. By looking at the time evolution of TDL categories protein family-wise (Figure 3), differences with respect to stages of drug development between the families can be captured. It has to be noted, that this visualization does not show the entirety of protein targets for each target family but reflects research endeavors, successes and future opportunities as reflected in the scientific literature (and recorded by ChEMBL).
From Figure 3 it becomes obvious that within the Tclin category, ion channels and GPCRs make up the largest fraction of targets, consistently over the years. The majority of well-studied ion channel and GPCR proteins are already linked to at least one approved drug by MoA and proportionally not many new targets without known drugs (categories Tchem and Tbio) have been reported in the scientific literature within the last ~10 years. Inspecting the nuclear receptor superfamily the same tendency holds true for 1995-2005 but later the proportion of targets belonging to Tchem continuously increased, pointing to investigations on novel targets with potent compounds (but lacking approved drugs), such as investigations on metabolic nuclear receptors [20,21].
In general, a higher fraction of targets from the categories Tchem or Tbio in recent years, highlights opportunities for drug discovery that are underexplored. These tendencies are more pronounced for the protein classes of transporters, kinases, and proteases (see Figure 3). For kinases, even targets of the category Tdark in 2008 and 2011 (one target each) are observed.
The highest fractions of Tchem and Tbio (compared to all other years) can also be attributed to the same years which is the result of two large screening studies described above [18,19].
Taken together the results from studying the evolution of target trend measures, a shift from GPCR-focused to kinase-focused research can be observed within the last two decades.
Results from tracking the target developmental levels suggest that future research be directed towards unexploited areas in kinase, protease, transporter, and nuclear receptor focused studies since these families are still offering a wealth of targets with no approved drugs. It will however depend on the available research infrastructure, collaborations, and funded projects, if the "hard targets" (e.g. transmembrane transporters) can also be systematically deorphanized in the future. Initiatives like the EU-funded project Resolute [22] which concentrates on deorphanizing Solute Carriers will hopefully pave the way for other projects uncovering the unexploited opportunities in the human proteome.

Evolution of Gene Ontology annotations
Moving from the level of target families to the level of individual targets enables us to explore biological aspects of target innovation trends encoded in Gene Ontology (GO) annotations.
Here, we focus on GO biological process terms as they "represent a specific objective that the organism is genetically programmed to achieve". Thus, by investigating steep increasing or decreasing trends of certain GO biological process annotations within target families over time, whereas for proteases and NRs exclusively negative GO trends were extracted (proteases: 15; nuclear receptors: 10;). While for transporters no significant trends could be extracted by fitting a robust regression, the family of GPCRs delivered both positive and negative GO biological process annotation trends, but with the majority being negative (2 positive, 14 negative trends; see Table 1 and Supplementary Figure S1). Strikingly, for GPCRs, the biological process GO annotation "immune system process" (definition: any process involved in the development or functioning of the immune system; see also Supplementary Table S1) shows by far the steepest increasing trend (measured as total bioactivity counts) with a peak in 2009 (approx. 14% of all bioactivity measurements). The identified trend underpins the increasing interest in GPCRs as possible modulators of the immune system, such as the regulation of T cell activation, homeostasis and function [23]. Through inspection of the individual contribution of certain target classes to this pronounced trend, the research interest in target classes over time being involved in a certain biological process (i.e., "immune system process") is delivered. For GPCRs, the positive trend observed for the "immune system process" GO term is largely driven by an increase in investigations on proteins belonging to the classes of opioid receptors, cannabinoid receptors, and CC-chemokine receptors ( Figure   4).
Investigations on opioid receptors have shown that their role in immune function can be manifold, ranging from immunosuppression, immunomodulation, and dual effects [24]. Opioid abuse itself can accelerate the progression of HIV infection, which is believed to happen due to changes of the immune status [25]. Likewise, cannabinoid receptors can mediate the regulation of immune function. Thus many in vitro and in vivo studies aim to exploit the putative therapeutic potential of cannabinoid signaling in inflammation-accompanied diseases (e.g., multiple sclerosis) [26]. Chemokine receptors have been identified as interesting targets for antineoplastic therapies since they are expressed on the surface of both tumor and immune cells [27]. They regulate proliferation, cell survival, differentiation, immune cell activation, cell migration, and death. Since there exist both, pro-and anti-inflammatory cytokines, they can either repress tumor growth or induce cell transformation and malignancy, always depending on the current state of the tumor microenvironment [28]. Since the immune system in general plays a critical role in various physiological and pathological processes, such as inflammation, autoimmunity, tumor growth, metastasis etc. there is a huge potential for drug discovery and development in exploiting the listed target classes further.
In the case of kinases, investigations on targets annotated with "immune system process" have also increased during the investigated period (Table 1). Notably, the main contributing class of proteins for this emerging trend are Janus kinases (JAKs) in recent years ( Figure 5), here especially tyrosine kinase 2 (TYK2, JAK-A and JAK-B; Supplementary Figure S2). This class of non-receptor tyrosine kinases [29] are essential mediators for downstream signaling via STAT (signal transducer of activation) proteins and require a soluble factor to associate extracellularly with the corresponding transmembrane receptor for their activation, including cytokines and hormones. Since the JAK/STAT pathway is the major paradigm for membrane to nucleus signaling, it is also recognized as an important regulator of the immune system including the resistance to infection and maintenance of immune tolerance [30,31]. JAKs are therefore targeted by many modern pharmaceuticals for e.g. treatment of autoimmune diseases such as myelofibrosis and rheumatoid arthritis. In addition to their role in immunity and inflammation, especially TYK2 plays an important role in the prevention and activation of carcinogenesis as well as unexpected tumor rejection [32].
While attention in GPCR protein classes involved in "immune system processes" increased since 1995, the research interest in protein classes involved in "circulatory system processes" (definition: organ system process carried out by any of the organs or tissues of the circulatory system) decreased within the same time period for GPCR proteins. Specifically, a shrinking percentage of bioactivity data was reported for dopamine receptors, adrenergic receptors, and adenosine receptors ( Figure 6). Such a trend points to decreasing interest in GPCRs as direct drug targets for cardiovascular diseases. Instead, molecular targets which are part of the signaling pathways that these GPCRs elicit (such as beta-arrestins) are nowadays investigated as promising new avenues for treating heart diseases [33].
Interestingly, when inspecting trends in the higher level GO term "signal transduction", which delivered one of the steepest positive GO annotation trends for kinases and a slightly negative trend for GPCRs (see Table 1), familiar protein classes mainly contribute to these trends.

Evolution of disease annotations
At the level of diseases, target annotations appear to provide even more meaningful insights into target innovation trends since they delineate the extent to which the involvement of a protein (or protein class) in a certain biological process can potentially be translated into the development of pharmaceuticals for the treatment of a particular disease.
With respect to the overall distribution of the trends among the different target families, disease annotations seem to behave in analogy to GO biological process annotations: For kinases (35 diseases) and ion channels (3 diseases) exclusively positive trends were found to be statistically significant; whereas for nuclear receptors (13 diseases) and proteases (23 diseases) the disease trends are exclusively recurrent. Transporters do not follow any significant disease trends, but GPCR disease trends can be divided into positive (9 diseases) and negative (16 diseases) trends (see Table 2). No particular attempt was undertaken to filter out disease annotations that are describing tightly connected diseases (e.g. "endometrioma" and "endometriosis") or subcategories of another disease annotation (e.g. "pancreatic neoplasm" and "malignant neoplasm of pancreas") in order to not bias the trends by individual knowledge about certain diseases.
Across all target families, targets annotated to different types of cancer are by far the most frequently investigated ones as they show steep increasing trends in research attention over 22 years. This not only accounts for kinases, but also for ion channels, with a single unique disease annotation -"benign/malignant neoplasms" ( Table 2 and Supplementary Figure S4 blocks key regulators of EGFR/HER2 signaling pathways, as well as multiple compensatory pathways, such as AKT, HER3, and MET [36]. The increasing interest in VEGFR as target for antineoplastic drugs is attributed to its effect in preventing tumor-neovascularization (antiangiogenic therapy) [37].
The JAKB family which has been identified as the major contributor to the steep GO biological process trend "immune system process" for kinases only appeared to contribute to the emerging trend in the case of "malignant neoplasm of prostate", "prostatic neoplasms", and Besides cancer-related diseases, "intellectual disability" (subnormal intellectual functioning which originates during the developmental period; previously referred to as "mental retardation") [38] and "schizophrenia" are also among the steep positive disease trends for kinases ( Table 2, Supplementary Figure S6 and S7). Interestingly, for one of the main protein targets contributing to both positive disease trends in recent years -the hepatocyte growth factor receptor (c-MET) -associations to the neurodevelopmental disease "intellectual disability" are not described in the scientific literature to date, but were retrieved merely from the publicly available knowledge base "Genomics England PanelApp" [39] (via DisGeNET) [40]. The same accounts for the association between JAK3 and "intellectual disability". Such findings are especially interesting in the context of novel therapeutic opportunities and drug repurposing since a strong genetic link between the target and the disease is confirmed but research in the respective indication area is still in its childhood. For schizophrenia, however, an influence of mutations in the MET proto-oncogen was investigated systematically since a lower incidence of cancer in schizophrenia patients was observed previously [41]. The authors also postulate an influence of MET variation in neurocognition and neurodevelopmental processes in general, since MET was also found to be involved in autism spectrum disorder (ASD) [42]. In contrast, other protein-disease associations contributing to this steep trends for "intellectual disability" and "schizophrenia" are better documented in the scientific literature to date, such as the involvement of the serine/threonine kinase AKT in these diseases [43,44].
Vascular endothelial growth factor receptor 2 (VEGFR2) is the most heavily studied protein that is associated with "schizophrenia". The growth factor binding to this receptor (VEGFA) has been implicated in neurotrophy and neurogenesis as well in the pathophysiology of schizophrenia [45].
In the case of ion channels, the class of voltage-gated potassium channels is mainly contributing to the increasing disease trend "neoplasms" (identical to the trends for "malignant neoplasms" and "benign neoplasm"), and to a lesser extent transient receptor potential (TRP) channels. Interestingly, there is only a single representative from each class contributing to this trend, respectively: HERG and the vanilloid receptor (Supplementary Figure S9). Overand/or mis-expression of HERG in tumor cells has been identified as a critical factor for establishment and maintenance of neoplastic growth [46]. Thus, HERG inhibitors can reduce proliferation both in vitro and in vivo which offers new antineoplastic strategies [47]. Also, vanilloid receptors have been found to regulate cancer cell proliferation, angiogenesis and apoptosis and thus have been suggested as potential anticancer targets and prognostic markers in recent years [48]. As already apparent from GO biological process annotations, interest in GPCR targets associated to cardiovascular diseases decreased which is confirmed by the declining interest in GPCRs associated with cardiovascular diseases (such as "hypotension", "bradycardia", and "hypertensive disease"). In addition, mood disorders seem to be subjected to declining

Dynamic networks to study the evolution of associations between targets, GO terms, and disease annotations -Uncovering new links between biological processes and diseases
In the previous sections, In order to include the dimension of time like in the precedent investigations, the study period was split into two equal parts (11 years each) and consequently two independent networks were generated for each investigated disease or GO annotation. Thus, in conjunction, the network views provide a "dynamic network" perception which delineates major changes in network connectivity over time. In contrast, dynamic protein class-disease networks for a particular biological process (GO term) over time as exemplified in Figure 11 for "immune system process" delivers a comprehensible overview of the degree of interest in target classes that might be leveraged for operating through the particular process. By creating a single network for multiple target families (for kinases and GPCRs in Figure 11) dominant targets/target classes across families can be identified, such as e.g. cannabinoid receptors and Janus kinases in the more recent period for "immune system processes" (Figure 11). A list of all target classes annotated to "immune system process", including their absolute and relative bioactivity amounts within the two time periods, is given in Supplementary Tables S6 and S7. Although appearing in the same network, cannabinoid receptors and Janus kinases are showing differences in their associations to diseases. Cannabinoid receptors are linked to schizophrenia, different forms of breast cancer, and unipolar depression (to name just the top ranked diseases). In contrast, Janus kinases are not linked to breast cancer, but to intellectual disability (which in return is not linked to cannabinoid receptors). The association between Janus kinases and intellectual disability is a quite recent finding since the role of the JAK-STAT pathway in neuronal differentiation and gliogenesis is only insufficiently understood to date [50].
Looking for possible links between diseases at the level of a conjoint biological process ("immune system process"), it becomes obvious that the cannabinoid receptor CB2 and VEGFR2 are both strongly linked to breast cancer as well as schizophrenia (Supplementary Figure S12). A link between cancer and schizophrenia has been proposed earlier [41,51], as well as an influence of CB2 activation or inhibition on VEGFR2-signalling [52,53]. The identical target-connectivity of these classification-wise disparate pathologies suggests similar treatment options for both. While the use of the antipsychotic drug pimozide for the treatment of schizophrenia is under active investigation for more than two decades [54], it was only recently reported as a potential new route for treating triple-negative breast cancer [55].
Pimozide suppressed cell growth of breast cancer cells in vitro and blockage of VEGFR2 was suggested to be responsible for inhibiting tumor development [55].
Inspecting target trends of target classes annotated to "circulatory system processes" confirms the previous observation of declining research interest of GPCR proteins associated to cardiovascular diseases: e.g., target classes and single protein targets annotated to hypotension show declining interest nowadays (visible from Figure 11 and Supplementary Figure S13 and quantified in Supplementary Tables S8 and S9). Instead we can observe more interest in targets annotated to different forms of epilepsy (as identified earlier when inspecting the evolution of disease annotations; Table 2). Strikingly, cannabinoid CB1 receptors have caught up with dopamine D2 and adenosine A2a receptors, as visible from the single targetdisease networks in Supplementary Figure S13.
The outlined examples demonstrate the potential of the method which uses data mapping and data visualization for the identification of malignancies and potential candidates for drug repurposing.

Summary and Conclusions
In Two orthogonal views of target innovation are considered by investigating trends in GO biological process terms and disease annotations over time. We note that no proof can be provided whether there is a causal link between attention being paid to a target or target family and a detected emerging disease or biological process trend. We investigated how intense a particular target was investigated and reported in the medicinal chemistry literature without extracting information on potential indication areas from the respective literature sources.
Thus, the emerging biological process and disease trends rather highlight opportunities for therapeutic intervention rather than a real current focus on a particular disease or biological process.
Through data mapping in conjunction with data visualization, trends in research attention can be captured at an early stage also for researchers who are not experts in a particular research domain. In addition, not all links between targets and diseases are described in the scientific literature, but are only available from different knowledge bases and are made visible through the data mapping procedure (such as the link between "intellectual disability" and c-MET).
Importantly, the described methodology to analyse and visualize the data can only depict information that is already available. Its strength lies in the interconnection of the disparate scientific sub-domains: drug discovery and biology linked by bioinformatics analysis. This implies that the method works well when considering data-rich domains, as underrepresented targets families will tend to be ignored if one considers only the number of measurements.
However, it can in principle also be applied to only a specific target family or target class in order to identify emerging trends within that particular domain.
One limitation is the availability of GO and disease annotations for particular targets which might be unevenly distributed, biasing the amount of available annotations to already heavily studied targets. An extreme example are orphan targets, where no potent compounds are known and there are few if any diseases or biological process associations.
The question if such an analysis rather contributes to a reinforcement in the study of already heavily investigated targets and diseases -rather than inspiring research on novel targets and in new therapeutic areas -remains. Most likely, the relationships between compounds, biological processes and diseases that are being unearthed by the network representations, and more specifically the dynamic changes of these relationships over time, will deliver the most useful information at the intermediate stage of developmental level of a target. For those cases, a link between the target and a disease has already been demonstrated by the available information, but drug discovery for that particular target or for the envisaged therapeutic indication should ideally be in its childhood still.
Thus -if used in a comprehensive manner -analysing the temporal "movement of targets" from a data science perspective enables researchers to look back on their way forward when planning and directing their future research.

Data retrieval
Data was retrieved from ChEMBL23 [56] via a PostgreSQL interface (pgAdmin3) by querying for compounds along with their bioactivity information (activity type, activity standard value, standard unit etc.), structural information (canonical smiles, standard InChI, and InChIKeys), target information (including protein classification), assay information and document information. In this initial query, filters for the target organism ("Homo sapiens"), target type ("SINGLE PROTEIN"), assay confidence score (>8), document year (>1985), and standard units for activities ("NOT NULL") were set. This led to an initial dataset with 977,500 data entries (314,406 unique compounds).
Next, the data file was imported into a KNIME (version 3.5) [57] environment to perform data filtering, processing and analysis (details are provided in the following subsections).

Data filtering
The raw data (ca. 978 K data points) was filtered for compounds with a MW <= 800, publication years 1995-2016, and data points with a pChEMBL value only (nanomolar unit and "=" as relation sign for bioactivity measurement). This led to a final data set of 514,110 data points (248,822 unique compounds).

Retrieving and mapping of gene ontology annotations
A list of all protein targets with their respective protein classification as well as GO annotations was retrieved in a separate sql query (data set with 153,000 data points). Subsequent mapping of target ID's (via the ChEMBL target tid's) to protein classes and GO annotations was performed in KNIME. GO annotations were filtered for "GO biological process" annotations only.

Filtering for individual protein families
The data set containing mappings for targets, GO terms, and diseases was split into six major protein classes by filtering the column "protein_class_desc" (protein class description) for *kinase*, *7tm1*, *transcription factor*, *protease*, *ion channel*, *transporter* (to filter for kinases, GPCRs, nuclear receptors (NR), proteases, ion channels, and transporter, respectively). Mapping these six subsets with the large data set (ca. 514 K data points) via tid's led to the final data sets for the six target classes: approximately 99 K data points for kinases, 144 K for GPCRs, 28 K for NR, 42 K for proteases, 24 K for ion channels, and 20 K for transporter. In a separate step, tid -UNIPROT ID mappings were retrieved in a sql query and UNIPROT IDs were added to the final six target class data sets.

Retrieving drug-efficacy target annotations
ChEMBL provides an annotation, called DISEASE_EFFICACY, which flags whether the target assigned is believed to play a role in the efficacy of the drug in the indication(s) for which it is approved (1 = yes, 0 = no). These annotations were retrieved from ChEMBL23 via a separate sql query and mapped via target tid's and drug molregno's.

Mapping target developmental level (TDL) categories
Mappings of ChEMBL protein targets to TDL categories were kindly provided by Vishal Shiramshetty (NCATS/NIH).

Retrieving and mapping of disease annotations
Curated gene-disease associations (81,746) and mappings of UNIPROT IDs and gene IDs were downloaded in csv format from DisGeNET and imported in KNIME. These gene-disease annotations were collected from UNIPROT, CGI, ClinGen, Genomics England, CTD (human subset), PsyGeNET, and Orphanet and provide a score to rank these associations by taking into account the number and type of sources (level of curation, organisms), and the number of publications supporting the association. Only associations with a score equal or greater than 0.3 were retained (at least 1 curated source reporting the gene-disease association; 81,386 associations). First, UNIPROT IDs were mapped to diseases via the associated gene IDs in KNIME. Then, UNIPROT IDs served to map diseases to the six target class data sets.

Trend analyses
Data processing for subsequent trend calculation was performed in KNIME. Target innovation trends, GO term trends, and disease trends were described by a robust linear regression using

Linking GO annotations and diseases
The separate data sets with the mapped GO terms and mapped disease annotations were linked by first filtering for a particular disease (e.g. "malignant neoplasm of breast") or a particular GO annotation (e.g. "immune system process") and then mapped via targets. To study the target connectivities of GO terms and diseases that have been identified to cause steep increasing or decreasing trends in research attention considering a particular target family, network representations were produced usually for data from that particular target family only (in case of "immune system process" a single network was created for data from both kinases and GPCRs). Static networks were created for two different time periods: 1995-2005 and 2006-2016. Taken together the two networks form a dynamic network representation, being able to compare changes in connectivities over time.

Data visualization
Line plots and bar charts were produced in R by using the package "ggplot2" (version 3.1.0), network graphs by using the package "ggraph" (version 1.0.2).
The data for creating the stacked bar charts showing the proportions of different target classes contributing to a particular GO biological process or disease was prepared in KNIME beforehand. Only protein classes are shown as separate blocks if their percentage of bioactivities in a particular year is at least 10% of the largest contributing protein class (in any year). Target classes with minor contribution to a trend are denoted as "X_targets" in the visuals.
Bipartite network representations for selected diseases and GO terms were constructed by using the package "ggraph" (version 1.0.2) in R by displaying the connectivities of protein classes (=first set of nodes) and GO terms or diseases (=second set of nodes). Only the top 5 protein classes and top 15 GO terms/diseases are displayed respectively (with respect to percentages of annotated bioactivities). The disease-target class network for "immune system process" displays the top 10 protein classes, since these include both proteins from the kinase and GPCR families.

Data, code, and workflow availability
All data sets, R code, as well as KNIME workflows used for performing the analyses are publicly available: https://github.com/BZdrazil/Moving_Targets.

Authors' contributions
BZ designed the study, performed the analysis, and wrote the manuscript. LR helped in designing the study and performing the analysis. NB and RG advised during the process of designing and performing the study, and helped writing the manuscript. All authors read and approved the final manuscript.

GO biological process annotation GO definition GO ID
aging A developmental process that is a deterioration and loss of function over time. Aging includes loss of functions such as resistance to disease, homeostasis, and fertility, as well as wear and tear. Aging includes cellular senescence, but is more inclusive. May precede death and may succeed developmental maturation (GO:0021700).
GO:0007568 anatomical structure development The biological process whose specific outcome is the progression of an anatomical structure from an initial condition to its mature state. This process begins with the formation of the structure and ends with the mature structure, whatever form that may be including its natural destruction. An anatomical structure is any biological entity that occupies space and is distinguished from its surroundings. Anatomical structures can be macroscopic such as a carpel, or microscopic such as an acrosome. GO:0048856 anatomical structure formation involved in morphogenesis The developmental process pertaining to the initial formation of an anatomical structure from unspecified parts. This process begins with the specific processes that contribute to the appearance of the discrete structure and ends when the structural rudiment is recognizable. An anatomical structure is any biological entity that occupies space and is distinguished from its surroundings. Anatomical structures can be macroscopic such as a carpel, or microscopic such as an acrosome. The chemical reactions and pathways involving lipids, compounds soluble in an organic solvent but not, or sparingly, in an aqueous solvent. Includes fatty acids; neutral fats, other fatty-acid esters, and soaps; long-chain (fatty) alcohols and waxes; sphingoids and other long-chain bases; glycolipids, phospholipids and sphingolipids; and carotenes, polyprenols, sterols, terpenes and other isoprenoids. GO:0006629 locomotion Self-propelled movement of a cell or organism from one location to another GO:0040011 metabolic process The chemical reactions and pathways, including anabolism and catabolism, by which living organisms transform chemical substances. Metabolic processes typically transform small molecules, but also include macromolecular processes such as DNA repair and replication, and protein synthesis and degradation. GO:0008152 multicellular organism(al) development The biological process whose specific outcome is the progression of a multicellular organism over time from an initial condition (e.g. a zygote or a young adult) to a later condition (e.g. a multicellular animal or an aged adult). GO:0007275 nucleocytoplasmic transport The directed movement of molecules between the nucleus and the cytoplasm. The production of new individuals that contain some portion of genetic material inherited from one or more parent organisms. GO:0000003 response to stress Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a disturbance in organismal or cellular homeostasis, usually, but not necessarily, exogenous (e.g. temperature, humidity, ionizing radiation). GO:0006950 response to toxic substance Any process that results in a change in state or activity of a cell or an organism (in terms of movement, secretion, enzyme production, gene expression, etc.) as a result of a toxic stimulus. GO:0009636 signal transduction The cellular process in which a signal is conveyed to trigger a change in the activity or state of a cell. Signal transduction begins with reception of a signal (e.g. a ligand binding to a receptor or receptor activation by a stimulus such as light), or for signal transduction in the absence of ligand, signal-withdrawal or the activity of a constitutively active receptor. Signal transduction ends with regulation of a downstream cellular process, e.g. regulation of transcription or regulation of a metabolic process. Signal transduction covers signaling from receptors located on the surface of the cell and signaling via molecules located within the cell. For signaling between cells, signal transduction is restricted to events at and within the receiving cell. GO:0007165 (regulation of) symbiosis, encompassing mutualism through parasitism Any process that modulates the frequency, rate or extent of symbiosis, an interaction between two organisms living together in more or less intimate association. GO:0043903 transmembrane transport The process in which a solute is transported across a lipid bilayer, from one side of a membrane to the other. GO:0055085 transport The directed movement of substances (such as macromolecules, small molecules, ions) or cellular components (such as complexes and organelles) into, out of or within a cell, or between cells, or within a multicellular organism by means of some agent such as a transporter, pore or motor protein. GO:0006810 vesicle-mediated transport A cellular transport process in which transported substances are moved in membrane-bounded vesicles; transported substances are enclosed in the vesicle lumen or located in the vesicle membrane. The process begins with a step that directs a substance to the forming vesicle, and includes vesicle budding and coating. Vesicles are then targeted to, and fuse with, an acceptor membrane. GO:0016192