Exploring the understudied human kinome for research and therapeutic opportunities

The functions of protein kinases have been widely studied and many kinase inhibitors have been developed into FDA-approved therapeutics. A substantial fraction of the human kinome is nonetheless understudied. In this perspective, members of the NIH Understudied Kinome Consortium mine publicly available databases to assess the functionality of these understudied kinases as well as their potential to be therapeutic targets for drug discovery campaigns. We start with a re-analysis of the kinome as a whole and describe criteria for creating an inclusive set of 710 kinase domains as well as a curated set of 557 protein kinase like (PKL) domains. We define an understudied (‘dark’) kinome by quantifying the public knowledge on each kinase with a PKL domain using an automatic reading machine. We find a substantial number are essential in the Cancer Dependency Map and differentially expressed or mutated in disease databases such as The Cancer Genome Atlas. Based on this and other data, it seems likely that the dark kinome contains biologically important genes, a subset of which may be viable drug targets.


INTRODUCTION
Protein phosphorylation is widespread in eukaryotic cells (Cohen, 2002) and mediates many critical events in cell fate determination, cell cycle control and signal transduction (Hunter, 1995). The 3D structures (protein folds) (Knighton et al., 1991a) and catalytic activities (Adams, 2001) of eukaryotic protein kinases (ePKs), of which ~500 are found in humans (Manning et al., 2002), have been intensively investigated for many years: to date, structures for over 280 unique domains and ~4,000 co-complexes have been deposited in the Protein Data Bank (PDB). The fold of ePKs is thought to have arisen in prokaryotes (Lai et al., 2016) and evolved to include tyrosine kinases in metazoans (Darnell, 1997; Wijk and Snel, 2020), resulting in a diverse set of enzymes (Hanks and Hunter, 1995; Hanks et al., 1988) that are often linked, in a single protein, to other catalytic domains, to peptide-binding domains such as SH2 and SH3, and to other functional protein domains. In addition, 13 human proteins have two ePK kinase domains. An excellent recent review describes the structural properties of ePKs and the drugs that bind them (Kanev et al., 2019).
With an accessible binding pocket and demonstrated involvement in many disease pathways, protein kinases are attractive drug targets (Bhullar et al., 2018). Protein kinase inhibitors, and the few activators that have been identified (e.g. AMPK activation by salicylate and A-769662 (Hawley et al., 2012)), are diverse in mechanism and structure. These molecules include ATP-competitive inhibitors that bind in the enzyme active site, non-competitive "allosteric" inhibitors that bind outside the active site, small molecule PROTAC degraders whose binding to a kinase promotes ubiquitin-dependent degradation (Jones, 2018), and antibodies that target the growth factor or ligand binding sites of receptor kinases or that interfere with a receptor's ability to homo- or hetero-oligomerize (Fauvel and Yasri, 2014). Kinase inhibitors have been intensively studied in human clinical trials and over 50 have been developed into FDA-approved drugs (Kanev et al., 2019).
Despite the general importance of kinases in cellular physiology as well as their druggability and frequent mutation in disease, a substantial subset of the kinome has been relatively little studied. This has given rise to a project within the NIH's Illuminating the Druggable Genome Program (IDG) (Finan et al., 2017), to investigate the understudied "dark kinome" and determine its role in human biology and disease. IDG has distributed a preliminary list of dark kinases based on estimates of the number of publications describing that kinase and the presence/absence of grant (NIH R01) funding; we and others have started to study the properties of these enzymes (Huang et al., 2018). Defining the dark kinome necessarily involves a working definition of the full kinome and a survey of the current state of knowledge.
Establishing the membership of the human kinome is more subtle than might be expected. The first human "kinome" with 514 members, defined as proteins homologous to enzymes shown experimentally to have peptide-directed phosphotransferase activity, was put forward in a groundbreaking 2002 paper by Manning et al. (Manning et al., 2002). This list has subsequently been updated via the KinHub Web resource (Eid et al., 2017a) to include 522 proteins. The kinase domain of protein kinase A (PKA), a hetero-oligomer of a regulatory and a catalytic subunit, was the first to be crystallized and is often regarded as the prototype of the ePK fold (Knighton et al., 1991a, 1991b). This fold involves two distinct lobes with an ATP-binding catalytic cleft lying between them; at the level of primary sequence, it is characterized by 12 recurrent elements with a total of ~30 highly conserved residues (Lai et al., 2016). Of the 514 proteins in the kinome from Manning et al., 478 have an ePK fold.
Kinases have diverged in multiple ways to generate protein folds distinct in sequence and structure from PKA including the eukaryotic like fold, the atypical fold and unrelated folds. The eukaryotic like kinases (eLKs) are similar to ePKs in that they retain significant sequence similarity to the N-terminal region of ePKs but differ in the substrate binding lobe. TP53RK, a regulator of p53 (TP53) (Abe et al., 2001), is an example of a serine/threonine protein kinase with an eLK fold. Kinases with an atypical fold (aPKs) are distinct from ePKs and eLKs in that they have weak sequence similarity to ePKs, but nevertheless adopt an ePK-like three-dimensional structure. aPKs include some well-studied protein kinases such as the DNA damage sensing ATM and ATR kinases (Abraham, 2001). The aPK and eLK folds are not limited to protein kinases. The lipid kinase PI3K, one of the most heavily mutated genes in breast cancer (Mukohara, 2015), also adopts an aPK fold (Kanev et al., 2019).
Similarly, the choline kinase CHKA, a key player in dysregulated choline metabolism in cancer and a chemotherapy target, adopts the eLK fold (Glunde et al., 2011;K et al., 2016). Over 200 additional proteins are annotated as "kinase" in UniProt. The structures of these kinases are unrelated to the protein kinase fold and they are therefore termed uPKs (unrelated to Protein Kinases). Enzymes with phosphotransferase activity in the uPK family include hexokinases that phosphorylate sugars, but also protein kinases with bromodomains (e.g. BRD2, BRD3 and BRD4) as well as STK19, which displays peptide-directed phosphotransferase activity (Yin et al., 2019). Multiple uPK proteins, including those with bromodomains, bind to small molecule kinase inhibitors (Ciceri et al., 2014) making it useful, from the perspective of drug discovery, to study kinase-like proteins at the same time as kinases themselves.
While protein kinases could in principle be defined strictly as enzymes that catalyze phosphotransfer from ATP onto serine, threonine and tyrosine, such a definition would exclude structurally and functionally related lipid kinases as well as many of the protein families relevant to drug discovery. It would also fail to account for a lack of functional data for a substantial number of proteins, potentially excluding from consideration kinases that are physiologically or catalytically active but have not yet been tested in biochemical assays. This has resulted in definitions that rely on sequence alignment and structural data to identify closely related folds (Ciceri et al., 2014); in this definition, uPKs having kinase activity as well as bromodomains that potently bind and are inactivated by kinase inhibitors are often excluded. Moreover, as Hidden Markov Models (HMMs) and other ways of recognizing kinase homology have become more sophisticated, additional proteins have been added to the kinase tree (Briedis et al., 2008).
Kinases directed against molecules other than proteins or peptides regulate signal transduction and other eukaryotic regulatory pathways by phosphorylating second messengers and metabolites (Verheijen et al., 2011). These pathways often intersect with regulatory cascades controlled by protein kinases (Mosca et al., 2012) and some metabolic kinases have been demonstrated to also have activity against peptide or protein substrates (Lu and Hunter, 2018), challenging the conventional notion that small molecule and peptide-directed kinases are distinct families of enzymes. One example is the pyruvate kinase PKM2, which in addition to its well-known function in generating pyruvate from phosphoenolpyruvate, can also phosphorylate histone H3 at T11, thereby activating transcription downstream of EGFR signaling (Yang et al., 2012).
These data suggest that it would be valuable to define the kinome along multiple axes, based on fold or sequence homology, ability to bind small molecule kinase inhibitors and extent of functional analysis. An expansive list is most likely useful for the kinome-wide activity profiling that is a routine part of kinase-focused drug discovery. Profiling typically involves screening compounds against panels of recombinant enzymes (e.g. KINOMEscan (Posy et al., 2011)) or chemo-proteomics in which competitive binding to ATP-like ligands on beads (so-called kinobeads (Klaeger et al., 2017) or multiplexed inhibitor beads, MIBs (Cousins et al., 2018)) is assayed using mass spectrometry. In contrast, screens for kinases that phosphorylate a specific protein sequence would logically focus on enzymes known, or likely, to have peptide-directed phosphotransferase activity. Discovery programs might logically be directed at the understudied kinases expressed, and potentially functional, in normal cellular physiology and in disease.
In this perspective we generate new lists for membership in the full kinome based on the published literature and a variety of inclusion and exclusion criteria. We also re-compute membership in the understudied dark kinome using the automatic network knowledge assembling machine INDRA (Gyori et al., 2017). We consolidate available data on dark kinase activity and function with the goal of determining which understudied kinases merit additional attention. Functional evidence in this context is typically indirect, such as data from The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013) on the frequency with which a kinase is mutated in a particular type of cancer. In aggregate, the evidence strongly suggests that the understudied kinome is likely to contain multiple enzymes worthy of in-depth study, a subset of which may be viable therapeutic targets. All of the information in this manuscript is available in supplementary materials and is currently being curated and released via the dark kinome portal (https://darkkinome.org/).

The composition of the human kinome
An initial list of human kinases was obtained from Manning et al. (Manning et al., 2002) (referred to below as 'Manning') and a second from Eid et al. (Eid et al., 2017a) (via the KinHub Web resource); a list of dark kinases according to IDG was obtained from the NIH solicitation (updated in January 2018) and a fourth list of all 684 proteins tagged as "kinases" was obtained from UniProt; this list includes protein kinases, lipid kinases and other small molecule kinases. These lists are overlapping but not identical (Figure 1A). For example, eight IDG dark kinases absent from Manning and KinHub (CSNK2A3, PIK3C2B, PIK3C2G, PIP4K2C, PI4KA, PIP5K1A, PIP5K1B, and PIP5K1C) are found in the UniProt list. We therefore assembled a superset of 710 domains (the "extended kinome") and used curated alignment profiles (HMM models) and structural analysis (Kannan et al., 2007) to subdivide domains into three primary categories: "Protein Kinase Like" (PKL), if the kinase domain was similar to known protein kinases in sequence and 3D-structure; "Unrelated to Protein Kinase" (uPK), if the kinase domain was distinct from known protein kinases; and "Unknown" if there was insufficient information to decide (see methods). PKLs were further subdivided into eukaryotic protein kinases (ePKs, discussed in the introduction), eukaryotic like kinases (eLKs) and kinases with an atypical fold (aPKs) as previously described (Kannan and Neuwald, 2005; Kannan et al., 2007). ePKs and eLKs share detectable sequence similarity in the ATP binding lobe and some portions of the substrate binding lobe (up to the conserved F-helix (Kannan et al., 2007)). aPKs, on the other hand, display no significant sequence similarity to ePKs and eLKs, but nevertheless adopt the canonical protein kinase fold. Most aPKs lack the canonical F-helix aspartate in the substrate binding lobe, but share structural similarities with ePKs and eLKs in the ATP binding lobe (Figure 1B).
Unfortunately, the nomenclature used in making these distinctions is not consistent across sources. In this perspective aPK refers to a subset of PKLs defined by fold and sequence similarity; this is distinct from the so-called "atypical protein kinase group" (AKGs). AKG domains are usually depicted alongside the familiar Coral kinase dendrogram (Metz et al., 2018) and include protein kinases such as ATM and ATR as well as bromodomains and TRIM proteins (see below).
As noted previously (Garrett et al., 2011; Manning et al., 2002), structural, sequence-based and functional classifications of kinases are often ambiguous and overlapping. For example, the ATM aPK is known to phosphorylate the proteins DYRK2, MDM2, MDM4 and TP53 (Jassal et al., 2020) when activated by DNA double-strand breaks and it is also a member of the six-protein family of phosphatidylinositol 3-kinase-related protein kinases (PIKKs). The PIKK family has a protein fold significantly similar to lipid kinases in the PI3K/PI4K family but PI4K2A, for example, modifies phosphatidylinositol lipids and not proteins (Baumlova et al., 2014). Thus, even after extensive computational analysis, some manual curation of the kinome is necessary. We have created a sortable table enumerating all of the inclusion and exclusion criteria for individual kinases described in this perspective; it is possible to generate a wide variety of sublists from this table based on user-specific criteria (Supplemental Table S1).
One drawback of the 710-domain extended kinome set is that it is substantially larger than the 525-550 domains commonly regarded as comprising the set of human protein kinases. In many cases, it is unknown if proteins in the extended list have experimentally validated phospho-transfer activity, and if so whether it is directed against peptides, small molecules or both (Lu and Hunter, 2018). We therefore created a second "curated kinome" comprising 557 domains (544 genes) that includes all 556 PKLs plus the uPK STK19 (Supplemental Table S2); this list omits 15 uPKs found in Manning and 22 found in KinHub (including multiple TRIM family proteins (Reymond et al., 2001) that regulate and are regulated by kinases (Ozato et al., 2008), but have no known intrinsic kinase activity). The shorter list also omits bromodomains. The curated 557-domain kinome and the Manning list are compared in Figure 1C.
The utility of the extended kinome to drug discovery involving kinase inhibitors can be demonstrated by re-analysis of a large-scale chemo-proteomic dataset collected using multiplexed inhibitor beads (Klaeger et al., 2017). Overall, we found that 48 domains in the extended kinome list and not in the curated list bound to kinobeads and eight were competed off in the presence of a kinase inhibitor, the criterion for activity in this assay (Figure S1B). Pyridoxal kinase (PDXK) and adenosine kinase (ADK) were among the enzymes bound by kinase inhibitors, even though these proteins are not conventionally considered when studying kinase inhibitor mechanism of action. Because non-protein kinases participate in metabolic pathways and signaling networks, an expansive list including these non-protein kinases facilitates a systemic analysis of the mechanisms of action (MoA) of kinase inhibitors.
Thus, extended and curated kinomes and their different sublists are useful in different settings.
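The list-merging step described in this section can be illustrated with a minimal sketch. It assumes each source is available as a set of gene symbols; the function name `merge_kinase_lists` and the heavily truncated toy memberships are hypothetical and do not reproduce the actual contents of Manning, KinHub, the IDG list or UniProt.

```python
from collections import Counter

def merge_kinase_lists(sources):
    """Return (superset, counts); counts[gene] is the number of source lists naming the gene."""
    counts = Counter()
    for genes in sources.values():
        counts.update(set(genes))  # set() so that a gene is counted once per source
    return set(counts), counts

# Hypothetical, truncated source lists (illustrative membership only).
sources = {
    "Manning": {"ATM", "ATR", "WEE1", "BRD4"},
    "KinHub": {"ATM", "ATR", "WEE1", "WEE2"},
    "IDG_dark": {"WEE2", "PIP4K2C", "PI4KA"},
    "UniProt": {"ATM", "ATR", "WEE1", "WEE2", "PIP4K2C", "PI4KA", "ADK"},
}

extended, counts = merge_kinase_lists(sources)

# IDG dark kinases that are absent from both Manning and KinHub in this toy example
idg_only = sources["IDG_dark"] - sources["Manning"] - sources["KinHub"]
```

Keeping a per-source count alongside the union makes it straightforward to derive narrower sublists later (for example, genes named by at least two sources), which mirrors the sortable-table approach described above.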

Identifying understudied kinases
The original IDG dark kinome list was assembled using a bibliometric tool, TIN-X (Cannon et al., 2017), that uses natural language processing (NLP) of PubMed abstracts to assess the "novelty" of each protein. INDRA differs from simpler bibliometric tools because it is able to homogenize and disambiguate biological entities from different sources and maximize the extraction of mechanistic information. This is particularly important when a protein has multiple names, or its name changes over time. For example, the dark kinase PEAK3 was originally known as C19orf35 and was recently found to be a biologically active pseudokinase (Lopez et al., 2019). INDRA consolidates biological statements using both PEAK3 and C19orf35 as identifiers and represents them with the now-standard HGNC name PEAK3. Whenever the information is available, INDRA statements are detailed with respect to molecular mechanism and they are linked to the underlying knowledge support (the database reference or citation). For example, the INDRA network for the dark kinase WEE2 (Figure 2A) includes statements such as "Phosphorylation(WEE2(), CDK1())" and "Inhibition(WEE2(), CDK1())." These machine- and human-readable assertions state that WEE2 is active in mediating an inhibitory phosphorylation event on CDK1 (Figure 2A). INDRA associates each assertion with its underlying evidence (including database identifiers or specific sentences extracted from text and their PMIDs).
INDRA also consolidates overlapping and redundant information: in many cases a single assertion has multiple pieces of evidence (for example, three PMID citations for the phosphorylation reaction described above). Each INDRA statement is therefore a unique biochemical mechanism of action rather than a paper count. INDRA Statements can be visualized as networks of mechanisms comprising proteins, small molecules and other biological entities. Thus, INDRA can be used to efficiently explore available information on proteins and protein families.
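The distinction between counting papers and counting unique mechanisms can be sketched in a few lines. This is not the INDRA API: the tuple representation, the function names `consolidate` and `knowledge_score`, and the placeholder PMIDs are all assumptions made for illustration.

```python
from collections import defaultdict

def consolidate(raw_extractions):
    """Group raw (type, subject, object, pmid) extractions into unique statements.

    Returns a dict mapping (type, subject, object) to the set of supporting PMIDs.
    """
    evidence = defaultdict(set)
    for stmt_type, subj, obj, pmid in raw_extractions:
        evidence[(stmt_type, subj, obj)].add(pmid)
    return dict(evidence)

def knowledge_score(statements, kinase):
    """Count unique statements in which `kinase` appears as subject or object."""
    return sum(1 for (_, subj, obj) in statements if kinase in (subj, obj))

# Toy extractions with placeholder PMIDs (hypothetical, not real citations).
raw = [
    ("Phosphorylation", "WEE2", "CDK1", "PMID:A"),
    ("Phosphorylation", "WEE2", "CDK1", "PMID:B"),
    ("Phosphorylation", "WEE2", "CDK1", "PMID:C"),
    ("Inhibition", "WEE2", "CDK1", "PMID:A"),
]

statements = consolidate(raw)
wee2_score = knowledge_score(statements, "WEE2")
```

In this toy example four extractions collapse to two statements, one of which carries three pieces of evidence, so the score reflects distinct mechanisms rather than citation volume.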
We generated INDRA networks for all members of the curated kinome and used the number of mechanistic statements as a quantitative measure of knowledge about each kinase; these networks can be visualized via the NDEx service (Pratt et al., 2015). We found that prior knowledge about the curated kinome as extracted by INDRA varied by >10^4-fold and was correlated with the TIN-X "novelty" score (Pearson's correlation coefficient = 0.81). There were some cases in which the two measures were discordant; for example, PI4K2A has only 78 INDRA statements, but a high TIN-X novelty score of ~808. The reason for this inconsistency is still under investigation but in general, such errors reflect the difficulty of linking common names for genes and proteins to their unique identifiers in resources such as HGNC (a process known as entity grounding); INDRA has extensive resources to correctly ground entities and resolve ambiguities. For example, it correctly associates MEK kinase with the HGNC name MAP2K1 and not "methyl ethyl ketone".
To estimate the intensity of drug development for each kinase we used the Small Molecule Suite based on available data on the absolute binding affinity (typically obtained from enzymatic or quantitative protein binding assays), differential "on target" affinity as compared to the "off-target affinity" (typically obtained from a kinase profiling assay), the p-value between the distributions for "on" and "off" targets, and "research bias"; the latter accounts for differences in available binding data.
In the absence of a bias estimate, a poorly studied compound can appear much more selective than a well-studied one simply because few off-targets have been tested. The most selective (MS) assertion is assigned to compounds that have an absolute affinity <100 nM, an on-target Kd >100 times lower than the off-target Kd, a p-value ≤ 0.1, and a research bias ≤ 0.2 (see (Moret et al., 2019) for details). The semi-selective (SS) assertion is about 10-fold less strict with regard to absolute and differential affinity (see methods). We found that kinases that were more heavily studied were more likely to have inhibitors classified as Tclin and Tchem in PHAROS or MS or SS in Small Molecule Suite. However, a substantial number of kinases with high INDRA scores are bound only by relatively non-selective inhibitors and therefore represent opportunities for development of new chemical probes.
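The MS and SS criteria quoted above can be written as a small decision function. This is a sketch under stated assumptions, not the Small Molecule Suite implementation: the function name and argument names are hypothetical, the SS thresholds are taken to be exactly 10-fold looser (the text says only "about 10-fold"), and the p-value and research-bias cutoffs are assumed to be shared by both classes.

```python
def selectivity_class(affinity_nm, on_off_ratio, p_value, research_bias):
    """Classify a compound-target pair as 'MS', 'SS' or 'other'.

    affinity_nm   : absolute on-target affinity (Kd) in nM
    on_off_ratio  : off-target Kd divided by on-target Kd
    p_value       : p-value between on-target and off-target distributions
    research_bias : research bias estimate (lower means better-profiled compound)
    """
    # Assumption: both classes share the statistical and bias cutoffs.
    if p_value > 0.1 or research_bias > 0.2:
        return "other"
    # MS thresholds as stated in the text.
    if affinity_nm < 100 and on_off_ratio > 100:
        return "MS"
    # Assumption: SS is exactly 10-fold less strict on both affinity criteria.
    if affinity_nm < 1000 and on_off_ratio > 10:
        return "SS"
    return "other"

example_ms = selectivity_class(5, 500, 0.01, 0.1)
example_ss = selectivity_class(500, 50, 0.05, 0.1)
example_other = selectivity_class(5, 500, 0.5, 0.1)
```

Checking the bias and p-value cutoffs first reflects the point made above: a compound with few tested off-targets should not be called selective no matter how favorable its measured affinities look.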
The original NIH IDG dark kinase list encompassed approximately one-third of the kinome.
Using INDRA and TIN-X scores, we generated a new list of similar scope (schematized by the magenta box in Figure 2A, 2B) of the 182 least-studied domains in 181 proteins in the curated kinome, of which 119 were on the original NIH list and 156 in Manning or KinHub (Figure 2D). In the analysis that follows we use this recomputed list to define the "dark kinome". When the distribution of dark kinases is viewed using the standard Coral kinase dendrogram (Metz et al., 2018), an even distribution is observed across subfamilies, with the exception that only eight tyrosine kinases are judged to be understudied (Figure 3). In many cases light and dark kinases are intermingled on the dendrogram (e.g. the CK1 subgroup) but in some cases an entire sub-branch is dark (e.g. a branch with four TSSK and another with three STK32 kinases; dashed red outline). In yet other cases, a well-studied kinase is closely related to a dark kinase (SIK1 and SIK1B, or WEE1 and WEE2, for example), but it is unknown whether such pairs of isozymes are similar or redundant functionally.

Evidence for dark kinase expression and function
To consolidate existing data on the expression and possible functions of understudied kinases, we mined the Cancer Cell Line Encyclopedia (CCLE) and the Cancer Dependency Map (DepMap). Based on RNA-seq data, non-dark and dark kinases were observed to vary substantially in abundance across 1019 CCLE cell lines. Using the common threshold of RPKM ≥ 1 (Reads Per Kilobase of transcript, per Million mapped reads) (Kryuchkova-Mostacci and Robinson-Rechavi, 2017), evidence of expression was found in at least one cell line for 176 of 181 dark kinases (Figure 4A). Some dark kinases were as highly expressed as well-studied light kinases: for example, NRBP1 and PAN3 and the lipid kinases PI4KA and PIP4K2C all had maximum expression levels similar to that of the abundant and well-studied LCK tyrosine kinase. Overall, however, dark kinases had significantly lower mRNA expression levels than well-studied kinases by multiple criteria (2.1 vs 5.8 RPKM median expression level, p-value = 4.6×10^-8; 36 vs 71 RPKM maximum expression level, p-value = 2.2×10^-16 by Wilcoxon rank sum test). In CCLE proteomic data we observed that 367 kinases from the curated kinome were detected at the level of at least one peptide per protein; 110 of these were dark kinases. The difference between proteomic and mRNA data overall is likely to reflect the lower sensitivity of shotgun proteomics, but some kinases might also be subject to translational regulation. Analysis of DepMap data showed that 10 dark kinases are essential in at least 1/3 of the 625 cell lines tested to date (Figure 4B; dark blue shading), and 88 kinases are essential in at least two lines (light blue shading). We conclude that a substantial number of dark kinases are expressed in human cell lines and a subset are required for cell growth.
These data are likely to underestimate the breadth of kinase expression and function: proteins can impact cellular physiology when expressed at low levels (not detectable by shotgun proteomics) and genes can have important functions without necessarily resulting in growth defects assayable by DepMap methodology.
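The RPKM ≥ 1 expression criterion used in this section can be made concrete with a short sketch. The RPKM formula is the standard definition; the function names, the per-line dictionary layout and the numbers are hypothetical and are not drawn from CCLE.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript, per Million mapped reads."""
    # Dividing by (length / 1e3) and (total reads / 1e6) gives the 1e9 factor.
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

def expressed_in_any_line(rpkm_by_line, threshold=1.0):
    """True if the gene meets the RPKM threshold in at least one cell line."""
    return any(value >= threshold for value in rpkm_by_line.values())

# Toy example: 200 reads on a 2 kb transcript in a library of 50 million mapped reads.
example_rpkm = rpkm(200, 2000, 50_000_000)

# Hypothetical per-line values for two genes (line names are placeholders).
gene_a = {"line_1": 0.2, "line_2": 1.5}
gene_b = {"line_1": 0.2, "line_2": 0.4}
gene_a_expressed = expressed_in_any_line(gene_a)
gene_b_expressed = expressed_in_any_line(gene_b)
```

Because the test is "at least one line", a gene expressed in a single lineage still counts as expressed, which is why this criterion is permissive compared with thresholds applied to median expression.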

Dark kinases in disease
To study the possible roles of dark kinases in pathophysiology, we mined associations between diseases and either gene mutations or changes in expression. We examined The Cancer Genome Atlas (TCGA), Accelerating Medicines Partnership - Alzheimer's Disease (AMP-AD) and a microarray dataset on changes in gene expression associated with chronic obstructive pulmonary disease (COPD) (Rogers et al., 2019). COPD progressively impairs a patient's ability to breathe and is the third leading cause of death in the US. In TCGA, we compared the frequency of mutations in dark and non-dark kinases under the assumption that the two sets of kinases are characterized by the same ratio of passenger to driver mutations (Garraway and Lander, 2013) and we then looked for differential RNA expression relative to matched normal tissue (Figure 5A). In common with most TCGA analyses, mutations and differential expression were scored at the level of genes and not domains; thus, observed mutations may affect functions other than kinase activity. We performed differential expression and mutation frequency analyses for individual tumor types and for all cancers as a set (the PanCan set).
With respect to differential gene expression, we found that dark and light kinases are equally likely to be over- or under-expressed in both PanCan data and in data for specific types of cancer (in a two-sided rank-sum test with H0 = light and dark kinases have similar aberrations, p = 0.86) (Figure 5). For example, in colorectal adenocarcinoma, the dark kinase MAPK4 is one of the three most highly downregulated kinases whereas LMTK3, NEK5 and STK31 are among the seven most highly upregulated kinases (Figure S3A). This is consistent with a report that overexpression of STK31 can inhibit the differentiation of colorectal cancer cells (Fok et al., 2012).
By mining PanCan, we also found that five dark kinases were among the 30 most frequently mutated human kinases; for example, the ~3% mutation frequency of the dark MYO3A kinase is similar to that of the oncogenic RTKs EGFR and ERBB4 (but lower than the ~12% mutation frequency for the lipid kinase PIK3CA) (Figure 4B). Similarly, in diffuse large B-cell lymphoma (DLBCL), the dark kinase ITPKB, which has been reported to phosphorylate inositol 1,4,5-trisphosphate and regulate B cell survival (Schurmans et al., 2011), is more frequently mutated than KDR (~13% vs. 8% of patient samples, Figure S3B). Overexpression of KDR is known to promote angiogenesis and correlate with poor patient survival (Gratzinger et al., 2010; Holmes et al., 2007; Jørgensen et al., 2009). More recently a case study reported that patients carrying an ITPKB C873F substitution mutation had Richter's syndrome (which is characterized by sudden development of B cell chronic lymphocytic leukemia into a faster-growing and more aggressive DLBCL) and clones harboring the ITPKB C873F mutation exhibited higher growth rates (Landau et al., 2017). Recurrent mutation, over-expression and under-expression in TCGA data are not evidence of biological significance per se, but systematic analysis of TCGA data has been remarkably successful in identifying genes involved in cancer initiation, progression, and drug resistance. Our analysis shows that dark kinases are nearly as likely to be mutated or differentially expressed in human cancer as their better studied non-dark kinase homologues, making them good candidates for future testing as oncogenes or tumor suppressors.
To explore the roles of dark kinases in other diseases, we analyzed data from the AMP-AD program Target Discovery and Preclinical Validation (Hodes and Buckholtz, 2016). This large program aims to identify molecular features of AD at different disease stages. We compared mRNA expression at the earliest stages of AD to late-stage disease in age-matched samples (Figure 5C) and found that the dark kinases ITPKB and PKN3 were among the five most upregulated kinases while NEK10 was substantially downregulated. A similar analysis was performed for COPD, based on a study by Rogers et al. (Rogers et al., 2019) of five COPD microarray datasets from Gene Expression Omnibus (GEO) and two COPD datasets from ArrayExpress that aimed to identify genes with significant differential expression in COPD. By comparing the expression of genes in COPD patients to gene expression in healthy individuals, Rogers et al. identified genes significantly up- and down-regulated in patients (adjusted p-value < 0.05). We analyzed these data and found that the dark kinase PIP4K2C, which is potentially immune regulating (Shim et al., 2016), was significantly downregulated in individuals with COPD (adjusted p-value = 0.048). Additionally, CDC42BPB, nominally involved in cytoskeleton organization and cell migration (Tan et al., 2008, 2011), was significantly upregulated (adjusted p-value = 0.026) (Figure 5D). In total, five dark kinases versus fifteen non-dark kinases were differentially expressed in COPD patients. As additional data on gene expression and mutation become available for other diseases, it will be possible to further expand the list of dark kinases potentially implicated in human health.

A dark kinase network regulating the cell cycle
Inspection of INDRA networks revealed that dark kinases, like well-studied kinases, function in networks of interacting kinases. One illustrative example involves control of the central regulator of cell cycle progression, CDK1, by the dark kinases PKMYT1, WEE2, BRSK1 and NIM1K (Figure 6).
Although the homologues of some of these kinases have been well studied in fission and budding yeast, not as much is known about the human kinases (Wu and Russell, 1993). WEE2, whose expression is described as oocyte-specific (Sang et al., 2018) (but can also be detected in seven CCLE lines, six from lung cancer and one from large intestine), is likely to be similar in function to the well-studied and widely expressed homologue WEE1, which phosphorylates CDK1 on the negative regulatory site Y15. Small molecule inhibitors of WEE1 have been described (Matheson et al., 2016), and these molecules are likely to inhibit WEE2 as well. It is remarkable that enzymes so closely associated with the essential cell cycle regulator CDK1 remain relatively understudied in humans (Wu and Russell, 1993). This is particularly true of PKMYT1 and NIM1K, which are frequently upregulated in TCGA data.

Inhibition of dark kinases by approved drugs
Kinase inhibitors, including those in clinical development or approved as therapeutic drugs, often bind multiple targets. We therefore asked whether dark kinases are targets of investigational and FDA-approved small molecules by using the selectivity score (Moret et al., 2019) to mine public data for evidence of known binding and known non-binding. We identified 13 dark kinases that may be inhibited by approved drugs and an additional 12 dark kinases for which MS or SS inhibitors exist among compounds that have entered human trials (although several of these are no longer in active development). For example, the anti-cancer drug sunitinib is described in the Small Molecule Suite database as binding to the dark kinases STK17A, PHKG1 and PHKG2 with binding constants of 1 nM, 5.5 nM and 5.9 nM respectively (Figure 7A, Table S3), as opposed to 30 nM to 1 µM for VEGF receptors (the KDR, FLT1 and FLT4 kinases) and 200 nM for PDGFRA, which are well-established targets for sunitinib. Follow-on biochemical and functional experiments will be required to determine if dark kinases play a role in the therapeutic mechanisms of these and other approved drugs.
The potential for development of new compounds that inhibit dark kinases based on modification of existing kinase inhibitors can be assessed in part by examining the structures of kinase binding pockets using Bayes Affinity Fingerprints (BAFP) (Bender et al., 2006;Nguyen et al., 2013). In this cheminformatics approach, each small molecule in a library is computationally decomposed into a series of fragments using a procedure known as fingerprinting. The conditional probability of a compound binding to a specific target (as measured experimentally in profiling or enzymatic assays) given the presence of a chemical fragment is then calculated. Each target is thereby associated with a vector comprising conditional probabilities for binding fragments found in the fingerprints of compounds in the library. Subsequently, the correlation of conditional probability vectors for two proteins is used to evaluate similarity in their binding pockets from the perspective of a chemical probe. BAFP vectors were obtained from a dataset of ~5 million small molecules and 3000 targets for which known binding and non-binding data are available from activity profiling.
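The logic of the fingerprint comparison described above can be sketched with toy data. This is a simplification and not the published BAFP procedure: conditional probabilities are estimated here by plain counting, whereas the cited work may apply corrections for sparse data. All compound names, fragment labels, target labels and function names are hypothetical.

```python
from math import sqrt

def fragment_binding_probabilities(compounds, binders, fragments):
    """Estimate P(compound binds target | fragment present) for each fragment.

    compounds : dict mapping compound name to its set of fragments (fingerprint)
    binders   : set of compounds measured as binding the target
    fragments : ordered list of fragments defining the vector positions
    """
    probs = []
    for frag in fragments:
        with_frag = [c for c, frags in compounds.items() if frag in frags]
        if not with_frag:
            probs.append(0.0)  # fragment never observed; assumption: treat as 0
        else:
            probs.append(sum(c in binders for c in with_frag) / len(with_frag))
    return probs

def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sx * sy)

# Toy library: four compounds decomposed into three fragments.
compounds = {"c1": {"f1", "f2"}, "c2": {"f1"}, "c3": {"f2", "f3"}, "c4": {"f3"}}
fragments = ["f1", "f2", "f3"]

vec_a = fragment_binding_probabilities(compounds, {"c1", "c2"}, fragments)
vec_b = fragment_binding_probabilities(compounds, {"c1", "c2", "c3"}, fragments)
pocket_similarity = pearson(vec_a, vec_b)
```

A high correlation between two target vectors indicates that the same chemical fragments tend to predict binding to both targets, which is the sense in which two binding pockets are judged similar from the perspective of a chemical probe.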
We found by BAFP that the majority of kinase domains fell into two clusters, each of which had multiple dark and non-dark kinases. The close similarity of dark and non-dark kinases in "compound binding space" suggests that many more kinase inhibitors than those described in Figure 7A may already bind dark kinases or could be modified to do so (Figure 7B, Figure S5). For example, the clustering of IRAK1, IRAK4, STK17B and MAP3K7 by BAFP correlation (highlighted in Figure 7B) suggests that the STK17B binding pocket is very similar to that of IRAK1, IRAK4 and MAP3K7 and that compounds binding these non-dark kinases, such as lestaurtinib and tamatinib, may also bind STK17B.

DISCUSSION
In this perspective we revisit the criteria used to define membership in the human kinome. This is not a trivial task because no single functional, structural or historical definition exists. We have therefore assembled a table of protein domains that can be used to generate more or less expansive sets of kinases in a data-driven manner based on criteria such as function, sequence homology, known or predicted structure etc. (see Supplementary Table 1).
The human kinome includes ~50 so-called "pseudokinases" that, based on sequence alignment, lack one or more residues generally required for catalytic activity. These residues include the ATP-binding lysine (K) within the VAIK motif, the catalytic D within the HRD motif and the magnesium-binding D within the DFG motif (Kwon et al., 2019). Many pseudokinases function in signal transduction despite the absence of key catalytic residues. For example, the EGFR family member ERBB3/HER3 is a pseudokinase that, when bound to ERBB2/HER2, forms a high-affinity receptor for heregulin growth factors (Sliwkowski et al., 1994). ERBB3 over-expression also promotes resistance to therapeutic ERBB2 inhibitors in breast cancer (Garrett et al., 2011). Some proteins commonly annotated as pseudokinases have been found to have phospho-transfer activity. Haspin, for example, is annotated as a pseudokinase in the ProKinO database because it lacks a DFG motif in the catalytic domain, but it has been shown to phosphorylate histone H3 using a DYT motif instead (Eswaran et al., 2009;Villa et al., 2009). H3 phosphorylation by Haspin changes chromatin structure and mitotic outcome and is therefore physiologically important (Dai et al., 2005). The existence of biologically active pseudokinases, some of which may actually have phosphotransferase activity, is but one way in which equating kinases with a specific fold, sequence similarity, or enzymatic function is inadequate.
Based on our work, the most useful definitions of the human kinome are likely to be an expansive 710-domain "extended kinome" that broadly encompasses related sets of folds, sequences and biological functions. This list will be relevant to machine learning, chemoproteomics, small molecule profiling and genomic studies in which an expansive view of the kinome is advantageous. As one example of such a list we generated a set of 557 "curated kinase" domains that is most similar in spirit to
In the course of preparing this perspective we repeatedly ran into the challenge of correctly associating kinases described in the literature with their canonical (HGNC or UniProt) names. Many kinases have multiple names, often several common ones such as MEK1 or MAPKK1 as well as a standardized one such as MAP2K1 (HGNC:6840). In many cases, the name space changes over time (e.g. PEAK3 instead of C19orf35 (Lopez et al., 2019)). Humans find it arduous to make these associations across a vast literature and, even after extensive human training, tools based on state-of-the-art NLP such as INDRA cannot correctly ground all named entities. We have also observed that many of the regulatory sites on kinases are mis-numbered, either because residue numbering changed over time or because of confusion over isoforms and even species (Bachman et al., 2019). This leads to the problem of "unknown knowns", namely facts that have been established (or data that have been collected) but are no longer findable by the community. One of the tasks of the IDG group is therefore to identify such sources of "lost" information and correctly associate them with systematic knowledge repositories. In the specific case of the dark kinome described here, we welcome information on data we might have missed on specific kinases.
We find that at least 175 of the least studied kinases, as defined above, are expressed in CCLE cell lines (the largest available cell line panel analyzed to date) when measured by protein or by mRNA expression (and in many cases, by both). We also find that half of all dark kinases are essential in two or more of the 625 cell lines annotated in the DepMap (Tsherniak et al., 2017) and 10 are essential in at least two-thirds of DepMap lines. In addition, 27 kinases are among the top ten most mutated kinases in one or more cancer types annotated in TCGA and several others are differentially expressed in disease databases for Alzheimer's Disease or COPD. These largely indirect findings suggest that a substantial subset of dark kinases are functional in normal physiology and in disease. This information is of immediate use in studying protein phosphorylation networks and it sets the stage for studies using genetic and chemical tools to understand dark kinase function. Based on available evidence, the possibility exists that some dark kinases may also be valuable as therapeutic targets.

OUTSIDE INTERESTS
PKS is a member of the SAB or Board of Directors of Applied Biomath and RareCyte Inc and has equity in these companies. In the last five years the Sorger lab has received research funding from Novartis and Merck. Sorger declares that none of these relationships are directly or indirectly related to the content of this manuscript. Other authors declare that they have no outside interests.

Classification of the "extended kinome" and defining the "curated kinome"
To obtain a list of kinases from UniProt, all human proteins annotated to have kinase activity were extracted and filtered based on (i) interaction with ADP/ATP; (ii) presence of a kinase domain; (iii) membership in a kinase family (lists of kinase domains and kinase families are available in supplementary material). To identify human kinase sequences that belong to the Protein Kinase Like (PKL) fold, 710 sequences annotated as "kinase" in UniProt were first subjected to a similarity search against well-curated ePK profiles to identify and separate out the 8 canonical ePK groups (Eswaran et al., 2009;Manning et al., 2002;Talevich et al., 2011). eLKs were identified based on detectable sequence similarity with one or more of the ePK sequences. Sequences that share no detectable sequence similarity to ePKs were classified as aPKs. For predicted aPKs, crystal structures of the protein itself or of the closest homolog were inspected manually to check if the kinase domain adopts a canonical ePK fold. Additional support for this classification was obtained by calculating a Hidden Markov Model (HMM)-based distance score between the Pfam domains (Huo et al., 2017) and the presence/absence of key structural features distinguishing ePKs, eLKs and aPKs, as described previously (Kannan and Neuwald, 2005;Kannan et al., 2007). A subset of sequences that satisfied none of the above criteria, i.e. no detectable sequence similarity to ePKs, no clear kinase function and no homologous crystal structures, were grouped into the unknown protein kinase category (uPKs). All kinases annotated to have a PKL fold were included in the curated kinome. STK19 was also included in the curated kinome despite its uPK fold since it is known to be a serine/threonine kinase active against peptide substrates (Yin et al., 2019).

Curation of INDRA statements and generation of INDRA networks
INDRA uses natural language processing (NLP) to extract mechanistic information from literature as well as databases and represents it in a standardized format as previously described (Gyori et al., 2017). In the present study, mechanistic statements for each kinase were obtained from INDRA with the script 'get_kinase_interaction.py'. The number of INDRA statements was counted for each kinase.
Regulatory networks were generated by first assembling a mechanistic model for each kinase with the INDRA assembler.cx module and uploading the model to NDEx (Python scripts to assemble INDRA statements and assemble mechanistic networks are available on the GitHub repository http://github.com/labsyspharm/dark-kinomes).

Small molecule selectivity calculations
The specificity of small molecules was calculated according to the selectivity score (Moret et al., 2019), which uses multiple parameters to assess selectivity: (i) the absolute affinity for the 'on' target; (ii) the differential affinity between the 'on' and 'off' targets of each kinase; (iii) the p-value of the difference between the distributions of 'on' and 'off' targets; (iv) the research bias, a score indicating how broadly a compound has been tested for off-targets. The selectivity score was divided into four tiers: Most Selective (MS), Semi Selective (SS), Polyselective (PS) and Unknown (UN). MS levels are defined as an absolute affinity of Kd <100 nM (at least two measurements), a differential affinity of 100 (i.e. the affinity of the compound for the 'on' target is 100 times greater than for the 'off' targets), a p-value ≤0.1 and a research bias <0.2; SS levels are defined as an absolute affinity of Kd <1 µM (at least 4 measurements), a differential affinity of 10, a p-value ≤0.1 and a research bias <0.2; PS levels are defined as an absolute affinity of Kd <9000 nM, a differential affinity of 1 (i.e. equal affinity for 'on' and 'off' targets) and a research bias <0.2; UN levels are defined as an absolute affinity of Kd <9000 nM and a differential affinity of 1.
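The tier definitions above can be sketched as a simple decision cascade. This is a minimal illustration, not the Small Molecule Suite implementation: the function name and arguments are hypothetical, the differential affinity values are read as minimum fold differences, the tiers are tested from most to least stringent, and compound-target pairs that meet no tier return None. All three choices are assumptions of this sketch.

```python
def selectivity_tier(kd_nm, n_measurements, differential, p_value, research_bias):
    """Assign a selectivity tier to one compound-target pair.

    kd_nm: absolute affinity for the 'on' target in nM
    n_measurements: number of affinity measurements
    differential: fold difference in affinity between 'on' and 'off' targets
                  (thresholds treated as minimums; an assumption)
    p_value: p-value for the difference between 'on' and 'off' distributions
    research_bias: score for how broadly off-targets have been tested
    """
    # Most Selective
    if (kd_nm < 100 and n_measurements >= 2 and differential >= 100
            and p_value <= 0.1 and research_bias < 0.2):
        return "MS"
    # Semi Selective
    if (kd_nm < 1000 and n_measurements >= 4 and differential >= 10
            and p_value <= 0.1 and research_bias < 0.2):
        return "SS"
    # Polyselective
    if kd_nm < 9000 and differential >= 1 and research_bias < 0.2:
        return "PS"
    # Unknown: same affinity criteria, no research bias requirement
    if kd_nm < 9000 and differential >= 1:
        return "UN"
    return None
```

For example, under these assumptions `selectivity_tier(5, 3, 250, 0.01, 0.1)` returns `"MS"`, whereas the same values with a research bias of 0.5 return `"UN"`, because the three more stringent tiers all require a research bias below 0.2.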

CCLE analysis
The RNA-seq dataset 'CCLE_RNAseq_genes_rpkm_20180929.gct.gz' was downloaded from the CCLE portal (https://portals.broadinstitute.org/ccle/data) and analyzed with the script "analyzing_CCLE_data.r". The maximum expression value over all cell lines was calculated and plotted (Figure 3A). Genes were considered 'expressed' if the maximum RPKM was ≥1. The mass spectrometry dataset 'protein_quant_current_normalized.csv' was downloaded from the DepMap portal (https://depmap.org/portal/download/) and analyzed with the script "analyzing_CCLE_data.r". Proteins for which one or more peptides were detected in this dataset were considered to be expressed.
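The two expression criteria can be sketched as follows. The published analysis was performed in R with "analyzing_CCLE_data.r"; the Python functions, the placeholder gene names and the toy values below are hypothetical and serve only to illustrate the thresholds.

```python
def expressed_by_rna(rpkm_by_gene, threshold=1.0):
    """Return genes whose maximum RPKM over all cell lines is >= threshold."""
    return {gene for gene, values in rpkm_by_gene.items()
            if values and max(values) >= threshold}

def expressed_by_protein(peptides_by_protein):
    """Return proteins with one or more peptides detected in any cell line."""
    return {protein for protein, counts in peptides_by_protein.items()
            if any(count > 0 for count in counts)}

# Hypothetical toy data: one value per cell line
toy_rpkm = {
    "KINASE_A": [0.2, 3.5, 0.0],
    "KINASE_B": [0.1, 0.4, 0.9],
    "KINASE_C": [1.0, 0.0, 0.0],
}
toy_peptides = {
    "KINASE_A": [0, 2, 1],
    "KINASE_B": [0, 0, 0],
}

rna_expressed = expressed_by_rna(toy_rpkm)
protein_expressed = expressed_by_protein(toy_peptides)
# Kinases supported by either line of evidence
expressed_either = rna_expressed | protein_expressed
```

Taking the maximum over cell lines means that a gene counts as expressed if it reaches the threshold in a single line, which is a deliberately permissive criterion.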

Determination of Essential Kinases through Dependency Map
The preprocessed results of genome-wide CRISPR knockout screens were obtained from the DepMap 19Q4 Public data release (https://depmap.org/portal/download/). The results of the screens were processed as described by Dempster et al. (2019). For each kinase, cell lines with a CERES score >0.5 were classified as dependent and the number of dependent cell lines for each kinase was then tallied.
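The tallying step can be sketched as below, applying the cutoff exactly as stated in the text. The function names, placeholder kinase names, cell line labels and scores are hypothetical.

```python
def count_dependent_lines(scores_by_kinase, cutoff=0.5):
    """Count, for each kinase, the cell lines whose score exceeds the cutoff.

    scores_by_kinase: {kinase: {cell_line: score}}
    """
    return {kinase: sum(1 for score in scores.values() if score > cutoff)
            for kinase, scores in scores_by_kinase.items()}

def essential_in_at_least(dependent_counts, min_lines):
    """Return kinases classified as dependent in at least min_lines cell lines."""
    return {kinase for kinase, n in dependent_counts.items() if n >= min_lines}

# Hypothetical toy scores for two kinases across three cell lines
toy_scores = {
    "KINASE_A": {"line1": 0.9, "line2": 0.7, "line3": 0.1},
    "KINASE_B": {"line1": 0.2, "line2": 0.6, "line3": 0.3},
}

dependent_counts = count_dependent_lines(toy_scores)
essential_in_two_or_more = essential_in_at_least(dependent_counts, 2)
```

The same tally supports both summary statistics used in the Discussion: kinases essential in two or more lines and kinases essential in at least two-thirds of lines differ only in the value passed as `min_lines`.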

TCGA analysis
TCGA PanCan gene expression and mutation frequency data were obtained from cBioPortal (Cerami et al., 2012;Gao et al., 2013). To identify kinases with abnormal expression in tumors, tumor types with at least 10 paired normal tissue samples were analyzed. For each kinase, the fold change of its median expression in either all tumor tissues (general PanCan analysis) or the individual tumor tissue over its median expression in the paired healthy tissues was calculated. A p-value from a two-sided Wilcoxon-Mann-Whitney test was calculated based on the distributions of gene expression in tumor and healthy tissues in each tumor type. Adjusted p-values were calculated using the Benjamini-Hochberg procedure. To identify kinases heavily mutated in cancer, the number of patient samples with a mutation or gene fusion was counted and normalized to the total number of patient samples (10953 samples).
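The summary statistics described above can be sketched with the standard library alone. This is an illustration rather than the authors' code: the function names and toy values are hypothetical, and the p-values passed to the adjustment step are assumed to come from a separate two-sided Wilcoxon-Mann-Whitney test (for example `scipy.stats.mannwhitneyu` with `alternative="two-sided"`), which is not reimplemented here.

```python
from statistics import median

TOTAL_SAMPLES = 10953  # total number of patient samples stated in the text

def median_fold_change(tumor_values, normal_values):
    """Median tumor expression divided by median paired-normal expression."""
    return median(tumor_values) / median(normal_values)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

def mutation_frequency(n_altered_samples, total_samples=TOTAL_SAMPLES):
    """Fraction of patient samples carrying a mutation or gene fusion."""
    return n_altered_samples / total_samples

# Hypothetical toy values
fold_change = median_fold_change([4.0, 8.0, 6.0], [2.0, 3.0, 1.0])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The adjustment multiplies each p-value by the number of tests divided by its rank and then takes a running minimum from the largest rank downward, so that adjusted values never decrease as the raw p-values increase.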

AMP-AD analysis
Preprocessed count matrices of AMP-AD consortium RNA-seq data were downloaded from the AMP-AD Synapse directory. In summary, these counts were derived from raw reads using the STAR aligner (Dobin et al., 2013) and the Gencode v24 human genome annotation. In our analysis, we included all Alzheimer's disease (AD) patients from the Mount Sinai VA Medical Center Brain Bank (MSBB) and the Religious Orders Study and Memory and Aging Project (ROSMAP) study (Mostafavi et al., 2018) for which RNA-seq data from post-mortem brain was available and their age at death and Braak stage were known. Differential expression analysis was performed using the R package DESeq2 (Love et al., 2014). We fitted a generalized linear model to the expression of each gene using the Braak stage as independent variable and adjusted for age at death and study batch effect by including them as covariates. We used the Wald test implemented in DESeq2 to extract differentially expressed genes between early (Braak stages 1 and 2) and late (5 and 6) AD cases. Effect sizes were moderated using the R package apeglm (Zhu et al., 2019).

COPD differential expression analysis
A preprocessed dataset combining 5 datasets from GEO and 2 from ArrayExpress was downloaded from https://figshare.com/articles/Meta-analysis_of_Gene_Expression_Microarray_Datasets_in_Chronic_Obstructive_Pulmonary_Disease/8233175. Data were preprocessed as described in Rogers et al. (2019). Raw expression data were processed by generalized least squares (GLS) weighted models to account for heterogeneity between datasets. A Likelihood Ratio Test was used to identify differentially expressed genes. Genes with significant (adjusted p-value <0.5) differential expression in COPD versus healthy individuals and that are within the two-tailed 10% and 90% quantile were identified as genes of interest. Relative expression of these differentially expressed genes was calculated as the effect size of the GLS estimates of the individuals with COPD and healthy individuals.

Figure 3 - Dark kinases on the Coral kinase dendrogram
Kinases from the curated kinome are visualized on the Coral kinase dendrogram (Metz et al., 2018). The recomputed dark kinome is shown in blue and non-dark kinases are shown in yellow. The atypical kinase group (AGC; denoted by a blue dashed line) as previously defined by Manning and KinHub lies to the right of the dendrogram; this set includes multiple genes that are not considered to be members of the curated kinase family as described in this paper (labelled in gray). The 46 kinases in the curated kinome but not on the Coral dendrogram are listed separately to the right and organized by protein fold.
Red dashed lines denote regions of the dendrogram in which all kinases are dark.
COPD: value of 0 or 1 denoting whether the kinase is differentially expressed in COPD patients.