Abstract
Intrinsically disordered regions (IDRs) are important functional modules of several proteins, providing extra layers of regulation as switchable structural elements. Protein disorder has been associated with cancer, but it is unknown whether IDR mutations represent a distinct class of driver events associated with specific molecular mechanisms and system level properties, which would require dedicated targeting strategies. Based on an integrative computational approach, we identified 47 IDRs whose genetic mutations can be directly linked to cancer development. While not as common as alterations of globular domains, IDR mutations contribute to the emergence of the same cancer hallmarks through the modulation of distinct molecular mechanisms, increased interaction potential and specific functional repertoire. We demonstrate that in specific cancer subtypes, IDR mutations represent the key events driving tumorigenesis. However, treatment options for such patients are currently severely limited. We suggest targeting strategies that could enable successful therapeutic intervention for this subclass of cancer drivers, extending our options for personalized therapies.
Introduction
Tumorigenic variations at the genome level manifest in changes in protein structure, availability, localization, turnover, or stability. The structural and functional properties of the affected proteins determine their oncogenic or tumor suppressor roles, which, in the case of context-dependent genes, can also depend on tissue type or the stage of tumor progression. Understanding how these tumorigenic roles emerge from specific mutations is essential for subsequent drug development efforts. Genetic variations have been collected for tens of thousands of human cancer incidences and can be accessed via public repositories, such as TCGA (https://cancergenome.nih.gov/), or the COSMIC database, which also incorporates data from targeted studies(Tate et al., 2019). These data revealed that cancer samples are extremely heterogeneous both in terms of the number and type of genetic alterations. However, various patterns start to emerge when these samples are analyzed in combination(Cancer Genome Atlas Research Network et al., 2013), enabling the identification of cancer driving genes that are frequently mutated in specific types of cancer(Lawrence et al., 2014, 2013), highlighting biological processes/pathways that are commonly altered in tumor development(Ali and Sjöblom, 2009; Copeland and Jenkins, 2009) and traits that govern tumorigenic transformation of cells(Hanahan and Weinberg, 2011). Recent analyses estimated the number of driver genes to be in the low to mid-hundreds(Bailey et al., 2018), but this number could increase with the growing number of sequenced cancer genomes(Lawrence et al., 2014).
Recent computational methods can not only identify cancer genes, but also highlight specific functional modules that are critical for tumorigenesis(Buljan et al., 2018; Mészáros et al., 2016; Porta-Pardo and Godzik, 2014; Tamborero et al., 2013), identifying oncogenes that are altered via activating mutations, but also the majority of tumor suppressors that are typically deactivated by truncating mutations(Buljan et al., 2018; Mészáros et al., 2016). The positional accumulation of mutations within specific protein regions has been analyzed for structures, domains, or interaction surfaces(Engin et al., 2016; Kamburov et al., 2015; Porta-Pardo et al., 2015; Tokheim et al., 2016; Yang et al., 2015). However, a sizeable portion of human proteins corresponds to protein regions that function without inherent structure(Dyson and Wright, 2005; van der Lee et al., 2014). These intrinsically disordered proteins/regions (IDPs/IDRs) are predicted to represent around 30% of all residues of human proteins(Ward et al., 2004) and are also important for the interpretation of the effect of various disease mutations(Vacic et al., 2012).
The lack of an inherent structure has a profound effect on the way IDPs carry out their functions(Tompa, 2002; van der Lee et al., 2014). IDPs can act as flexible linkers or entropic chains, directly exploiting their conformational heterogeneity. IDPs also often interact with other biomolecules through interaction sites that are usually short and linear. These regions can serve as recognition sites for specific protein domains(Davey et al., 2012) or nucleic acids(Staby et al., 2017), act as sites of post-translational modifications(Darling and Uversky, 2018), harbour localization signals(Eisenhaber and Eisenhaber, 2007), or mediate oligomerization, modulating protein stability and functionality(Faust et al., 2014). Other types of functional IDRs cover domain-sized regions, serving as assembly sites for larger complexes(Cortese et al., 2008). In general, IDPs are core components of interaction networks and fulfill critical roles in regulation and signaling(Wright and Dyson, 2015). In accord with their crucial functions, IDPs are often associated with various diseases(Babu et al., 2011), in particular with cancer. IDRs can be integral parts of both oncogenes, such as β-catenin(Morin et al., 2016), and tumor suppressors, such as p53(Olivier et al., 2010). The prevalence of protein disorder among cancer-associated proteins was also observed at a more general level(Iakoucheva et al., 2002). A direct link between protein disorder and cancer was suggested in the case of two common forms of genetic alterations: chromosomal rearrangements(Hegyi et al., 2009) and copy number variations(Vavouri et al., 2009). In contrast, a recent analysis found that cancer-associated missense mutations had a preference for ordered regions, and suggested that the association between protein disorder and cancer could be indirect(Pajkos et al., 2012).
Nevertheless, a direct link between disordered regions and mutations was also suggested, either through the abolishment(Uyar et al., 2014) or the creation(Meyer et al., 2018) of IDR-mediated interactions, but only in a few cases. In general, the extent to which, and the mechanisms through which, disordered protein regions directly drive cancer are still largely unexplored.
The identification of functional modules that are directly altered in cancer driver genes can provide potential targets for pharmaceutical intervention. Most current anticancer drugs are inhibitors designed against enzyme activity (using either competitive or noncompetitive inhibition)(Griffith et al., 2010; Pathania et al., 2018; Scatena et al., 2008). In general, currently successful drug development efforts mainly focus on ordered protein domains, in the framework of structure-based rational drug design(Lounnas et al., 2013). However, IDPs can potentially offer new directions for cancer therapeutics(Kulkarni, 2016). Currently tested approaches include the direct targeting of IDPs by specific small compounds, or blocking the globular interaction partner of IDPs(Metallo, 2010; Neira et al., 2017). However, the direct targeting of IDPs requires radically different molecular strategies, and such approaches have yet to reach maturity.
In this work we analyzed cancer mutations from genome-wide screens and targeted studies(Forbes et al., 2016) to identify significantly mutated protein regions(Mészáros et al., 2016), and classified them into ordered and disordered regions by integrating experimental structural knowledge and prediction. Automated and high-quality manually curated information was gathered for the collected examples to gain better insights into their functional and network properties, and their roles in tumorigenic processes. We aimed to answer the following questions: What are the characteristic molecular mechanisms, biological processes, and protein-protein interaction network roles associated with proteins mutated at IDRs? And at a more generic level: how fundamental is the contribution of IDPs to tumorigenesis? Are IDP mutations just accessory events, or can they be the main, or even the sole molecular background to the emergence of cancer? How much can we gain from systematic efforts to target them, in which cancer types should IDR mutations be considered for large therapeutic gain, and finally, how should we select our targeting approaches for specific IDPs?
Results
1. Disordered protein modules are targets for tumorigenic mutations
Protein disorder is an integral part of many cancer driver proteins and is preferentially mutated in context-dependent genes
Here we used an integrated approach to define ordered and disordered functional modules in all human proteins, merging experimental annotations and structural predictions (see Data and methods). The majority of human proteins (Supplementary Figure S1) are modular with almost 70% of all proteins incorporating at least one disordered module. Module numbers were assessed separately for the census of cancer driver genes collected from the COSMIC database(Sondka et al., 2018) and the literature(Vogelstein et al., 2013). All census drivers were manually characterized as tumor suppressors, oncogenes and context-dependent cancer genes based on the literature (Supplementary Table S1). Census drivers have a higher average modularity, and contain a higher average number of disordered modules compared to all human proteins (Figure 1A, Supplementary Figure S2), and this modularity is even higher for context-dependent genes. The reported values are likely to be still conservative estimates of the true number of modules in proteins, as individual regions could contain yet uncharacterized tandem modules, especially in the case of disordered regions.
The abundance of IDRs in cancer drivers hints at the importance of disorder for the normal function of these proteins, but is not necessarily reflected in the distribution of cancer mutations. To analyze this, we collected exomic cancer mutations with a clearly localized effect from the COSMIC database (see Data and methods). Figure 1B shows that exomic mutations preferentially target ordered sequence regions in drivers, in line with earlier observations(Pajkos et al., 2012). This preference is most pronounced for oncogenes, for which less than 10% of known mutations fall into IDRs. In contrast, the fraction of mutations within IDRs shows a three-fold increase in tumor suppressor genes (17%) and reaches 47% in the case of context-dependent genes. The trends remain the same using whole exome sequencing data from TCGA (Supplementary Figure S3, Data and methods).
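The quantity compared in Figure 1B, the fraction of localized mutations that fall into IDRs, can be illustrated with a minimal sketch. The interval representation, the function name `fraction_in_idrs` and the toy coordinates below are assumptions made for illustration only; they do not reproduce the actual pipeline described in Data and methods.

```python
from bisect import bisect_right

def fraction_in_idrs(mutation_positions, idr_intervals):
    """Return the fraction of mutation positions that fall inside any IDR.

    idr_intervals: (start, end) residue pairs, 1-based, inclusive and
    non-overlapping (an assumed representation of disordered modules).
    """
    if not mutation_positions:
        return 0.0
    intervals = sorted(idr_intervals)
    starts = [start for start, _ in intervals]
    hits = 0
    for pos in mutation_positions:
        # Index of the last interval that starts at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and intervals[i][0] <= pos <= intervals[i][1]:
            hits += 1
    return hits / len(mutation_positions)

# Hypothetical toy protein, disordered at residues 1-50 and 200-260.
toy_idrs = [(1, 50), (200, 260)]
toy_mutations = [10, 75, 120, 130, 210, 300, 301, 305, 400, 401]
toy_fraction = fraction_in_idrs(toy_mutations, toy_idrs)
print(toy_fraction)
```

In this toy case two of the ten mutations (positions 10 and 210) lie within disordered intervals. Pooling such counts over all proteins of a driver class would give the class-level fractions of the kind quoted above.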
Several cancer drivers are modulated through disordered regions
In order to explore the possible causal relationship between tumorigenesis and structural properties, we used iSiMPRe(Mészáros et al., 2016) on the pre-filtered mutation data from COSMIC to identify specific regions in proteins that are directly involved in cancer development (see Data and methods). By restricting our analysis to high-confidence cases, we identified 178 ordered and 47 disordered driver regions in 145 proteins from the human proteome (Supplementary Table S2, Figure 1C). The structural status of these regions was confirmed by manual curation. The identified driver regions typically represent compact modules, usually not covering more than 10% or 20% of the sequences in the case of oncogenes and tumor suppressors, respectively (Supplementary Figure S4). While oncogenes and tumor suppressors both incorporate disordered driver regions, these regions are most common in context-dependent drivers (Figure 1C).
According to the 20/20 rule, true oncogenes are recognizable from mutation patterns, having a higher than 20% fraction of missense point mutations in recurring positions (termed the oncogene score(Vogelstein et al., 2013)). In contrast, tumor suppressors have lower oncogene scores, and predominantly contain truncating mutations. Figure 1D shows that the 20/20 rule holds true for the vast majority of the identified region-harboring oncogenes and context-dependent genes. Figure 1D also shows the oncogene scores calculated from the identified regions alone. Despite their short relative lengths (Supplementary Figure S4), the driver regions are the main source of the oncogenic effect in almost all cases. While most drivers contain both ordered and disordered modules, oncogenesis is typically mediated through either ordered or disordered mutated regions. This effectively partitions cancer drivers into ‘ordered drivers’ and ‘disordered drivers’, regardless of the exact structural composition of the full protein. Thus, IDRs are not only essential components of drivers, but the direct modulation of these regions is heavily utilized in the emergence of cancer.
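As an illustration of the oncogene score used above, the following sketch computes the fraction of mutations that are missense and occur at recurring positions. The recurrence threshold (`min_recurrence`), the optional `region` argument (one possible reading of a score "calculated from the identified regions alone") and the toy mutation list are all assumptions, not the exact definition applied in the analysis.

```python
from collections import Counter

def oncogene_score(mutations, region=None, min_recurrence=2):
    """Fraction of all mutations that are missense and hit a recurrent position.

    mutations: list of (position, kind) tuples, kind e.g. "missense".
    region: optional (start, end), inclusive; if given, only recurrent
            missense mutations inside it count toward the numerator
            (an assumed reading of a region-level score).
    """
    if not mutations:
        return 0.0
    missense = Counter(pos for pos, kind in mutations if kind == "missense")
    recurrent = 0
    for pos, count in missense.items():
        in_region = region is None or region[0] <= pos <= region[1]
        if count >= min_recurrence and in_region:
            recurrent += count
    return recurrent / len(mutations)

# Hypothetical gene with a mutational hotspot around residues 30-40.
toy = (
    [(32, "missense")] * 6
    + [(33, "missense")] * 2
    + [(300, "missense")] * 2
    + [(410, "nonsense"), (500, "frameshift")]
)
gene_score = oncogene_score(toy)
region_score = oncogene_score(toy, region=(30, 40))
print(gene_score > 0.20, region_score > 0.20)
```

Here the hypothetical hotspot region alone accounts for 8 of the 10 recurrent missense mutations, mirroring the observation that short driver regions can carry most of the oncogenic signal of a gene.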
2. Disordered drivers function via distinct molecular mechanisms
Disordered drivers employ distinctive molecular mechanisms of action
Nearly all these proteins have already been identified as potential cancer drivers in the literature, and available structural and functional information can highlight the possible mechanisms of action altered in cancer (Figure 2, Supplementary Table S3), even though the information is often incomplete.
Several of the identified highly mutated disordered regions correspond to linear motifs, including sites for protein-protein interactions (e.g. hUBPy [corresponding gene: USP8], forkhead box protein O1 [FOXO1] and ER-α [ESR1]), degron motifs that regulate the degradation of the protein (e.g. β-catenin [CTNNB1], cyclin-D3 [CCND3] and CSF-1R [CSF1R]) and localization signals (e.g. p14ARF [CDKN2A] and BAF47 [SMARCB1]). However, other types of disordered functional modules can also be targeted by cancer mutations. IDRs with autoinhibitory roles (e.g. modulating the function of adjacent folded domains) are represented by EZH2 [EZH2], a component of the polycomb repressive complex 2. While the primary mutation site in this case is located in the folded SET domain, cancer mutations are also enriched within the disordered C-terminus that normally regulates the substrate binding site on the catalytic domain. Another category corresponds to regions involved in DNA and RNA binding. The highly flexible C-terminal segment of the winged helix domain is altered in the case of HNF-3-α [FOXA1], interfering with the high affinity DNA binding. For the splicing factor hSNF5 [SRSF2], mutations affect the RNA binding region (Figure 2).
Larger functional disordered modules, often referred to as intrinsically disordered domains (IDDs), can also be the primary sites of cancer mutations. Mutated IDDs exhibit varied structural and sequence features. In pVHL [VHL], the commonly mutated central region adopts a molten globule state in isolation(Sutovsky and Gazit, 2004). The mutated region of APC [APC] incorporates several repeats containing multiple linear motif sites, which are likely to function collectively as part of the β-catenin destruction complex(Aoki and Taketo, 2007). In calreticulin [CALR], cancer mutations alter the C-terminal domain-sized low complexity region, altering Ca2+ binding and protein localization(Elf et al., 2016).
Linker IDRs, not directly involved in molecular interactions, are also frequent targets of cancer mutations. The juxtamembrane regions located between the transmembrane segment and the kinase domain of Kit [KIT] and related kinases are the main representatives of this category. Similarly, the regulatory linker region connecting the substrate- and the E2-binding domains is one of the dominant sites of mutations in the case of the E3 ubiquitin ligase c-Cbl [CBL].
One of the recurring themes among cancer-related IDP regions is the formation of molecular switches (Supplementary Table S3). The most commonly occurring switching mechanism involves various post-translational modifications (PTMs), including serine or threonine phosphorylation (e.g. cyclin-D3 [CCND3], c-Myc [MYC] and APC [APC]), tyrosine phosphorylation (e.g. c-Cbl [CBL], CD79b [CD79B], and CSF-1R [CSF1R]), methylation (e.g. histone H3s [H3F3A/H3F3B/HIST1H3B]) or acetylation (e.g. ER-α [ESR1]). An additional way of forming molecular switches involves overlapping functional modules (Figure 2). In the case of BAF47 [SMARCB1], the mutated inhibitory sequence is likely to normally mask a nuclear export signal in the autoinhibited state (Craig et al., 2002). In the case of Pax-5 [PAX5], the mutated flexible linker region is also involved in the high affinity binding of the specific DNA binding site(Garvie et al., 2001). Cancer mutations of the bZip domain of C/EBP α [CEBPA] disrupt not only the DNA binding function, but the dimerization domain as well(Paz-Priel and Friedman, 2011). In addition to their linker function, the juxtamembrane regions of kinases are also involved in autoinhibition and trans-phosphorylation, regulating degradation and downstream signaling events(Hubbard, 2004; Li and Hristova, 2010).
Different types of disordered drivers are mutated with specific mutational mechanisms
Mutational patterns are strongly associated with the role of the perturbed functional modules (see Figure 2 and online visualization links in Supplementary Table S3). Short linear motifs are typically mutated in a few key positions, often only affecting a single PTM site that plays a key role in regulating the interaction. IDDs generally show more distributed mutational patterns, in accord with their larger sizes. These regions can also be targeted by truncating mutations, which affect only the specific region, such as in the case of calreticulin [CALR] and APC [APC]. The most common mutation type for linker regions involves in-frame insertions and deletions, which alter the length of the region. These types of mutations are also common for multifunctional modules that have autoinhibitory functions and serve as linkers as well.
The collected examples of disordered regions mutated in cancer cover both oncogenes and tumor suppressors, as well as context-dependent genes. There is a slight tendency for tumor suppressors to be altered via longer functional modules, such as IDDs. Nevertheless, with the exception of linkers in tumor suppressors and IDDs in context-dependent genes, every other combination occurs even within our limited set.
3. Disordered mutations give rise to cancer hallmarks by targeting central elements of biological networks
Disordered drivers integrate biological processes through their increased interaction capacities
Almost all of the analyzed IDRs are involved in binding to a molecular partner, even some of the linkers owing to their multifunctionality. Therefore, we analyzed known protein-protein interactions of ordered and disordered cancer drivers in more detail (see Data and methods). Our results indicate that both sets of drivers are involved in a large number of interactions, and show increased betweenness values compared to average values of the human proteome, and even compared to the direct interaction partners of cancer drivers (Figure 3A). However, this trend is even more pronounced for disordered drivers. The elevated interaction capacity could also be detected at the level of molecular function annotations using Gene Ontology (see Supplementary Table S4 and Data and methods). Figure 3B shows the average number of types of molecular interaction partners for both disordered and ordered drivers, contrasted to the average of the human proteome. The main interaction partners are similar for both types of drivers, which often bind to nucleic acids, homodimerize, or bind to receptors. However, disordered drivers are able to physically interact with a wider range of molecular partners, and are also able to more efficiently interact with RNA and the effector enzymes of the post-translational modification machinery. This, in particular, can offer a way to more easily integrate and propagate signals through the cell, relying on the spatio-temporal regulation of interactions via previously demonstrated switching mechanisms (Supplementary Table S3).
The high interaction capacity and central position of disordered drivers allow them to participate in several biological processes. The association between any two processes can be assessed by quantifying the overlap between their respective protein sets (see Data and methods). We analyzed the average overlap between various processes using a set of non-redundant human-related terms of the Gene Ontology (Supplementary Table S5). The average overlap of proteins for two randomly chosen processes is 0.15%, showing that, as expected, biological processes in general utilize characteristically different gene/protein sets. Restricting proteins to the identified drivers, and only considering processes connected to at least one of them, the average overlap between processes is increased to 3.00% for ordered drivers and 5.80% for disordered drivers (Figure 3C). This shows that the integration of various biological processes is a distinguishing feature of cancer genes in general, and of disordered drivers in particular; and that IDPs targeted in cancer are efficient integrators of a wide range of processes.
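The idea of measuring process integration through shared proteins can be sketched as follows. The exact overlap measure is given in Data and methods; here the Jaccard index is used purely as an assumed stand-in, and the process and protein labels are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two sets (assumed overlap measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_pairwise_overlap(process_to_proteins, restrict_to=None):
    """Average overlap over all pairs of processes.

    restrict_to: optional protein set (e.g. a driver set); processes left
    with no member after restriction are dropped, mirroring "processes
    connected to at least one" driver.
    """
    sets = []
    for proteins in process_to_proteins.values():
        members = set(proteins)
        if restrict_to is not None:
            members &= set(restrict_to)
        if members:
            sets.append(members)
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical annotation: three processes, six proteins.
toy_processes = {
    "process A": {"P1", "P2", "P3"},
    "process B": {"P3", "P4"},
    "process C": {"P5", "P6"},
}
background_overlap = mean_pairwise_overlap(toy_processes)
driver_overlap = mean_pairwise_overlap(toy_processes, restrict_to={"P3"})
print(background_overlap, driver_overlap)
```

In the toy data the single protein shared by two processes makes the restricted overlap much higher than the background value, which is the qualitative effect reported for the driver sets.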
Disordered drivers employ characteristic molecular toolkits
Disordered and ordered drivers can employ different molecular mechanisms in order to fulfill their associated biological processes. To quantify these differences, we assembled a set of molecular toolkits integrating Gene Ontology terms (see Data and methods and Supplementary Table S6). Based on this, we calculated the enrichment of each molecular toolkit for both disordered and ordered drivers, in comparison with the full human proteome, highlighting enriched and possibly driver class-specific toolkits (Figure 3D). Receptor activity is the most enriched function for both types of drivers, owing at least partially to the fact that receptor tyrosine kinases can often be modulated via both ordered domains and IDRs (Figure 1 and Figure 2). In contrast, the next three toolkits enriched for disordered drivers are highly characteristic of them. These are gene expression regulation, the modulation of DNA structural organization, and the degradation of proteins, mainly through the ubiquitin-proteasome system. In addition, RNA processing, translation and folding is also highly characteristic of disordered drivers; and while this toolkit is not highly enriched compared to the human proteome in general, ordered drivers are almost completely devoid of this toolkit.
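A simple way to express toolkit enrichment relative to the proteome is a fold-enrichment ratio, sketched below. The function name, the ratio-based definition and the toy identifiers are assumptions for illustration; the statistic actually used is described in Data and methods.

```python
def fold_enrichment(driver_set, toolkit_set, background_set):
    """Ratio of the toolkit's frequency among drivers to its frequency
    in the background proteome (assumed enrichment measure)."""
    background = set(background_set)
    drivers = set(driver_set) & background
    toolkit = set(toolkit_set) & background
    if not drivers or not toolkit:
        return 0.0
    frac_drivers = len(drivers & toolkit) / len(drivers)
    frac_background = len(toolkit) / len(background)
    return frac_drivers / frac_background

# Hypothetical proteome of 100 proteins; 10 carry the toolkit annotation.
proteome = {f"P{i}" for i in range(100)}
toolkit = {f"P{i}" for i in range(10)}
# Ten hypothetical drivers, five of which carry the toolkit annotation.
drivers = {f"P{i}" for i in range(5)} | {f"P{i}" for i in range(50, 55)}
enrichment = fold_enrichment(drivers, toolkit, proteome)
print(enrichment)
```

A value above 1 indicates that the toolkit is more frequent among the drivers than in the background, while a value near 0 corresponds to a driver class that is essentially devoid of the toolkit.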
Modulation of disordered proteins presents an independent tumorigenic mechanism for all hallmarks of cancer
There are ubiquitously displayed features of cancer cells, often described as the ten hallmarks of cancer(Hanahan and Weinberg, 2011). In order to quantify the contribution of drivers to each hallmark, we manually curated sets of biological process terms from the Gene Ontology that represent separate hallmarks (see Data and methods and Supplementary Table S7). Enrichment analysis of these terms shows that all hallmarks are significantly over-represented in census cancer drivers compared to the human proteome (Supplementary Figure S5A), serving as a proof-of-concept for the used hallmark quantification scheme. Furthermore, comparing drivers with identified regions to all census cancer drivers shows a further enrichment (Supplementary Figure S5B), indicating that the applied region identification protocol of iSiMPRe is able to pick up on the main tumorigenic signal by pinpointing strong driver genes. Separate enrichment calculations for ordered and disordered drivers show that despite subtle differences in enrichments, in general, all ten hallmarks are over-represented in both driver groups (Figure 3E). This indicates that while the exact molecular mechanisms through which ordered domain and IDR mutations contribute to cancer are highly variable, both types of genetic modulation can give rise to all necessary cellular features of tumorigenic transformation. Hence, IDR mutations provide a mechanism that is sufficient on its own for cancer formation.
Protein, network, and cellular level attributes provide complementary information on disordered drivers
In the previous sections we elucidated the characteristic features of disordered drivers at various levels. These include five major features: the molecular mechanisms (1) at the protein level (Figure 2); the number and type of interaction partners (2), and the molecular toolkits (3) at the network/pathway level (Figure 3); the associated hallmarks (4) (Figure 3), and the tumorigenic character of the protein at the cellular level (5) (Figure 2 and Supplementary Table S3). These features provide information on the drivers at very different levels, but in theory, they might be associated in non-trivial ways. Such associations could be used to establish disease subtypes, or might direct targeting efforts. For example, a hypothetical association between SLiM mutations and the ‘Inducing angiogenesis’ hallmark would mean that counteracting SLiM mutations would be a generic approach to solid tumor treatments. However, no strong association was found when calculating the common information content encoded in the five main features (see Data and methods, Supplementary Table S8 and Supplementary Table S9). This indicates that these five aspects of disordered drivers are highly complementary, and potentially all of them need to be considered for determining efficient targeting sites.
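One common way to quantify the information shared by two categorical features is mutual information, sketched below. Whether this matches the measure used in Data and methods is an assumption; the function name and the toy feature vectors are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long lists of
    categorical labels, estimated from observed frequencies."""
    if len(xs) != len(ys):
        raise ValueError("feature vectors must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts.
        mi += (count / n) * log2(count * n / (px[x] * py[y]))
    return mi

# Hypothetical drivers described by mechanism and by one associated hallmark.
mechanism = ["SLiM", "SLiM", "IDD", "IDD"]
hallmark = ["hallmark 1", "hallmark 2", "hallmark 1", "hallmark 2"]
mi_independent = mutual_information(mechanism, hallmark)
mi_identical = mutual_information(mechanism, mechanism)
print(mi_independent, mi_identical)
```

A value near zero, as for the first toy pair, corresponds to complementary features that carry no information about each other, whereas the second pair shows the maximum attainable for two balanced classes.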
4. Disordered drivers are important players in cancer at the patient sample level
Cancer incidences can arise through disordered drivers
Using whole-genome sequencing data from TCGA we assessed the role of the identified drivers at the patient level. 10,197 tumor samples, containing over three and a half million genetic variations, were considered to delineate the importance of disordered drivers at the sample level across the 33 cancer types covered in TCGA. In driver region identification we only considered mutations with a local effect (missense mutations and in-frame indels), which naturally yielded only a restricted subset of all true drivers. However, in patient level analyses, we also considered other types of genetic alterations of the same gene, in order to get a more complete assessment of the alteration of identified driver regions per cancer type (see Data and methods).
In spite of the incompleteness of the identified set of driver genes, we still found that on average about 80% of samples contain genetic alterations that affect at least one identified ordered or disordered driver region. Thus, the identified regions are able to describe the main players of tumorigenesis at the molecular level (Figure 4A). While at the protein level typically either ordered or disordered regions are modulated (Figure 1E), at the patient level most samples show a mixed structural background, most notably in colorectal cancers (COAD and READ). Some cancer types, however, show distinct preferences for the modulation of a single type of structural element: for thyroid carcinoma (THCA) or thymoma (THYM) the molecular basis is almost always the exclusive mutation of ordered protein regions. At the other extreme, the modulation of disordered regions is enough for tumor formation in a considerable fraction of cases of liver hepatocellular, adrenocortical, and renal cell carcinomas, together with diffuse large B-cell lymphoma (LIHC, ACC, KIRC and DLBC). These results, in line with the previous hallmark analyses, show that IDR mutations can constitute a complete set of tumorigenic alterations. Hence, there are specific subsets of patients that carry predominantly or exclusively disordered driver mutations in their exome.
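The sample-level grouping into ordered, disordered and mixed structural backgrounds can be sketched as a simple set-based classification. The gene labels, the category names and the gene-level granularity are assumptions for illustration; the actual analysis works on driver regions and alteration types as described in Data and methods.

```python
from collections import Counter

def classify_sample(altered_genes, ordered_drivers, disordered_drivers):
    """Assign a sample to a structural background category based on which
    identified drivers it alters (gene-level simplification)."""
    altered = set(altered_genes)
    has_ordered = bool(altered & set(ordered_drivers))
    has_disordered = bool(altered & set(disordered_drivers))
    if has_ordered and has_disordered:
        return "mixed"
    if has_ordered:
        return "ordered only"
    if has_disordered:
        return "disordered only"
    return "no identified driver"

# Hypothetical driver sets and patient samples (placeholder gene labels).
ORDERED = {"ORD1", "ORD2"}
DISORDERED = {"DIS1", "DIS2"}
samples = {
    "s1": ["ORD1", "X"],
    "s2": ["DIS1"],
    "s3": ["ORD2", "DIS2"],
    "s4": ["X", "Y"],
    "s5": ["DIS2", "Y"],
}
tally = Counter(classify_sample(g, ORDERED, DISORDERED) for g in samples.values())
covered = 1 - tally["no identified driver"] / len(samples)
print(dict(tally), covered)
```

Computing such tallies per cancer type gives the kind of distribution summarized in Figure 4A, with `covered` playing the role of the fraction of samples explained by at least one identified driver.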
Disordered drivers display characteristic cancer-type specificity
Whole genome sequencing data was also used to assess the cancer type specificity of disordered drivers (Figure 4B). Nearly all studied cancer types have at least one disordered driver that is mutated in at least 1% of cases, with the exception of thyroid carcinoma (THCA). As such, disordered mutations play important roles in almost all cancer types, but in a highly heterogeneous way. There are only four disordered drivers that can be considered as generic drivers, being mutated in a high number of cancer types. p53 presents a special case in this regard, as it is the main tumor suppressor gene in humans, and thus is most often affected by gene loss or truncations affecting a large part of the protein. These alterations abolish the function of both the ordered and disordered driver regions at the same time (the DNA-binding domain and the tetramerization region). In contrast, the other three generic disordered drivers are predominantly altered via localized mutations in their disordered regions: the degrons of β-catenin and NRF2, and the central region of APC, and hence these are true disordered drivers. While the four generic drivers are commonly mutated in several cancer types, these cases are the exception. The majority of disordered drivers show a high degree of selectivity for tumor types, being mutated only in very specific cancers. This specificity is strongly connected to the tumorigenic roles of disordered drivers (Figure 4C). Considering 1% of patient samples as the cutoff, tumor suppressors are typically implicated in a broad range of cancer types, while oncogenes on average show a high cancer type specificity. Context-dependent disordered drivers are often mutated in only a very restricted set of cancers.
Mutation data collected in TCGA corresponds to relatively broad classes of tumor types, but does not include results of targeted studies corresponding to several rarer cancers or more specific cancer subtypes. However, for several tumor subclasses, including both malignant and benign cases, the mutation of a specific disordered driver is the main, or one of the main, driver events (Table 1). In the case of these tumor types, targeting disordered regions can have a potentially huge treatment advantage, and in many cases, the counteraction of these IDR mutations may be the only viable therapeutic strategy.
Cancer incidences arising through disordered drivers lack effective drugs
Next, we addressed how well disordered drivers are targetable by current FDA approved drugs, as collected by the OncoKB database(Chakravarty et al., 2017). This database currently contains 83 FDA-approved anticancer drugs, either as part of standard care or efficient off-label use (see Data and methods). These drugs have defined exome mutations that serve as indications for their use. The majority of these drugs target ordered domains, mostly inhibiting kinases. Currently only 7 drugs are attached to disordered region mutations, which correspond to only four sites in FGFR and c-Met. These drugs act indirectly, targeting ordered kinase domains, to counteract the effect of the listed activating disordered mutations.
This represents a clear and significant bias in treatment options against patients whose tumor genomes contain disordered drivers. Considering all mutations in patient samples gathered in TCGA, the fraction of disordered driver mutations actually serves as an indicator of whether there are suitable drugs available. Patients with mostly ordered driver mutations have a roughly 50% chance that an FDA-approved drug can be administered with expected therapeutic effect. This chance drops to 10% for patients with predominantly disordered mutations (Figure 4D). Thus, incidences of cancer arising through disordered driver mutations are currently heavily under-targeted, highlighting the need for efficient targeting strategies for IDP-driven cancers.
5. Targeting strategies can be developed for disordered drivers by considering protein features
Disordered drivers can be targeted by indirect strategies
To date, the most viable targeting approaches are enzyme inhibition(Griffith et al., 2010; Pathania et al., 2018; Scatena et al., 2008), which yielded several FDA-approved anti-cancer drugs, and protein-protein interaction (PPI) disruption, which yielded several compounds with great therapeutic promise(Ivanov et al., 2013). However, drugs directly targeting disordered regions are still lacking. Here we examined the strategy of indirect targeting, i.e. compensating for disordered driver mutations with small molecule/peptide based inhibitors against ordered domains, developed based on existing strategies. Through a few selected examples, we analyzed how the discussed features of driver proteins can help to establish successful target selection (Figure 5), while the cancer type distribution analyses of section 4 identify cancer subtypes where these targeting options provide a significant therapeutic gain.
As a general rule, the targeted protein domain should be as close to the mutated IDR as possible. Therefore, the first consideration should be the modular architecture of disordered drivers, and the identification of ordered domains for inhibition in the same protein. This approach works for drivers with an oncogenic effect (excluding tumor suppressors), harboring an activating disordered mutation. It also assumes an appropriate molecular mechanism (mostly excluding disordered domains), and finally the presence of an ordered module. For these cases, the next decisive information is whether the ordered module has an enzymatic activity. For enzymes, the relevant targeting approach is likely the use of enzyme inhibitors, such as in the case of Kit. Targeting the Kit kinase domain yielded several FDA approved drugs against GIST, melanoma and thymic tumors(Abbaspour Babaei et al., 2016), one of which has proven to be effective for the studied IDR mutations in a clinical setup as well(Groisberg and Subbiah, 2017). For domains without catalytic activity, PPI inhibition can be used to block the upregulated interactions that are responsible for the oncogenic activity, such as in the case of β-catenin. This approach produced several compounds that have shown promise in vitro and in mouse xenograft models of colorectal cancer(Shin et al., 2017).
While this selection approach can yield viable targets in a nearly automated fashion, the use of extra information about the driver proteins can offer better solutions. For example, mutations in CSF-1R upregulate receptor activity, therefore, kinase inhibitors against CSF-1R could counteract over-activation, similarly to Kit inhibitors. However, the overactivity of CSF-1R still depends on the binding of CSF-1, and - uncharacteristically for RTKs - does not bypass upstream regulation. As a result, aberrant signaling can be shut down with more easily accessible extracellular anti-CSF-1 antibodies. Similarly, blocking the cyclin D3:Cdk4/6 interaction would counteract cyclin D3 upregulation arising from degron mutations. However, taking into account that the direct interaction partner Cdk4/6 has enzymatic activity, enzyme inhibitors against Cdk can be more efficient. This substitute inhibitor approach yielded ribociclib, an FDA-approved drug for breast cancer incidences with functional Rb checkpoint(Barroso-Sousa et al., 2016).
When there is no immediate target for enzyme/PPI inhibition, suitable target domains can be located more distantly by analyzing the molecular toolkits the driver is involved in. In this case, the primary requisites of efficient targeting (oncogenic role, the presence of an ordered module, and the preference for enzymes) can be applied to proteins in the same toolkit. For example, histone H3.3 mutations - abolishing recognition by EZH2 methyltransferase - cannot be targeted by same-protein domains, as ordered histone dimers have no suppressible activity, and histone SLiM mutations do not have an activation effect. However, targeting the UTX [KDM6A] histone demethylase, which is also involved in the DNA structural modification toolkit, provides an option to counteract the effects of impaired histone methylation. As histone H3.3 G34 mutations disrupt EZH2 methyltransferase activity, resulting in aberrantly low histone methylation, UTX is also a candidate target to counteract disordered mutations resulting in EZH2 repression. This approach has shown promise both in vitro and in vivo against AML(Li et al., 2018).
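The selection rules described above can be summarized as a small decision procedure. The following Python sketch is only an illustration under stated assumptions: the `Protein` record, the `pick_module` and `select_target` functions, and the boolean flags are hypothetical names introduced here, and the sketch does not capture the case-specific knowledge (such as ligand dependence of CSF-1R) that can suggest better targets.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Protein:
    """Hypothetical minimal description of a driver or toolkit member."""
    name: str
    oncogenic: bool
    # Each ordered module is (module name, has enzymatic activity).
    ordered_modules: List[Tuple[str, bool]] = field(default_factory=list)


def pick_module(protein: Protein) -> Optional[Tuple[str, str, str]]:
    """Apply the primary requisites: oncogenic role, ordered module, enzyme preference."""
    if not protein.oncogenic or not protein.ordered_modules:
        return None
    for module, enzymatic in protein.ordered_modules:
        if enzymatic:
            return (protein.name, module, "enzyme inhibition")
    # No catalytic module: fall back to blocking interactions of an ordered module.
    module, _ = protein.ordered_modules[0]
    return (protein.name, module, "PPI inhibition")


def select_target(driver: Protein, activating_mutation: bool,
                  toolkit_members: List[Protein]) -> Optional[Tuple[str, str, str]]:
    """Try same-protein targeting first, then proteins of the same molecular toolkit."""
    if activating_mutation:
        choice = pick_module(driver)
        if choice is not None:
            return choice
    for member in toolkit_members:
        choice = pick_module(member)
        if choice is not None:
            return choice
    return None
```

For example, `select_target(Protein("Kit", True, [("kinase domain", True)]), True, [])` returns an enzyme inhibition suggestion for the Kit kinase domain, in line with the Kit example in the text. The order in which toolkit members are examined is an arbitrary choice of this sketch.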
In all cases, the corresponding hallmarks of cancer can give an indication of the achievable cellular effect of the applied inhibition (Supplementary Table S3). Both Kit and CSF-1R inhibition are expected to shut down proliferative signaling and to be effective against metastasis, which is a usual scenario for GIST, melanoma and breast cancer. Counteracting β-catenin mutations is expected to be effective against replicative immortality and angiogenesis, which is crucial for solid tumors, such as colorectal cancers. The main expected effect of Cdk inhibition is shutting down proliferative signaling, which is essential for breast cancer. Histone mutations are mainly connected to enabling replicative immortality, and targeting that hallmark (as opposed to angiogenesis) can be effective even in the case of non-solid tumors, like AML. All five presented target proteins in Figure 5 offer points of targeting for hallmarks that are essential for the associated cancer (sub)types, where their targeting can potentially have a huge advantage in a high fraction of known cases (with incidence rates in the 10.7-97% range, see Table 1).
Discussion
The identification of cancer driver genes and the elucidation of their mechanisms of action is the first step in developing efficient therapeutics. Successful identification of drivers is possible using highly customized bioinformatics pipelines that analyze genetic alterations distilled from genomic screening of tumor samples. A combined large-scale method using TCGA full exome sequencing data recently provided a catalogue of hundreds of driver genes(Bailey et al., 2018). While the method covers several main players behind common cancers, it misses many driver genes that were identified mainly in targeted efforts in cancer subtypes, which are not covered by large scale screens, mainly due to lower incidence numbers. In addition, several of the applied methods are focused on structural description of tumorigenesis at the protein level, biasing the results to preferentially cover functional protein driver regions that have well defined structures.
To get an unbiased assessment of the role of protein disorder in cancer, we used a single, structurally unbiased method on mutation data from both full exome and targeted sequencing efforts, and annotated the resulting driver set with high-quality, manual structure assignments. This resulted in a fully annotated catalogue of disordered drivers, comprising 42 driver proteins with 47 disordered driver regions. We only considered high-confidence regions to reduce noise, and to gain a largely method-independent set of drivers. Results of prediction methods in general depend on the architecture of the predictor; however, in the high-confidence regime iSiMPRe gives balanced results, largely in agreement with concurrent benchmarked methods(Porta-Pardo et al., 2017). While several of the identified driver IDRs are novel, in general, disordered driver regions have been present across several releases of the COSMIC datasets (Figure 6), starting with a few well-characterized examples, such as β-catenin and RTKs. The steady increase in the number of these examples with the continuing growth of cancer mutation data forecasts the emergence of further disorder-driven genes among strong drivers in the future. Furthermore, these examples can be complemented by additional disordered regions that are altered by more complex genetic mechanisms in cancer, such as specific frameshift mutations in NOTCH1(Wang et al., 2011), chromosomal translocations in BCR(Ballerini et al., 2012), or copy number variations in p14ARF(Lesueur et al., 2008). While our current collection is incomplete, roughly 30% of the contained proteins are not covered by recent pan-cancer identification efforts(Bailey et al., 2018), representing a major extension of drivers (Supplementary Table S3). In addition, even this limited set allowed us to understand some of the basic properties and common themes of how IDPs contribute to cancer development through their distinct structural and functional properties.
The collected disordered drivers, in agreement with the known versatility of IDRs, can function in multiple ways (Figure 2). They can form short linear motifs that mediate interactions with specific globular protein partners, but can also act as auto-regulatory regions, RNA/DNA binding regions, disordered domains, and flexible linkers(Piovesan et al., 2017; van der Lee et al., 2014). As is common for many IDPs, they can form complex molecular switches and increase the interaction capacity of proteins(Dosztányi et al., 2006; Van Roey et al., 2012), which provides means for their mutations to have a wide modulatory effect. While these typical IDP functions cover most known cases, a small subset of disordered drivers have currently unknown perturbed molecular functions, due to the lack of available structural data. However, as each molecular function is reflected in a characteristic distribution of mutations, the analysis of mutation patterns provides a way of assessing the underlying mechanistic features. For example, the extremely localized patch of dominantly missense mutations in MLH1 is characteristic of linear motifs, while the dominance of in-frame indels in MED12 probably indicates a linker/autoinhibitory function, or a short binding region, possibly mediating interaction with cyclin C(Turunen et al., 2014).
While disordered drivers function through distinct molecular mechanisms compared to ordered drivers, these differences diminish at the level of protein function. Most biological processes that give rise to the hallmarks of cancer can be altered via either ordered or disordered regions. Nevertheless, four key process groups, including the alteration of gene expression regulation, DNA organization, protein degradation, and RNA processing, translation and folding, are more likely to be modulated through disordered mutations (Figure 3). Yet, despite these preferences, all ten hallmarks of cancer can arise through either ordered or disordered mutations alone, highlighting that IDP mutations are a sufficient means for tumorigenesis. This is in line with our patient-level results: for several cancer types, a subset of patients carry only disordered driver mutations in their exome (Figure 4), which shows that, at least in these cases, there is not only a correlation between the presence of protein disorder and cancer, but a causal relationship.
While disordered mutations are the sole genetic events behind many disease instances, currently available cancer drugs are heavily biased against treatment for these cases. As a result, patients with mostly disordered mutations have considerably worse chances of having effective treatment options. IDRs would be especially important candidates for therapeutic interventions for several cancer types, such as liver, adrenocortical, and kidney cancers, together with diffuse large B-cell lymphoma. For some rarer cancer subtypes, and several benign but locally invasive neoplasms, counteracting IDP mutations seems to be the most efficient targeting option, and may be the only viable option for several subclasses of cancers (Table 1).
The successful identification of disordered drivers and corresponding tumor types provides the first step in successful targeting. In general, the direct targeting of IDPs requires fundamentally different approaches, and the development of such methods is being viewed as a new direction for cancer therapeutics(Kulkarni, 2016). Considering that most identified disordered drivers mediate an unusually high number of molecular interactions, the therapeutic modulation of these interactions can have therapeutic benefits. Compounds, such as nutlins, that block interactions between IDPs and their ordered domain partners have been extensively studied(Tisato et al., 2017). Other direct IDP targeting approaches under development focus on the inherent structural properties of disordered functional regions(Neira et al., 2017). While these approaches hold great promise, they are yet to reach clinical relevance(Metallo, 2010), and current drugs almost exclusively act on globular domains, leaving a sizeable portion of cancer patients without treatment options. In such cases, currently the only viable strategy is indirect targeting, using inhibitors against the catalytic activity or the interactions of an ordered domain, which can compensate for the alteration of the disordered region. The detailed understanding of the molecular architecture of the disordered driver, and the network it is embedded into, can offer ways for successful target selection in such cases.
The molecular, the network/pathway, and the cellular level information pertaining to the effect of disordered driver mutations provide complementary information for building generic target selection strategies (Figure 5). The described examples show that these strategies can be built using a decision tree-like approach for the identification of a suitable ordered protein module either inside the driver protein, or in the same pathway. The targeting options of Kit serve as proof-of-principle for our target selection approach. Imatinib is an FDA-approved kinase inhibitor designed against the ordered kinase domain of Kit mutated in GIST. Lately, Imatinib has been shown to be also effective against disordered exon 11 deletions and insertions, corresponding to a more aggressive subtype of GIST(Groisberg and Subbiah, 2017). In treatment regimens spanning three years, Imatinib was able to yield similar survival rates for GIST cases with both ordered and disordered Kit mutations. In this case, a drug developed based on classical structural considerations was proven to be highly effective against specific IDR mutations. Such a relationship can be accurately identified by our proposed indirect target selection protocol.
In general, the systematic analysis of disordered regions mutated in cancer can open up new, major treatment options, even solely with new approaches to target selection and without necessitating radically new drug development methods. This could be especially important in already somewhat targetable tumors, and can provide targets in rarer cancers that were previously deemed untargetable (e.g. glioblastoma, endometrioid carcinoma, or chronic myelomonocytic leukaemia), or in the case of less invasive or benign tumors that pose surgical difficulties (e.g. uterine leiomyomas in fertile women or hypophyseal adenomas) (Table 1). As a next step, combination therapy targeting several disordered drivers - such as c-Myc/ID-3/Forkhead box protein O1 and cyclin D3, which are all heavily mutated in Burkitt lymphoma(Schmitz et al., 2012) - may be the rational option instead of the currently available aggressive chemotherapy. Disordered drivers typically show high cancer type specificity, and their mutations often correspond to specific cancer subtypes (see Figure 4 and Table 1). Furthermore, most cancer types can have heterogeneous mutation patterns at the patient level (Figure 4). Thus, in the context of personalized medicine, targeted IDR sequencing can provide the necessary information for indirect target selection, which in turn can provide the means for new therapeutic interventions.
Author Contributions
B.M. contributed to conceptualization, development of methodology and software, formal analysis, investigation of the findings, developing resources, data curation, writing the manuscript, visualization of data, project administration, and acquisition of funding. B.H.S. contributed to developing software, formal analysis, investigation of the findings, writing the manuscript, and visualization of data. A.Z. contributed to conceptualization, investigation of the findings, writing the manuscript, visualization of data, and acquisition of funding. Z.D. contributed to conceptualization, development of methodology and software, investigation of the findings, developing resources, data curation, writing the manuscript, supervision, project administration, and acquisition of funding.
Declaration of Competing Interests
The authors declare no competing interests.
Data and methods
Definition of modules in human proteins
Protein regions corresponding to Pfam entities(Finn et al., 2016), binding regions in DIBS(Schad et al., 2018) and MFIB(Fichó et al., 2017), regions in DisProt(Piovesan et al., 2017) and IDEAL(Fukuchi et al., 2014), and regions with structures in the PDB(Berman et al., 2000), were considered to be separate functional modules. Pfam entities were classified structurally based on instances overlapping with either DIBS, MFIB, DisProt or IDEAL entries (annotated as disordered), or with monomeric single domain protein chains in the PDB (annotated as ordered). Pfam entities with no instances overlapping with any protein regions with a clear structural designation, were annotated using predictions, together with protein residues not covered by known structural modules. Such protein regions were defined as ordered or disordered using predictions from IUPred(Dosztányi et al., 2005; Mészáros et al., 2018) and ANCHOR(Dosztanyi et al., 2009; Mészáros et al., 2009). Residues predicted to be disordered or to be part of a disordered binding region, together with their 10 residue flanking regions were considered to form disordered modules. Regions shorter than 10 residues were discarded.
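The construction of predicted disordered modules can be illustrated with a short Python sketch. It assumes per-residue boolean predictions (disordered or part of a disordered binding region) are already available; the function name `disordered_modules` is hypothetical, and the choice to apply the length filter after adding the flanking regions is an assumption, since the text does not specify the order of the two steps.

```python
from typing import List, Tuple


def disordered_modules(predicted: List[bool], flank: int = 10,
                       min_len: int = 10) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive (start, end) modules built from predictions.

    `predicted[i]` is True if residue i is predicted to be disordered or to be
    part of a disordered binding region.
    """
    n = len(predicted)
    # Mark every residue that lies within `flank` positions of a predicted residue.
    covered = [False] * n
    for i, hit in enumerate(predicted):
        if hit:
            for j in range(max(0, i - flank), min(n, i + flank + 1)):
                covered[j] = True
    # Collect contiguous covered runs; a trailing False closes the last run.
    modules: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(covered + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:  # assumption: filter applied to final modules
                modules.append((start, i))
            start = None
    return modules
```

Overlapping flanks of neighboring predicted segments merge into a single module in this sketch, which is one plausible reading of the procedure.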
Census cancer driver genes
Known cancer driver genes were collected from the 16/01/2018 version of COSMIC census database(Futreal et al., 2004), complemented with examples from the literature(Vogelstein et al., 2013). Genes were categorized as oncogenes, tumor suppressor genes, or context-dependent genes according to annotations in the COSMIC census, in dedicated databases(Zhao et al., 2016) or the literature. The full list of census drivers is given in Supplementary Table S1.
Mutation data from COSMIC and TCGA
Cancer mutations were retrieved from the v83 version of COSMIC(Forbes et al., 2016) and the v6.0 version of TCGA. Mutations used from both databases include missense mutations, and in-frame insertions and deletions. Mutations were filtered similarly to the procedure described in(Mészáros et al., 2016). Mutations from samples with over 100 mutations were discarded to avoid the inclusion of hypermutated samples. Samples including a large number of mutations in pseudogenes or mutations indicated as possible sequencing/assembly errors in(Buljan et al., 2018) were also discarded. Samples were compared to each other in a pairwise fashion and samples sharing a large fraction of mutations were clustered together. In all analyses, only mutations from the sample with the highest number of mutations in each cluster were kept. Mutations falling into positions of known common polymorphisms(Smigielski et al., 2000) were filtered. The final set of COSMIC mutations used as an input to region identification consists of 599,137 missense mutations, 4,189 insertions and 12,670 deletions from 253,568 samples. The final set of TCGA mutations used as an input to region identification consists of 274,109 missense mutations, 2,775 insertions and 2,900 deletions from 7,058 samples.
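The sample-level part of this filtering (removal of hypermutated samples and collapsing of near-duplicate samples) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the mutation encoding, the function name `filter_samples`, the overlap measure (shared mutations divided by the size of the smaller sample), the 0.8 threshold and the single-linkage clustering are all assumptions, as the text only states that samples sharing a large fraction of mutations were clustered.

```python
from typing import Dict, Set, Tuple

Mutation = Tuple[str, int, str]  # hypothetical encoding: (gene, position, change)


def filter_samples(samples: Dict[str, Set[Mutation]],
                   max_mutations: int = 100,
                   shared_fraction: float = 0.8) -> Dict[str, Set[Mutation]]:
    # Step 1: discard empty samples and hypermutated samples (over 100 mutations).
    kept = {s: m for s, m in samples.items() if 0 < len(m) <= max_mutations}

    # Step 2: single-linkage clustering of similar samples via union-find.
    parent = {s: s for s in kept}

    def find(s: str) -> str:
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    ids = sorted(kept)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = len(kept[a] & kept[b])
            if shared / min(len(kept[a]), len(kept[b])) >= shared_fraction:
                parent[find(a)] = find(b)

    # Step 3: from each cluster keep only the sample with the most mutations.
    best: Dict[str, str] = {}
    for s in ids:
        root = find(s)
        if root not in best or len(kept[s]) > len(kept[best[root]]):
            best[root] = s
    return {s: kept[s] for s in best.values()}
```

The all-against-all comparison is quadratic in the number of samples, so a data set of the size reported here would need an indexed or blocked comparison in practice.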
Genetic alterations considered in TCGA whole sample analysis
For sample/patient level analyses (Figure 4), somatic mutations, copy number alterations and expression levels were downloaded for all 33 TCGA cancer types via the GDC Transfer Tool. Fusion data were downloaded from the Tumor Fusion Gene Data Portal. In total, we used 10,921 TCGA samples for which at least one indication was found among the four types of data. A driver region was considered to be affected in a cancer sample when (i) there were missense point mutations and/or in-frame indels within the region, (ii) there were nonsense or frameshift mutations anywhere in that gene, (iii) the gene was over- or under-expressed in the sample, or (iv) the gene was involved in a fusion event. We tagged the gene as overexpressed if the elevated expression level was accompanied by an increase in the copy number. Similarly, we required a copy number reduction along with a decreased expression level to consider a gene underexpressed. We found (i) missense mutations and/or in-frame indels falling in any region in 6,427 samples, (ii) driver genes with truncating mutations in 4,263 samples, (iii) over- and under-expressed driver genes in 1,304 and 317 samples, respectively, and (iv) driver genes involved in fusion events in 778 samples. Taken together, there were 8,444 samples (77.3%) with driver genes affected by at least one alteration.
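The four criteria above can be written as a single predicate. The sketch below is illustrative only: the `GeneAlterations` record and its field names are hypothetical, and expression and copy number changes are reduced to signed indicators rather than the actual thresholds, which are not given in the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GeneAlterations:
    """Hypothetical per-sample summary of the alterations of one driver gene."""
    inframe_positions: List[int] = field(default_factory=list)  # missense / in-frame indels
    truncating: bool = False       # nonsense or frameshift anywhere in the gene
    expression_change: int = 0     # +1 elevated, -1 decreased, 0 unchanged
    copy_number_change: int = 0    # +1 gain, -1 loss, 0 unchanged
    in_fusion: bool = False


def region_affected(alt: GeneAlterations, region: Tuple[int, int]) -> bool:
    """Return True if the driver region (inclusive start, end) counts as affected."""
    start, end = region
    in_region = any(start <= pos <= end for pos in alt.inframe_positions)   # (i)
    # (iii) expression changes only count when supported by copy number changes.
    overexpressed = alt.expression_change > 0 and alt.copy_number_change > 0
    underexpressed = alt.expression_change < 0 and alt.copy_number_change < 0
    return (in_region or alt.truncating                                     # (i), (ii)
            or overexpressed or underexpressed                              # (iii)
            or alt.in_fusion)                                               # (iv)
```

Note that only criterion (i) depends on the region boundaries; the other three are gene-level events.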
Identification and categorization of driver regions in cancer-associated proteins
Driver regions were identified using iSiMPRe(Mészáros et al., 2016) with the filtered mutations from COSMIC and TCGA, separately. Then, regions obtained from COSMIC and TCGA mutations were merged, and p-values for significance were kept from the dataset with the higher significance. Only regions with p-values lower than 10⁻⁶ were kept. Regions were initially assigned ordered or disordered status based on the structural annotation of the corresponding functional unit, incorporating experimental data as well as predictions (see Definition of modules in human proteins). The final ordered/disordered status of the identified regions was based on manual assertion taking into account information from the literature, if available (Supplementary Table S2). For the disordered regions, the level of supporting information for the disordered region is also included (Supplementary Table S3).
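The merging step can be sketched in a few lines of Python. The sketch does not call iSiMPRe; it starts from region lists assumed to have been produced already. The tuple layout, the function name `merge_regions`, the rule that overlapping regions of the same protein are fused into their union, and the application of the cutoff after merging are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical layout: (protein, start, end, p-value), with inclusive coordinates.
Region = Tuple[str, int, int, float]


def merge_regions(cosmic: List[Region], tcga: List[Region],
                  p_cutoff: float = 1e-6) -> List[Region]:
    merged: List[Region] = []
    # Sorting by protein and start lets overlapping regions be fused in one pass.
    for protein, start, end, p in sorted(cosmic + tcga):
        if merged and merged[-1][0] == protein and start <= merged[-1][2]:
            prev = merged[-1]
            # Keep the union of the coordinates and the more significant p-value.
            merged[-1] = (protein, prev[1], max(prev[2], end), min(prev[3], p))
        else:
            merged.append((protein, start, end, p))
    # Only regions with p-values below the cutoff are kept.
    return [r for r in merged if r[3] < p_cutoff]
```

With this rule, a region that is significant in only one of the two data sets still survives, carrying the p-value from that data set.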
Protein-protein interaction network analysis
Binary protein-protein interactions for the human proteome were downloaded from the IntAct database(Orchard et al., 2014) on 06/05/2018. Data were filtered for human-human interactions only, where interaction partners were identified by UniProt accessions. Interactions from spoke expansions were excluded. Interactions were kept in an undirected way. (Values for disordered drivers are quoted in Supplementary Table S3).
Gene Ontology annotations
Gene Ontology terms (GO(Ashburner et al., 2000; The Gene Ontology Consortium, 2017)) were used to quantify interaction capabilities, involvement in various biological processes, molecular toolkits, and hallmarks of cancer. In each case a separate collection of GO terms (termed GO Slim) was compiled. Each GO Slim features a manual selection of GO terms that are independent from each other, meaning that they are neither child nor parent terms of each other. Terms were assigned a level equal to the smallest number of successive parent terms needed to reach the root term of the ontology namespace (considered to be level 0).
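The level assignment corresponds to a shortest-path distance from the namespace root, which can be computed with a breadth-first search. The following sketch assumes the ontology is available as a mapping from each term to its direct parents; the function name `term_levels` and the toy identifiers used in any example are hypothetical.

```python
from collections import deque
from typing import Dict, List


def term_levels(parents: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """Return the shortest distance of every reachable term from the root (level 0)."""
    # Invert the child -> parents mapping into parent -> children.
    children: Dict[str, List[str]] = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, []).append(child)

    levels = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in levels:  # first visit in BFS gives the shortest path
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels
```

A term with several parents receives the level implied by its shortest route to the root, matching the definition given above.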
GO term enrichments in a set of proteins were calculated by first obtaining expected values. Expected mean occurrence values for GO terms together with standard deviations were calculated by assessing randomly selected protein sets from the background (the full human proteome) 1,000 times. The enrichment in the studied set is expressed as the difference from the expected mean in standard deviation units.
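As a minimal sketch of this calculation, the enrichment of a single GO term can be expressed as a z-score against randomly drawn protein sets. The function name `go_enrichment`, the data layout, the fixed random seed and the use of the population standard deviation are assumptions of this illustration; drawing random sets of the same size as the studied set is also assumed, as the text does not state the size explicitly.

```python
import random
import statistics
from typing import Dict, List, Set


def go_enrichment(term: str, study_set: List[str], background: List[str],
                  annotations: Dict[str, Set[str]],
                  n_iter: int = 1000, seed: int = 0) -> float:
    """Enrichment of `term` in `study_set`, in standard deviation units."""
    rng = random.Random(seed)

    def count(proteins: List[str]) -> int:
        return sum(term in annotations.get(p, set()) for p in proteins)

    observed = count(study_set)
    # Occurrence of the term in random protein sets drawn from the background.
    random_counts = [count(rng.sample(background, len(study_set)))
                     for _ in range(n_iter)]
    mean = statistics.mean(random_counts)
    sd = statistics.pstdev(random_counts)
    if sd == 0:
        return 0.0  # no variation in the background; reported as 0 in this sketch
    return (observed - mean) / sd
```

Positive values indicate that the term occurs more often in the studied set than expected from the background, negative values that it occurs less often.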
GO Slim for assessing interaction capacity: terms from levels 1-4 from the molecular_function namespace were filtered for ancestry and only the more specific terms were kept. I.e. terms from levels 1-3 were only included if they have no child terms. Only terms describing interactions containing the keyword ‘binding’ were kept. Individual terms are shown in Supplementary Table S4.
GO for the assessment of process overlaps: terms from levels 1-4 from the biological_process namespace were filtered for ancestry and only the most specific terms were kept. Only those terms were considered that were attached to at least one protein from the set studied (full human proteome, ordered drivers, or disordered drivers). Individual terms are shown in Supplementary Table S5.
GO for molecular toolkits: biological_process terms attached to proteins with identified regions were filtered for ancestry. The resulting set was manually filtered, yielding 93 terms, which were manually grouped into 16 toolkits. Enrichments for toolkits were calculated as the ratio of the sum of expected and observed values for individual terms. Individual terms and enrichments for each toolkit are shown in Supplementary Table S6.
GO for hallmarks of cancer: Terms were chosen from the biological_process namespace via manual curation using the GO annotations of known cancer genes as a starting point. Terms were only kept if they showed a significant (p<0.01) enrichment on proteins in the full census cancer driver set compared to randomly selected human proteins. Individual terms and enrichments for each hallmark are shown in Supplementary Table S7.
Current anticancer drugs from OncoKB
The list of anticancer drugs and their targets were downloaded from OncoKB(Chakravarty et al., 2017). This list contains 247 indications, which consist of 83 drugs targeting 45 genes with varied alterations at different levels. We filtered for levels 1, 2 and 3 only (FDA-approved, Standard care, and Clinical evidence, respectively). From the different alterations we selected point mutations, indels, truncating mutations, fusions and amplifications, irrespective of the listed cancer types.
There were 28 targetable positions in 10 genes with 24 associated drugs. There were only 4 disordered positions targeted by 7 drugs (AZD-4575, BGJ-398, JNJ-42756493 and Debio1347 targeting FGFR residues 370, 371 and 373; and Cabozantinib, Capmatinib and Crizotinib targeting MET in position 1010). Notably, 5 of the 28 positions (EGFR-747, KIT-654, KIT-670, MET-1010 and MTOR-2014) were not mutated in TCGA. Targetable insertion and deletion events were only assigned to EGFR exon 19. Only PTCH1 had targetable truncating mutations. There were 13 actionable fusion events. For most cases only one fusion partner was defined, and there were only two cases for which both fusion partners were stated (BCR-ABL1, PCM1-JAK2).
Comparison between protein features
The shared information content/association between various protein features were assessed using Jaccard indices. All of the following considered features were converted into binary descriptor vectors (see Supplementary Table S3): molecular functions (protein level - 6 descriptors), number of PPI partners (3 descriptors - network level), involvement in molecular toolkits (16 descriptors - network level), involvement in hallmarks of cancer (10 descriptors - cell level), and tumorigenic character (2 descriptors - cell level). The Jaccard indices calculated between all possible pairs of descriptors are given in Supplementary Table S8, and the average Jaccard indices between all possible pairs of features are given in Supplementary Table S9.
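For reference, the Jaccard index between two binary descriptor vectors, and its average over all descriptor pairs of two features, can be computed as in the sketch below. The function names are hypothetical, each vector is assumed to hold one 0/1 entry per protein, and returning 0 when both vectors are empty is a convention chosen here, not one stated in the text.

```python
from typing import List, Sequence


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Jaccard index of two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("descriptor vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def mean_jaccard(feature_a: List[Sequence[int]],
                 feature_b: List[Sequence[int]]) -> float:
    """Average Jaccard index over all descriptor pairs of two features."""
    values = [jaccard(a, b) for a in feature_a for b in feature_b]
    return sum(values) / len(values) if values else 0.0
```

For instance, the vectors `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one protein out of three carrying either descriptor, giving an index of 1/3.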
Data availability
The authors declare that the data supporting the findings of this study are available within the paper and its supplementary information files.
Supplementary material
Supplementary table legends
Table S1. List of cancer driver genes. Genes were taken from the COSMIC census list as of 21/03/2018, and were complemented with known cancer genes from the literature.
Table S2. List of regions identified using iSiMPRe, based on both COSMIC and TCGA mutations.
Table S3. Identified disordered driver genes with all annotations.
Table S4. Gene Ontology terms used in the quantification of interaction capabilities.
Table S5. Gene Ontology terms used in the quantification of biological process overlaps.
Table S6. Gene Ontology terms used in the quantification of molecular toolkits used by cancer driver genes.
Table S7. Gene Ontology terms used in the quantification of hallmarks of cancer.
Table S8. Jaccard indices of the similarity between all studied protein features.
Table S9. Averaged Jaccard indices of the similarity between various merged features.
Supplementary figures
Acknowledgements
This work was supported by the “Lendület” grant from the Hungarian Academy of Sciences (LP2014-18) (Z.D.), OTKA grants (K108798 and K129164) (Z.D.), the EMBO|EuropaBio fellowship 7544 (B.M.), and the grant PD-120973 of the National Research, Development and Innovation Office of Hungary (A.Z.). The authors thank Mark Adamsbaum and Drs Toby J. Gibson, Péter Tompa and László Buday for the critical reading of and their constructive comments on the manuscript.