Abstract
Background Malignant tumors originate from genomic and epigenomic alterations, which lead to loss of control of the cellular circuitry. These alterations relate to each other in patterns of mutual exclusion and co-occurrence that affect prognosis and treatment response and highlight the need for multitargeted therapy. However, to the best of our knowledge, there are no systematic reports in the literature of co-dependent and mutually exclusive mutations across all types of cancer. In addition, the studies reported so far generally deal with whole genes instead of specific mutations, ignoring the fact that different alterations in the same gene can have widely different effects.
Results Here we present a systematic analysis of co-dependencies of somatic mutations across all cancer types. Combining multiple testing with conditional and expected mutational probabilities, we have found pairs and networks of co-mutations and exclusions, some of them in particular types of cancer and others widespread. We have also determined that driver loci are present in more types of cancer than non-driver loci, that they tend to pair within a single gene and that those pairs are more often exclusions than co-mutations.
Conclusions Based on these properties, we propose new drivers that warrant experimental validation. Our analysis is potentially relevant for cancer biology and classification, as well as for the rational selection of multitargeted therapeutic approaches.
Background
Cancer is one of the most important health problems worldwide and, despite recent advances in diagnosis and therapy, cancer-associated mortality remains unacceptably high. Malignant tumors originate from genomic and epigenomic modifications which lead to loss of control of the cellular circuitry. Alteration of specific pathways enables tumors to bypass or activate a particular set of cellular processes, the so-called hallmarks of cancer [1], that confer adaptive advantages on tumor cells.
A particular biological pathway in cancer cells can be altered by somatic mutations or other changes in several genes. For example, in glioblastoma multiforme (GBM) the p53 pathway is downregulated in up to 87% of the tumors, but the genetic basis of this downregulation varies from patient to patient. The possible causes are somatic mutations or homozygous deletion of the protein p53 (TP53) or the cyclin dependent kinase inhibitor 2A (CDKN2A), and amplification of the two genes encoding the double minute proteins (MDM2/MDM4). This and other examples have provided increasing evidence that genetic alterations in cancer-related genes cluster within a limited set of essential biological pathways [2, 3, 4].
Tumor profiling projects have also unveiled mutually exclusive alterations across many patients, including driver mutations in specific genes [5]. For instance, TP53 mutations and MDM2 amplification in GBM rarely occur together (few patients harbor both lesions). Additional examples include the mutual exclusivity in colorectal cancer between mutations in the adenomatous polyposis coli protein (APC) and catenin beta-1 (CTNNB1) genes (both involved in the beta-catenin signaling pathway) or mutations in BRAF and KRAS (genes of the RAS/RAF signaling pathway). In serous ovarian cancer, mutual exclusivity has been observed between mutations of the breast cancer type susceptibility proteins BRCA1 and BRCA2 and the epigenetic silencing of BRCA1, while mutations in EGFR and KRAS are mutually exclusive in non-small cell lung cancer [2].
But cancer profiling has also discovered several cases of co-occurring alterations, suggesting that some changes in associated pathways may elicit complementary rather than redundant effects [6]. Examples include the PTEN (Phosphatidylinositol 3,4,5-trisphosphate 3-phosphatase and dual-specificity protein phosphatase) deletion concomitant with ERBB2 (Receptor tyrosine-protein kinase) amplification in breast cancer [7], MET (Hepatocyte growth factor receptor) activating mutations when VHL (Von Hippel-Lindau disease tumor suppressor) is deleted in renal carcinoma [8] and the CDKN2A suppression together with BRAF activating mutations in melanoma [9, 2].
Despite all this evidence, to the best of our knowledge, there are no systematic analyses in the literature of co-dependent mutations across all types of cancer. The only studies available, which used the TCGA dataset and the 2008 version of COSMIC, deal with whole genes and not specific mutations [10, 2, 11, 12, 13, 14], ignoring the fact that different mutations in the same gene can have widely different effects; e.g. G735S, G796S and E804G induce oncogenic activation of EGFR in prostate cancer while R841K has no functional relevance [15].
A better and wider understanding of co-dependencies between mutations is relevant in many aspects, such as tumour classification, diagnosis or treatment choice. In this respect, co-dependency relationships between genetic alterations are evidence of mutational epistasis [16] and highlight the need for multitargeted therapy. Several new generation antitumor drugs target proteins carrying specific driver mutations, e.g. sorafenib is active against renal and hepatic cell carcinomas harboring the BRAF.V600E mutation [9]; imatinib against gastrointestinal stromal tumors with mutations V560G, K642E, N822H or N822K in KIT or mutation V561D in PDGFRA [17]; gefitinib, erlotinib and afatinib against non-small cell lung cancers with exon 19 deletions or L858R in EGFR [18]; and dabrafenib or vemurafenib against BRAF.V600E in melanoma [19]. However, these treatments focusing on a single alteration are almost invariably followed by relapse due to selection of resistant cells [20]. Multitargeted approaches against co-occurring, biologically relevant mutations have the potential to delay the onset of resistance, and a better understanding of co-occurring oncogenic alterations could be of help in this setting. It must be remembered that some of these approaches are already in clinical use, such as the combination of BRAF and MEK inhibitors in metastatic melanoma [21], and many others are currently being tested in clinical trials [22].
Here we present the first systematic analysis of COSMIC somatic mutations [23] aimed at uncovering cancer-specific patterns of mutation associations, demonstrating that such an analysis is feasible and yields valuable information.
Results
Data were downloaded from COSMICv75 and filtered to obtain a dataset of recurrent non-synonymous mutated protein positions in common cancer types. A cancer type was defined as the unique combination of tissue, histology and sub-histology (e.g. lung/carcinoma/adenocarcinoma). Mutations were grouped by position in loci, e.g. BRAF.V600E, BRAF.V600D, BRAF.V600K and BRAF.V600R were grouped as BRAF.V600. Positions with driver mutations were identified according to the Kin-driver database [24] in the case of protein kinases and to the literature in the case of NRAS, KRAS, HRAS [25, 26], PIK3CA [27] and TP53 proteins [28]. The dataset used for the analysis was composed of 1,098,411 samples from 687 cancer types with 365,096 mutated loci (289 of them known driver loci) in 1,329 genes (see methods).
In order to avoid multiple-testing correction, previous approaches had focused on the identification of clusters of mutated genes through exploration of cellular pathways and statistical testing of significance [10, 2, 11, 12, 13, 14]. In contrast, we have used a simple pipeline combining multiple testing with conditional and expected mutational probabilities to define pairs of co-dependent loci in the different types of cancer. Those pairs were subsequently merged into a single network where general traits about cancer co-mutation and exclusion could be observed.
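The grouping of mutations into loci can be illustrated with a minimal sketch. It handles only simple substitutions of the form shown in the example above (V600E and similar); the function names and the regular expression are hypothetical and are not taken from the authors' pipeline.

```python
import re
from collections import defaultdict

# Hypothetical pattern for simple substitutions such as "V600E":
# reference residue, position, alternate residue.
_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def locus_of(gene: str, mutation: str) -> str:
    """Return the locus label (gene.RefPos) of a substitution, e.g. BRAF.V600."""
    match = _SUBSTITUTION.match(mutation)
    if match is None:
        raise ValueError(f"not a simple substitution: {mutation!r}")
    ref, pos, _alt = match.groups()
    return f"{gene}.{ref}{pos}"


def group_by_locus(mutations):
    """Group (gene, mutation) tuples under their shared locus label."""
    groups = defaultdict(list)
    for gene, mutation in mutations:
        groups[locus_of(gene, mutation)].append(f"{gene}.{mutation}")
    return dict(groups)


example = [("BRAF", "V600E"), ("BRAF", "V600D"), ("BRAF", "V600K"),
           ("BRAF", "V600R"), ("KRAS", "G12D")]
grouped = group_by_locus(example)
```

Here the four BRAF substitutions collapse into the single locus BRAF.V600, while KRAS.G12D is assigned to KRAS.G12. Deletions, insertions and frameshifts would need their own patterns, which are not sketched here.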
Counts of co-sequenced loci reveal pairs of significantly related mutations
As our first interest resided in related mutations, we tested all the pairs of co-sequenced loci where each member of the pair was mutated more than 10 times. This analysis was performed in each cancer type, since admixing could hide signals characteristic of a particular malignancy. The arbitrary threshold of 10 mutated samples attempted to filter out uninteresting mutations that might co-occur randomly and to make the analysis more comprehensible. Thus, we obtained a starting dataset containing 262,375 pairs of co-sequenced loci from a total of 135 cancer types.
Each pair of loci was tested for co-dependency using Fisher's exact test, which compares the number of samples with the two loci mutated with those with one or none. We found these numbers to be highly unbalanced due to the differential sequencing of loci (with some of them extensively and others only rarely sequenced) and the large proportion of samples with no mutations (supplementary figure 11). In consequence, false positives were possible; in particular, poorly co-sequenced pairs could be falsely detected as co-dependent owing to contingency tables with low values. To avoid this problem, we enhanced previous algorithms by combining the co-dependency tests with a comparison between observed and expected co-mutation probabilities (see methods “Filtering of probable false positives”). Thus, we ended up with 30,679 pairs of dependent loci in particular cancer types whose relation is not an artifact due to low mutational frequencies.
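As an illustration of the test described above, the following sketch computes a two-sided Fisher exact p-value for a 2x2 table of sample counts and compares the observed co-mutation probability with the one expected under independence. Taking the expected value as the product of the two marginal mutation probabilities is an assumption made here for illustration; the authors' exact filter is described in their methods. The function names and the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: both loci mutated, b: only the first, c: only the second, d: neither.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x double-mutant samples given the margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    low, high = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of every table no more probable than the observed one.
    p_value = sum(prob(x) for x in range(low, high + 1)
                  if prob(x) <= p_observed * (1 + 1e-7))
    return min(1.0, p_value)


def observed_vs_expected(a, b, c, d):
    """Observed co-mutation probability and its expectation under independence."""
    n = a + b + c + d
    observed = a / n
    expected = ((a + b) / n) * ((a + c) / n)  # assumed: product of the marginals
    return observed, expected


# Toy counts for one pair of loci in one cancer type (not real data).
p_value = fisher_exact_two_sided(30, 20, 25, 925)
observed, expected = observed_vs_expected(30, 20, 25, 925)
```

With these toy counts the observed co-mutation probability (0.03) is well above the expected one (0.00275), so the pair would be retained as a candidate co-mutation rather than discarded as a low-count artifact.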
Conditional probability discriminates the type of dependency between pairs of related loci
If the probability of mutation of a locus increased when another mutation in a different locus is present, we considered that those loci co-mutated. That is, if the a posteriori mutation probability of a locus surpassed its a priori mutation probability, a co-mutation was assigned in a particular cancer type; if the opposite happened, mutations were considered to be mutually exclusive. The difference between a priori and a posteriori probabilities was usually low; in consequence, we defined the confidence interval of the a priori probabilities for each mutation and subsequently classified the mutations according to the rules depicted in figure 1.
If P(mi|mj) < CI(mi) AND P(mj|mi) < CI(mj) → mutual exclusion
If P(mi|mj) > CI(mi) AND P(mj|mi) > CI(mj) → bidirectional co-mutation
Where P(mi|mj) is the conditional probability of mi given mj and CI(mi) is the 95% confidence interval of the a priori probability of mutation of mi.
If the mutation probability of a locus increased given a second mutation but the reciprocal did not happen, the co-mutation would have been considered unidirectional. However, in our analysis we did not encounter a single case of this class of dependency.
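The classification rules above can be sketched as follows. The text does not state how the 95% confidence interval of the a priori probability was computed, so a normal-approximation (Wald) interval is assumed here; "below CI" is read as below its lower bound and "above CI" as above its upper bound. All names and counts are hypothetical.

```python
from math import sqrt

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def prior_interval(mutated, total):
    """Wald 95% interval of the a priori mutation probability (assumed CI method)."""
    p = mutated / total
    half_width = Z_95 * sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def classify_pair(n_total, n_i, n_j, n_both):
    """Classify a pair of loci from co-sequenced sample counts.

    n_total: samples where both loci were sequenced
    n_i, n_j: samples mutated at locus i and at locus j
    n_both: samples mutated at both loci
    """
    low_i, high_i = prior_interval(n_i, n_total)
    low_j, high_j = prior_interval(n_j, n_total)
    p_i_given_j = n_both / n_j  # a posteriori probability of mi given mj
    p_j_given_i = n_both / n_i  # a posteriori probability of mj given mi
    if p_i_given_j < low_i and p_j_given_i < low_j:
        return "mutual exclusion"
    if p_i_given_j > high_i and p_j_given_i > high_j:
        return "bidirectional co-mutation"
    if p_i_given_j > high_i or p_j_given_i > high_j:
        return "unidirectional co-mutation"
    return "unclassified"


# Toy counts (not real data).
co_mutated = classify_pair(1000, 50, 55, 30)
exclusive = classify_pair(1000, 200, 300, 5)
```

In the first toy pair both conditional probabilities (30/55 and 30/50) lie far above the a priori intervals, giving a bidirectional co-mutation; in the second, both (5/300 and 5/200) fall below them, giving a mutual exclusion.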
Network of loci with co-dependent mutations
As a result of our analysis, we found 189 pairs of exclusions and 30,490 of bidirectional co-mutations (in total: 30,679 pairs of co-dependent mutations) involving 568 loci across 67 cancer types and 22 tissues of origin. The involvement of only 568 loci in the 30,679 pairs highlights the fact that a locus can mutate repeatedly in different types of cancer.
As 30,272 of the 30,679 pairs (98.67%) were found in upper aerodigestive tract/carcinoma/squamous cell carcinoma (supplementary figure 2, also interactive in http://sdmn.leloir.org.ar/, bottom link), they were studied separately (see below), and our analysis focused on the rest of the cancer types. Thus, we ended up with 407 pairs of related mutations, 218 co-mutations and 189 exclusions, involving a total of 260 loci from 94 genes in 66 cancer types. The pairs are shown in the network depicted in figure 2 and can be interactively explored in the following link: http://sdmn.leloir.org.ar/. The user can look for a locus (e.g. KRAS.G12) in the search box or see the dependencies it has by clicking the node. Also, by applying appropriate filters, users can see the relationships within a particular type of tumor or gene, or display only the co-mutations, the exclusions, the genes listed in the Cancer Gene Census [29] or the driver mutations. In addition, figure 2 integrates the mutational probability of each position (size of the node), the tissues where the pair occurs (edge color), whether a protein belongs to the Census (black outline at the nodes) and whether the locus is known to have driver mutations (red outline at the nodes). It is worth noting that we have found dependent mutations within 94 genes from a starting dataset of 1,329 genes, meaning that the remaining 1,235 genes have mutations that do not significantly associate with others.
Distribution of repeated pairs of mutations across cancer types and prevalence of proteins in some cancer types
The network depicted in figure 2 shows how some pairs of associated loci appear in different cancer types (nodes connected by more than one edge) and also how pairs of loci involving particular genes are prevalent in some tumors.
Regarding pairs repeated across different types of cancer, we found a total of 52, involving 58 loci from 24 proteins, all of them reported in the Cancer Gene Census. Additionally, of those 58 loci, 31 (57%) are known to harbor driver mutations. About two thirds (34) of the 52 pairs were found in different tumors from the same tissue of origin while 18 appeared in different tissues of origin. Although most of the 52 repeated pairs behaved equally across cancer types, we found a few examples where the type of association diverged; namely 6 pairs, which were mutually exclusive in most cancer types but occasionally co-occurred (see supplementary table 2).
We also found an abundance of specific genes in a particular type of tumor like CCAAT/enhancer-binding protein alpha (CEBPA) loci in 16 out of the 27 pairs found in acute myeloid leukaemia; KIT (Mast/stem cell growth factor receptor) loci as a partner in 19 out of 20 significantly related pairs of soft gastrointestinal stromal tumor (GIST); and 54 of 60 pairs in lung carcinomas involving loci from the epidermal growth factor receptor (EGFR). In fact, only large intestine/carcinoma/adenocarcinoma, the second cancer after upper aerodigestive tract/carcinoma/squamous cell carcinoma in number of pairs, escaped from this predominance of loci from specific proteins in the pairs in a particular type of cancer (figure 2 and supplementary figure 3a).
In our analysis we also found the presence of cliques, that is, groups of several interconnected loci, in the network of some tumors (all can be explored in interactive figure 2). Examples include the clique of exclusions between KIT.W557del, KIT.Y503ins, KIT.V559, KIT.V560, KIT.W557 and PDGFRA.D842 in soft tissue/gastrointestinal stromal tumour/NS and the clique formed by FGFR3.S249, FGFR3.Y373, FGFR3.G370 and FGFR3.R248 in urinary tract/carcinoma/transitional cell carcinoma. Other interesting cliques were a group of 8 co-mutating loci involving 6 different proteins in kidney (kidney/carcinoma/clear cell renal cell carcinoma) and the 5 frameshifts co-occurring in SORBS2 in large intestine (see supplementary figure 3).
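Cliques of this kind can be recovered from a list of pairs with a standard maximal-clique enumeration. The sketch below uses the basic Bron-Kerbosch recursion; the text does not say which algorithm the authors used, so this is only one possible choice. The FGFR3 loci are taken from the example above, whereas the node "TOY.X1" and its edge are hypothetical and added only to show a second, smaller clique.

```python
from itertools import combinations


def maximal_cliques(edges):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when no node can extend it any further.
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


fgfr3 = ["FGFR3.S249", "FGFR3.Y373", "FGFR3.G370", "FGFR3.R248"]
# Every pair among the four FGFR3 loci is connected, as a clique requires;
# the extra edge to "TOY.X1" is a made-up example.
edges = list(combinations(fgfr3, 2)) + [("FGFR3.S249", "TOY.X1")]
cliques = maximal_cliques(edges)
```

The enumeration returns the four-locus FGFR3 clique and the two-node clique formed by the hypothetical extra edge.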
Driver loci can be distinguished from non-driver loci based on their associations
Looking for specific properties of the driver loci (encircled with red in figure 2), we noticed a large quantity of edges of different tumors arising from common driver nodes like BRAF.V600 and KRAS.G12 (edge colors stand for cancer types), suggesting that driver loci interact in a higher diversity of cancers than non-drivers. When we checked this analytically, we found that the two groups of nodes have a significantly distinct degree, mutational frequency and number of cancers in which they are present, as can be seen in figure 3. Driver loci have more edges (Mann Whitney U Test p-value = 1.583e-05) and are present in more cancer types (Mann Whitney U Test p-value = 8.86e-12) than non-drivers, but show lower mutational frequencies (Mann Whitney U Test p-value = 0.006581).
We were expecting that the loci more frequently mutated would be more connected in more types of cancer. But, while connectivity and number of cancers were highly correlated (Pearson coefficient: 0.8408, p-value = 2.2e-16), connectivity and mutational frequency and number of tumors affected and mutational frequency were only partially correlated (Pearson coefficient: 0.5205, p-value = 2.2e-16 and 0.4733, p-value = 6.3e-16, respectively).
Loci with driver mutations tend to exclude while non-drivers tend to co-mutate
Next, we divided the pairs of related loci into those with (i) no driver, (ii) one driver and (iii) two driver loci. This classification was found to be associated with the type of relationship between the loci in the pair (X-squared p-value <2.2e-16). As shown in figure 4, the most common association between pairs of two driver loci is mutual exclusion, 80.5% (128/159). In contrast, 80.8% (151/188) of the no driver pairs were found to be co-mutations. It is worth noting that some well-known exclusions between driver loci appeared clearly in our analysis, such as the mutual exclusion of KRAS, BRAF and NRAS mutations in colorectal cancer [30] (figure 2, zoomed in supplementary figure 3c).
The only 20 co-mutations where both members are drivers comprise 25 loci from 7 proteins in 12 cancer types. Only one pair of associated loci involves two different genes, KRAS.G12 + BRAF.V600 in thyroid/carcinoma/anaplastic carcinoma. There are 6 pairs that appear in more than one cancer type, 5 of them involving EGFR in different types of lung carcinoma, with the co-mutation between EGFR.L858 and EGFR.T790 being the most common (present in 5 cancers).
Driver mutation pairs tend to occur in the same gene
When we considered all pairs, we found that loci of a single gene are associated almost as frequently as loci from two different genes (Mantel-Haenszel X-squared p-value = 0.6578). Next, we applied the classification presented in figure 4 and found that pairs with no drivers and pairs with one driver do not show a preference to occur in the same or different genes. In contrast, pairs with 2 drivers behaved differently, with a majority of them associating loci from the same gene. Namely, 96.77% of driver co-mutations and 68.75% of driver exclusions were found to involve loci in the same gene (Fisher Test p-value = 0.0009532; X-squared p-value = 0.002964).
The squamous cell carcinoma of upper aerodigestive tract mutates in an uncommon way
Upper aerodigestive tract/carcinoma/squamous cell carcinoma (SCC) is the name used to denote a variety of cancers, with divergent etiologies, originating in the epithelium of the head or neck [31]. It is by far the least sequenced malignancy in the COSMIC database, with 2,048 samples versus the 32,392 of large intestine/carcinoma/adenocarcinoma (the most sequenced one). SCC exhibits a high mutational frequency that could explain the extremely elevated number of significant dependent pairs encountered (supplementary figure 2, and http://sdmn.leloir.org.ar, bottom link). The 30,272 pairs found in SCC correspond to co-mutations associating 307 loci of 45 proteins. It is striking that there are no exclusion pairs and no previously described driver loci involved. Only 8 of the 45 proteins (PDE4DIP, NCOR1, HLA-A, NOTCH1, NOTCH2, BCOR1, KMT2C and SETBP1) are reported in the Census, forming 411 pairs with 65 loci.
Discussion
Although the advent of Next Generation Sequencing technologies has allowed the mutational profiling of thousands of tumor samples, no systematic studies have been published of co-occurring and mutually exclusive mutations. Here, we report a cancer-type specific bioinformatics analysis of the thousands of somatic mutations described in COSMIC, which were grouped in loci, aimed at discovering co-occurrence and exclusion pairs. In our analysis, we found some well-known pairs that validate our approach, such as the mutual exclusion of KRAS.G12/G13 with EGFR loci in lung cancer or KRAS.G12/G13 with BRAF.V600 in several neoplasias. Another example is the co-occurrence of EGFR.L858 + EGFR.T790 in lung adenocarcinoma. Somatic mutations in L858 confer sensitivity to tyrosine kinase inhibitors targeting EGFR (EGFR TKIs). However, patients ultimately relapse and one of the most common mechanisms of resistance to TKIs is the emergence of the p.T790M mutation. This observation constitutes an example of how a systematic analysis of co-occurrences in tumor rebiopsies after progression on targeted therapies can help to find loci associated with resistance.
However, we also encountered controversial co-occurrences like KRAS.G12 + KRAS.G13 in anaplastic thyroid carcinoma, prostate adenocarcinoma and papillary thyroid carcinoma or BRAF.V600 + KRAS.G12, again in anaplastic thyroid carcinoma. These mutations are generally regarded as mutually exclusive in most malignancies [32, 33]. To track this discrepancy, we reviewed the articles reporting co-mutations of these loci. Garcia-Rostan et al. described the co-mutation KRAS.G12 + KRAS.G13 in poorly differentiated (papillary) and undifferentiated (anaplastic) thyroid carcinomas, but did not discuss it further [34]. The same co-mutation in prostate adenocarcinoma was reported in two articles [35, 32], with Silan et al. remarking on the high Gleason score and PSA (prostate specific antigen) levels of patients carrying both mutations, both being indicators of an aggressive tumor. Meanwhile, Costa et al. identified a strong link between clinical parameters indicative of unfavourable prognosis and BRAF.V600E associated with other genetic events, such as the co-mutation BRAF.V600 + KRAS.G12 [36]. Another unexpected pair of associated loci was the co-mutation EGFR.L858 + EGFR.G719 in lung squamous cell carcinoma (LSCC), since the frequency of EGFR mutations in LSCC is low [37] and, as a consequence, the L858+G719 co-mutation is very uncommon, appearing only in approximately 1/2000 patients (frequency 0.0007520682).
Another issue that should be considered when trying to explain unexpected co-mutations is intratumoral heterogeneity. Subclonal populations within the same tumor [38] can explain the presence of mutually exclusive mutations, as sequencing admixes the genomes of different cells. Zou et al. even report a BRAF mutation in a whole thyroid tumor, and a RAS mutation only in some sections [39]. These examples of tumor heterogeneity reflect the need for serial examination of tumors during the course of therapy, and of different areas within a single tumor [40]. Genetic heterogeneity has also been linked to a variable clinical response to treatment, with primary tumors and metastatic regions responding differently to the same drug [40]. Whatever the explanation for the unexpected co-mutations, they are probably capturing a general feature of the corresponding tumors that warrants testing and, if found, should be considered during treatment. In tumors with BRAF.V600 and KRAS.G12, either via a true co-mutation or due to tumor heterogeneity, both loci should be targeted to avoid the selection of subclonal resistance mutations through treatment [38].
During our search for mutational patterns, we made a number of general observations regarding pairs of associated loci. First, we discovered that loci within a single gene (or occasionally two genes) predominate in a particular type of cancer, indicating a relevant role for the corresponding protein(s) in that specific malignancy. Some examples were CEBPA in acute myeloid leukemia, EGFR and KRAS in lung cancer and KIT in GIST. According to the literature, activating KIT mutations are present in a majority of GISTs from soft tissue [41], while the oncogenes EGFR and KRAS in lung cancer are not always mutated but are used in routine diagnosis [42, 43]. Large intestine adenocarcinoma was the only relevant tumor that did not show an association with a particular gene in the co-mutational and exclusion patterns. This finding was not unexpected, since it has been demonstrated that several genes can play key roles in the development of this malignancy [44, 45].
We also observed that driver mutations present three common properties, namely (i) driver loci tend to mutate in a mutually exclusive fashion, (ii) driver loci pairs are repeatedly present in several cancer types and (iii) driver loci pairs frequently occur within the same gene. These three properties can facilitate the discovery of new drivers and, as a proof of concept of this idea, we searched loci fulfilling at least one property in our network of associations, finding a total of 172 possible new driver loci in genes of the Cancer Census (supplementary table 3). Of them, 15 were located in 4 protein kinases, which we further analyzed via structural alignment using the Kin-driver database. We found that 8 loci were in positions known to be drivers in other kinases, and 9 mapped to hyper mutated segments (-HS-) where driver mutations have been shown to cluster [46] (Table 1). Furthermore, one of these loci, KIT.D419del, has indeed been described as a driver [47] although it was not considered as such in our analysis; and there is experimental evidence suggesting a driver role for 7 additional loci [48, 49, 50, 51, 52].
In addition to protein kinases, some loci in other types of genes exhibited the three properties mentioned above, so we consider them as candidate drivers. Examples include R282 in the TP53 gene or K385fs and L367fs in the calreticulin gene (CALR). Mutations in the TP53.R282 locus are relatively common and they exclude with the drivers TP53.R175 and TP53.R273 in two cancer types according to our results. According to the literature, the R282W mutation is actually a driver: it has been described to shift the DNA binding domain of the p53 protein to a dysfunctional structure [53] and to cause an earlier onset of familial cancer and a shorter overall survival in cancer patients [54]. Regarding CALR.L367fs and CALR.K385fs, they exclude with other frameshifts in the same gene, and also with thrombopoietin receptor (MPL) V515 and the known driver JAK2.V617 in three haematopoietic neoplasms. This pattern is not a new finding; the triad of exclusion among JAK2, MPL and CALR has been previously reported [55]. Frameshifts in exon 9 of the CALR gene, such as CALR.L367fs and CALR.K385fs, are common in some myeloproliferative disorders and change the C terminal charge of the protein, altering its subcellular localization, stability and function [56, 57].
In addition to pairs, we found cliques of significantly related loci (listed in supplementary Table 4). The co-mutation cliques, where mutations in each locus are more probable when some of the other loci are mutated, can represent combinations conferring synergistic adaptive advantages to the tumor cells. An example is the clique formed by IDH1.R132, NPM1.W288fs, DNMT3A.R882 and FLT3.D835 in acute myeloid leukaemia that emerged in our analysis. Mutations in IDH1 and DNMT3A, and in DNMT3A and FLT3, have been described to be simultaneously present in a significant number of myeloid leukaemia patients, and concomitant mutations in the triad NPM1/DNMT3A/FLT3 have been associated with a worse overall survival in this malignancy [58, 59]. These experimental findings support our hypothesis of a synergistic advantage of co-mutation cliques. In contrast, exclusion cliques probably reveal loci that alter the same biological pathway. A mutation in one of them is enough to acquire the corresponding hallmark of cancer and, therefore, tends to exclude the others [6].
The pairs and cliques of associated loci that emerged in our analysis, particularly those involving drivers, might prove useful not only in cancer biology studies, but also for the selection of therapies. Cancer treatment faces several challenges, such as the selection of appropriate markers for targeted and non-targeted agents and the relapse to a more aggressive disease after an initially successful treatment. Since most of the new antitumor drugs specifically target mutated or genetically altered proteins, co-mutations can suggest combined treatments that can prove more effective than single agent approaches. In contrast, exclusions might indicate that certain combinations of drugs are unlikely to be useful in a meaningful percentage of patients. For instance, we found that the IDH1.R132 and the TP53.R273 loci co-mutated in gliomas. Mutations in the IDH1.R132 locus are very frequent in this malignancy and inhibitors of mutant IDH are currently in trials to prove their clinical utility as single agents or in combination strategies targeting additional oncogenic pathways [60]. Anti-mutant TP53 drugs are also being tested in clinical trials [61], and our results indicate that they might be an appropriate partner for IDH inhibitors in a significant number of gliomas.
The absence of links is also of interest. One example is mutations in PTEN that lead to the loss of protein expression and have been related to resistance to many targeted therapies, such as EGFR inhibitors in EGFR-mutant lung cancer [62], anti-EGFR antibodies in non-mutated KRAS/NRAS colorectal cancer [63] or BRAF inhibitors in BRAF-mutant melanoma [64]. In our analysis, PTEN mutations are not significantly dependent on EGFR, KRAS, NRAS or BRAF mutations in different cancer types, with the only exception of the co-mutation with KRAS.G12 in endometrium/carcinoma shown in figure 2. In consequence, it can be estimated that, if the above-mentioned therapies are tested in patients of all the other tumor types, the percentage of cases not responding due to loss of PTEN will be equal to the overall frequency of PTEN mutations in that particular malignancy. Data of this kind can be of great interest when trying to find new applications for targeted agents.
One important limitation of our study derives from the fact that, for a vast majority of the tumors compiled in the COSMIC database, only a limited number of genes have been sequenced, while whole exome or whole genome sequenced tumors are scarce. In consequence, we are likely missing a significant number of associations between loci simply because they have been rarely sequenced. This limitation also explains a counterintuitive observation made during our analysis, namely that the most mutated loci are not the most connected ones (figure 3). For example, the G151 locus of the potassium channel KCNJ5 was almost as frequently mutated in the samples of the COSMIC database as KRAS.G12, with values of 0.1827 and 0.1893, respectively; but KCNJ5.G151 was significantly less connected. The explanation lies in the fact that, while the ubiquitous driver (KRAS.G12) has been sequenced in 164,511 samples from all types of tumor, the KCNJ5 status has only been reported in 2,671 samples, most of them (81.2%) from adrenal gland/adrenal cortical adenoma/aldosterone producing, a cancer type where KCNJ5 mutations are prevalent [65]. To overcome this limitation, we plan to periodically update our analysis and we are confident that we will find new associations of loci relevant for both cancer biology and treatment.
The three-properties approach might not be adequate to find possible driver loci in upper aerodigestive tract squamous cell carcinoma. Driver mutations for this cancer may have exceptional characteristics or, more likely, drivers in this malignancy are genomic aberrations other than mutations. Experimental evidence seems to support this explanation, since non-mutation drivers such as EGFR overexpression and amplification have been described to be frequent in this malignancy [66].
Conclusions
In summary, we can propose driver mutations based solely on the network of significantly related pairs of mutations. At present, driver predictions rely on interaction and functional networks focusing on complete genes [67, 68, 69]; our approach therefore has the advantage of pinpointing specific mutated positions, which sheds light on the functional role they may be playing. All this proves the relevance of cumulative repositories like COSMIC and cBio, which aggregate enough data sets to search for significant patterns.
Methods
Dataset
The complete COSMIC v75 database was downloaded. The data included 1,178,444 samples from 47 tissues with 193 histologies and 716 sub-histologies; there were 2,812,088 mutations at 2,128,846 sequenced positions. To identify unique mutations, the ENSEMBL transcript IDs of the COSMIC entries were mapped to UNIPROT protein IDs and concatenated with the mutations (e.g. P15056.V600E). Mutations were grouped by position and by type of alteration, so that we could distinguish substitutions, deletions, insertions, complex substitutions and frameshifts.
Mutations of the types Nonstop extension, Substitution coding silent and Whole gene deletion were filtered out. This reduced the dataset to 1,107,460 samples from 1,298 cancer types, with 1,615,508 mutated positions in 19,297 proteins. Among these, 291 positions with driver mutations in kinases, Ras proteins, TP53 or PIK3CA were found. A cancer type was defined as a unique combination of tissue, histology and sub-histology (e.g. lung/carcinoma/adenocarcinoma).
To roughly discard mutations that are unimportant to cancer evolution, positions sequenced in fewer than 1,000 samples were filtered out. The threshold was set empirically at a limit passed by most of the driver loci (289 of 291) but by only 22.60% of the loci without known driver mutations. Cancer types without mutations, or sequenced fewer than 10 times, were also discarded. The dataset for the analysis was thus composed of 1,098,411 samples from 687 cancer types and 365,096 mutated positions (289 of them with driver mutations) in 1,329 proteins.
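The sequencing-depth filter can be illustrated with a minimal Python sketch (the analysis itself was performed in R). Only the 1,000-sample threshold comes from the text; the dictionary `sequenced_counts`, the set `driver_loci`, the function names and the toy counts are hypothetical and serve only as an illustration.

```python
MIN_SAMPLES = 1000  # threshold stated in the text


def filter_loci(sequenced_counts, min_samples=MIN_SAMPLES):
    """Return the loci sequenced in at least `min_samples` samples."""
    return {locus for locus, n in sequenced_counts.items() if n >= min_samples}


def pass_rates(sequenced_counts, driver_loci, min_samples=MIN_SAMPLES):
    """Fraction of driver loci and of non-driver loci that pass the threshold."""
    kept = filter_loci(sequenced_counts, min_samples)
    drivers = [l for l in sequenced_counts if l in driver_loci]
    others = [l for l in sequenced_counts if l not in driver_loci]

    def rate(loci):
        return sum(l in kept for l in loci) / len(loci) if loci else float("nan")

    return rate(drivers), rate(others)


# Toy counts (hypothetical, not taken from COSMIC)
toy_counts = {"locus_1": 5000, "locus_2": 1200, "locus_3": 40, "locus_4": 999}
toy_drivers = {"locus_1", "locus_2"}
print(sorted(filter_loci(toy_counts)))
print(pass_rates(toy_counts, toy_drivers))
```

Scanning `min_samples` over a range of values and comparing the two pass rates is one way such an empirical threshold could be explored.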
Co-dependent pairs of loci
To identify significantly related mutations, contingency tables were constructed for every pair of co-sequenced loci in each cancer type. To avoid very infrequent mutations, only pairs with more than 10 mutated samples (out of 1,000 or more sequenced samples) for each position were retained, giving a starting dataset of 262,375 tables (pairs of mutations to be evaluated) from 135 cancer types. Each table was checked for dependency between loci with Fisher's exact test, and the P-values were FDR corrected. A total of 44,250 pairs of loci were found to be significantly associated in 82 cancer types.
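As an illustration of this step, the following standard-library Python sketch builds a 2x2 table for one pair of loci, computes a two-sided Fisher's exact test and adjusts a list of P-values. It is not the code used in the study, which was written in R. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, and all function names and toy counts are hypothetical.

```python
from math import comb


def contingency_table(statuses):
    """Counts (a, b, c, d) from (mutated_A, mutated_B) booleans of co-sequenced samples."""
    a = sum(1 for x, y in statuses if x and y)          # both loci mutated
    b = sum(1 for x, y in statuses if x and not y)      # only A mutated
    c = sum(1 for x, y in statuses if not x and y)      # only B mutated
    d = sum(1 for x, y in statuses if not x and not y)  # neither mutated
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables that are no more likely than the observed one
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # from the largest P-value down to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy pair (hypothetical): 1,000 co-sequenced samples
toy = [(True, True)] * 30 + [(True, False)] * 20 + [(False, True)] * 10 + [(False, False)] * 940
p = fisher_exact_two_sided(*contingency_table(toy))
print(p, benjamini_hochberg([p, 0.04, 0.5]))
```

In a full run, one P-value per table would be collected across all pairs and cancer types before the adjustment is applied.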
Filtering of probable false positives
Because we feared that low mutational probabilities would allow unrelated pairs to pass the corrected Fisher test, we compared the expected and the observed probabilities for each pair. Specifically, scarcely mutated loci could seem to exclude each other when they are just infrequent, in which case their expected joint probability would be similar to the observed one.
To filter out this kind of false positive, we estimated the 99% confidence interval of each observed joint mutational probability and located the expected joint mutational probability relative to it. If the expected probability fell within the confidence interval of the observed one, the pair was discarded, which is equivalent to filtering out all the dots near the diagonal in Figure 1.
Confidence intervals were calculated using the binomial distribution. For each pair, the distribution B(n, p) was defined with size n equal to the total number of samples with both loci sequenced and probability of success p equal to the observed probability of co-mutation.
After this step, a final dataset of 30,679 pairs of dependent loci, whose relation is not due to their low mutational frequencies, was obtained.
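The filter described in this section can be sketched in Python as follows (the study itself used R). Two points are assumptions rather than statements from the text: the expected joint probability is taken as the product of the two marginal mutation probabilities, and the interval bounds are taken as the 0.5% and 99.5% quantiles of B(n, p) divided by n. Function names and toy counts are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p), computed in log space to avoid underflow."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1.0 - p))
    return exp(log_pmf)


def binom_quantile(q, n, p):
    """Smallest k such that P(X <= k) >= q for X ~ B(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if cumulative >= q:
            return k
    return n


def observed_interval(n, n_ab, level=0.99):
    """Interval for the observed co-mutation probability n_ab / n."""
    p_obs = n_ab / n
    alpha = 1.0 - level
    low = binom_quantile(alpha / 2, n, p_obs) / n
    high = binom_quantile(1.0 - alpha / 2, n, p_obs) / n
    return low, high


def keep_pair(n, n_a, n_b, n_ab, level=0.99):
    """True when the expected joint probability lies outside the observed interval."""
    expected = (n_a / n) * (n_b / n)  # assumption: independence of the two loci
    low, high = observed_interval(n, n_ab, level)
    return not (low <= expected <= high)


# Toy pairs (hypothetical): n co-sequenced samples, n_a and n_b mutated, n_ab co-mutated
print(keep_pair(1000, 50, 40, 2))   # observed close to expected
print(keep_pair(1000, 50, 40, 30))  # observed far above expected
```

When a pair has no co-mutated samples, the interval built this way collapses to zero, so any positive expected probability falls outside it and the pair is kept.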
Assignment of type of dependency
To distinguish pairs that co-mutate from pairs that are mutually exclusive, a priori and a posteriori mutation probabilities were compared. Considering only samples with loci A and B co-sequenced, if locus A has a higher probability of mutation when mutation B is present than in the whole set of samples, A co-mutates with B. If the contrary is true, the mutations exclude each other. However, the a priori mutational probability of a locus can vary randomly, so we estimated its 95% confidence interval as previously described. In this way, each table defines its own thresholds, but a coherent set of simple logic rules is applied over the whole dataset:
If P(mi|mj) < CI(mi) AND P(mj|mi) < CI(mj) → mutual exclusion
If P(mi|mj) > CI(mi) AND P(mj|mi) > CI(mj) → bidirectional co-mutation
Where P(mi|mj) is the conditional probability of mi given mj and CI(mi) is the 95% confidence interval of the unconditional probability of mutation of mi; a probability is below (<) or above (>) an interval when it lies under its lower bound or over its upper bound, respectively.
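The two rules can be written as a short Python sketch (illustrative only; the study used R). The function takes the counts of one table and the two 95% intervals as `(lower, upper)` tuples, which are assumed to have been computed beforehand with the binomial procedure described in the previous section. The label "unclassified" for pairs that satisfy neither rule, as well as the function name and the toy numbers, are hypothetical.

```python
def classify_pair(n_a, n_b, n_ab, ci_a, ci_b):
    """Classify a pair of loci from counts taken over co-sequenced samples.

    n_a, n_b : samples mutated at locus A and at locus B
    n_ab     : samples mutated at both loci
    ci_a     : (lower, upper) 95% interval of the unconditional P(A mutated)
    ci_b     : (lower, upper) 95% interval of the unconditional P(B mutated)
    """
    if n_a == 0 or n_b == 0:
        return "unclassified"  # conditional probabilities are undefined
    p_a_given_b = n_ab / n_b
    p_b_given_a = n_ab / n_a
    if p_a_given_b < ci_a[0] and p_b_given_a < ci_b[0]:
        return "mutual exclusion"
    if p_a_given_b > ci_a[1] and p_b_given_a > ci_b[1]:
        return "bidirectional co-mutation"
    return "unclassified"  # hypothetical label: neither rule applies


# Toy table (hypothetical): 1,000 co-sequenced samples, P(A) = 0.10, P(B) = 0.08
ci_a, ci_b = (0.082, 0.119), (0.064, 0.097)  # illustrative interval bounds
print(classify_pair(100, 80, 60, ci_a, ci_b))
print(classify_pair(100, 80, 0, ci_a, ci_b))
print(classify_pair(100, 80, 8, ci_a, ci_b))
```

Because the intervals are computed per table, the thresholds adapt to the mutational frequency of each locus while the decision logic stays the same for every pair.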
Network of related loci
The 30,679 pairs were divided by type of dependency into 189 pairs of exclusion and 30,490 pairs of co-mutation. The upper aerodigestive tract/carcinoma/squamous cell carcinoma pairs were separated, and two network representations were made in Cytoscape. The one for upper aerodigestive tract squamous cell carcinoma was arranged with the prefuse force-directed layout based on the observed probability of co-mutation and adjusted manually to show all node labels. The other network, representing the rest of the pairs, was also adjusted manually, but starting from an orthogonal layout.
Properties of driver loci
Degree, mutational frequency and number of engaged cancer types were calculated for each locus in the network. The number of engaged cancer types was normalized by the maximum possible, 66 cancers. The distributions of the four variables (degree, mutational frequency, number of engaged cancers and normalized number of cancers) were compared between driver and non-driver loci with Mann-Whitney tests. The relation among the variables was tested with Pearson correlation.
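A minimal Python sketch of these node properties and of the comparison between two groups is given below (the study used R with igraph). Several choices are assumptions: degree is counted as the number of distinct partner loci, and the Mann-Whitney P-value uses a normal approximation without tie or continuity correction, which may differ from the R implementation. Names and toy data are hypothetical.

```python
from math import erfc, sqrt

N_CANCER_TYPES = 66  # maximum number of engaged cancer types stated in the text


def node_properties(pairs):
    """Map each locus to (degree, engaged cancer types, normalized cancer types).

    pairs: iterable of (locus_a, locus_b, cancer_type) tuples.
    """
    neighbours, cancers = {}, {}
    for a, b, cancer in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
        cancers.setdefault(a, set()).add(cancer)
        cancers.setdefault(b, set()).add(cancer)
    return {
        locus: (len(neighbours[locus]), len(cancers[locus]),
                len(cancers[locus]) / N_CANCER_TYPES)
        for locus in neighbours
    }


def mann_whitney_u(x, y):
    """U statistic of x versus y and an approximate two-sided P-value."""
    # Count pairs where x exceeds y; ties contribute one half
    u = sum(1.0 if xi > yi else 0.5 if xi == yi else 0.0 for xi in x for yi in y)
    n1, n2 = len(x), len(y)
    mean = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (assumption)
    z = (u - mean) / sd
    return u, erfc(abs(z) / sqrt(2))


# Toy network (hypothetical loci and cancer types)
toy_pairs = [("A", "B", "c1"), ("A", "C", "c2"), ("A", "B", "c2")]
props = node_properties(toy_pairs)
print(props["A"])

# Toy degree distributions (hypothetical) for driver and non-driver loci
print(mann_whitney_u([5, 6, 7], [1, 2, 3, 4]))
```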
Pairs were classified in a three-way contingency table by their categorical variables: type of relation between the loci, origin of the linked loci, and number of driver positions involved. This table was plotted as a mosaic and tested for dependency with the Mantel-Haenszel test. Afterwards, the segment of the table counting pairs formed by two drivers was tested with a chi-square test.
Loci corresponding to proteins of the Cancer Gene Census and satisfying at least one of the observed properties of drivers were suggested as new possible drivers. The new possible drivers belonging to kinases were mapped to driver hotspot regions or to previously described driver loci. All calculations were performed in R, using the packages igraph, Matrix, doParallel and foreach.
Declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Availability of data and materials
The dataset supporting the conclusions of this article is included within the article and its additional files.
Competing interests
The authors declare that they have no competing interests.
Funding
The work was supported by prestamo BID PICT2014-1787. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Authors' contributions
SO and CMB conceived the idea, generated the results, analyzed and discussed the data, and wrote the manuscript. DJZ contributed the statistics and EMP the web development. MAMV advised on the study, took part in the discussions and wrote the manuscript.
Author details
1Fundación Instituto Leloir, Avda. Patricias Argentinas, C1405BWE Buenos Aires, Argentina. 2Laboratorio de Oncología/Pangaea Oncology, Hospital Universitario Quirón Dexeus, C. Sabino Arana 5, 08023 Barcelona, Spain.
Acknowledgements
We thank Dr. Octavio Arizmendi Echegaray for his help with the statistical design. SOM has a fellowship from CONCYTEG, CMB and DJZ are researchers of the National Research Council (CONICET), and EMP has an “agencia” (MinCyT) fellowship.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].