Abstract
Genomic aberrations in somatic cells are major drivers of cancers. However, cancers are genetically highly heterogeneous, and most driver genes have only moderate or small effect sizes. Existing bioinformatics methods model background mutations poorly and are underpowered to identify driver genes in samples of typical size. Here we propose a novel statistical approach, weighted iterative zero-truncated negative-binomial regression (WITER), to detect cancer-driver genes showing an excess of somatic mutations. This approach has a three-tier framework that improves power in small or moderate samples by accurately modelling background mutations. Compared to alternative methods, this approach detected more significant and cancer-consensus genes in all tested cancers. This technical advance enables the detection of driver genes in TCGA datasets as small as 30 subjects, rescuing genes missed by alternative tools. By introducing an advanced statistical model that accurately estimates the background mutation rate even in small-to-moderate samples, the proposed method is a more powerful approach for detecting cancer-driver genes than current methods and helps provide a comprehensive landscape of driver genes in cancers.
Introduction
It is well known that genomic aberration in somatic cells makes an important contribution to the development of cancers(1). Mutations that confer selective growth advantage to cancer cells are known as cancer-drivers (2) (3); a gene harboring driver-mutations is defined as a cancer-driver gene. It has been established, for example, that non-synonymous mutations in the two famous driver genes TP53 and PIK3CA contribute to many types of cancer (4). However, cancers are known to be highly heterogeneous(5) and many driver genes for most cancers remain to be identified. A full landscape of driver-genes is critical for early diagnosis, identification of effective drug targets, and precise treatments of a cancer (2).
There are generally two existing strategies to detect cancer driver genes, background mutation rate (BMR) and ratiometric. The BMR-based methods evaluate whether a gene has more somatic mutations than expected; examples include MutSigCV (6) and MuSiC(7). The expected number of mutations is estimated from multiple predictors including base context, gene size and other variables of “background” passenger genes. In particular, MutSigCV proposed using three extra variables (DNA replication timing, transcriptional activity and chromatin state in cancer cells) to improve the prediction of the expected background mutation rate. The ratiometric-based methods detect cancer-driver genes according to the composition of mutation types normalized by the total number of mutations in a gene. For instance, the ratiometric 20/20 rule simply assesses the proportion of inactivating mutations (including synonymous mutations) and missense mutations(3). Oncodrive-fm(8) and OncodriveFML(9) integrate mutation functional impact into the evaluation. OncodriveCLUST considers the positional clustering of mutation patterns(10). Recently a method 20/20 plus (11) extended the ratiometric idea in the 20/20 rule and integrated 18 additional features of positive selection to predict cancer-driver genes by a machine learning approach. It also generated statistical p-values of the prediction scores by Monte Carlo simulations.
Although the general principles of both strategies are simple, technical issues remain especially when sample size is not sufficiently large. For example, a recent study (11) found that the statistical p-values produced by existing cancer-driver gene methods did not follow uniform distribution, implying the underfitting of background mutations. Although simulation or permutation can correct the distribution, adequate fitting of background genes is critical for accurate discrimination of true driver genes from noise background genes. This issue will become more severe when the sample is too small to generate a stable model. Moreover, existing statistical tests are generally underpowered to detect driver-genes with small or moderate effect size. This may be a reason why a supervised approach integrating common gene features beyond collected samples was also proposed. However, given the high heterogeneity in cancers (6), adding more common features may not work for unique driver genes; and the trained model for known driver genes may have limited power for detecting new driver genes. Lastly, the predicted cancer-driver genes by different tools do not generally agree with each other(2). It is often laborious and subjectively biased to combine their results. Therefore, more powerful methods are pressingly needed for unraveling a full spectrum of cancer-driver genes.
Here, we describe a new statistical method, weighted iterative zero-truncated negative-binomial regression (WITER), to detect cancer-driver genes by somatic mutations at non-synonymous variants. This approach belongs to the unsupervised category and therefore does not suffer from training bias. The method has a unique three-tier structure to accurately fit the number of somatic mutations in background genes. This structural advance enables it to detect more driver genes in both small and large samples regardless of cancer types. Although it is basically a type of BMR approach, it also adopts the ratiometric idea to use silent mutations as an explanatory variable in the regression model. We then investigated its performance in 34 cancers. A comprehensive landscape of driver-genes was constructed by WITER and analyzed to investigate the common and unique insights across cancers.
Results
Overview of the statistical framework
We propose a unified statistical framework, WITER, for detecting cancer-driver genes by somatic mutations in cancers. The main input is somatic mutations in samples from cancer patients. The output is a table of p-values for the excess of somatic non-synonymous mutations at individual genes; a significant p-value suggests a driver gene that has more somatic mutations because such mutations confer selective growth advantages to cancer cells. Compared to alternative approaches, it has a unique three-tier structure to accurately fit the number of somatic mutations in background genes (see the diagram in Figure 1). In the first tier, it has an advanced model, iterative zero-truncated negative-binomial distribution regression (ITER), to fit the zero-inflation and overdispersion of mutation counts. As shown in the following sections, the model fits the background mutation counts more accurately than other widely-used models [Figure 3b and 3c]. In addition, p-values can be derived directly from the deviance residuals of the regression [Figure 2 and 3a], so that conventional time-consuming simulations are not needed for significance evaluation. By excluding the driver genes from the model, the iterative procedure reduces the distortion of the background mutation rates, so that the relative excess of mutations can be measured more accurately. In the second tier, it can flexibly impose prior weights upon mutation counts to further boost statistical power. The prior weights are generated by a random forest model trained on a large dataset we curated from COSMIC (V83) [Figure S6]. The weighting scheme contributes to the identification of extra significant genes which would be missed by the same framework without weights [Figure 3b and 3c, Table S8]. In the third tier, it allows the integration of independent reference samples from either the same or different cancers to produce a stable background model.
This solves the problem of model instability in small cancer samples. This feature enables WITER to produce statistically valid p-values (Figure S1) and to detect multiple significant driver-genes (Table 1) in datasets with around 30 subjects. The approach and auxiliary functions have been implemented into a user-friendly software tool which is publicly available at http://grass.cgs.hku.hk/limx/witer.
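The zero-truncated negative-binomial distribution at the core of the first tier can be illustrated with a minimal sketch. This is not the authors' implementation: WITER derives p-values from deviance residuals of a fitted regression, whereas the sketch below simply evaluates an upper-tail probability for one gene, assuming its expected count `mu` and dispersion `size` (both hypothetical parameter names) have already been estimated.

```python
import math

def nb_log_pmf(y, mu, size):
    """Log-probability of count y under a negative binomial
    parameterised by its mean `mu` and dispersion `size`."""
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))

def ztnb_pmf(y, mu, size):
    """Zero-truncated negative binomial: P(Y = y | Y >= 1).
    `mu` is the mean of the untruncated distribution."""
    if y < 1:
        return 0.0
    p_zero = math.exp(nb_log_pmf(0, mu, size))
    return math.exp(nb_log_pmf(y, mu, size)) / (1.0 - p_zero)

def ztnb_upper_tail(y_obs, mu, size):
    """P(Y >= y_obs | Y >= 1): a simple stand-in for an excess-of-mutations
    p-value (assumption: not the deviance-residual test used by WITER)."""
    below = sum(ztnb_pmf(y, mu, size) for y in range(1, y_obs))
    return max(0.0, 1.0 - below)

# Illustrative values only: a gene expected to carry 2 mutations.
print(ztnb_upper_tail(10, mu=2.0, size=1.5))
```

Truncation renormalises the probabilities over counts of one or more, so genes with no observed mutations do not influence the fitted distribution.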
Distributions of p-values for background mutation genes
The p-values of the proposed approach approximately followed the uniform distribution. When the overall divergence from uniform p-values was measured as the mean log fold change (MLFC) of Tokheim et al (2016), the MLFCs of MutSigCV and OncodriveFML deviated substantially from zero in all the cancers, suggesting a large deviation from the uniform distribution (Figure 3a). Consistent with the QQ plots, ITER and WITER had very low absolute MLFC (<0.02) in all the 11 cancers (Figure 3a). Moreover, as shown in the QQ plots (Figure 2), the p-values produced by WITER were close to the uniform distribution (corresponding to null hypotheses) in all the cancers. This was also true for the unweighted version, iterative zero-truncated negative-binomial regression (ITER). Non-uniform distribution of p-values is a persistent problem in almost all existing approaches (11). For the comparison, we chose the two alternative approaches which achieved the best performance among 7 widely-used unsupervised tools (11), MutSigCV (6) and OncodriveFML (9). MutSigCV produced deflated statistical p-values in 10 cancers; the exception was melanoma (MEL), for which its p-values were approximately uniform (Figure 2). OncodriveFML also produced deflated statistical p-values in all the 11 cancers (9) (Figure 2).
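As a rough illustration of the MLFC statistic, the sketch below compares sorted observed p-values with their expected uniform quantiles i/n. The exact definition (in particular the use of absolute values and the choice of expected quantiles) is an assumption here and should be checked against Tokheim et al; the `absolute` and `floor` parameters are hypothetical conveniences.

```python
import math

def mlfc(p_values, absolute=True, floor=1e-300):
    """Mean log2 fold change between sorted observed p-values and the
    quantiles i/n expected under a uniform distribution.
    `floor` guards against log(0) for p-values reported as exactly zero."""
    n = len(p_values)
    observed = sorted(max(p, floor) for p in p_values)
    total = 0.0
    for rank, p in enumerate(observed, start=1):
        expected = rank / n
        lfc = math.log2(p / expected)
        total += abs(lfc) if absolute else lfc
    return total / n

# P-values that sit exactly on the uniform quantiles give an MLFC of zero;
# p-values uniformly half as large as expected give an MLFC of one.
print(mlfc([0.25, 0.5, 0.75, 1.0]))
print(mlfc([0.125, 0.25, 0.375, 0.5]))
```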
Significant genes identified in the 11 cancers with relatively large mutation number
We compared the number of significant genes detected by the 4 unsupervised approaches (MutSigCV, OncodriveFML, ITER and WITER) in the 11 cancer datasets. Instead of following the conventional “pancancer” (all cancers) evaluation strategy (11), we made the comparison for individual cancers, a more challenging scenario. The significant genes were determined according to a widely adopted cutoff in cancer-driver gene analysis, FDR<0.1 (11). Among the 4 approaches, WITER always detected the largest number of significant genes in the 11 cancers (Figure 3b). ITER detected the second largest number of significant genes in 10 out of the 11 cancers. MutSigCV ranked third according to the number of significant genes. OncodriveFML detected the smallest number of significant genes in all the 11 cancers, although it also integrated a functional prediction score, CADD (12). Compared to MutSigCV, WITER detected at least 8 extra significant genes in all cancers. The number of extra significant genes increased to at least 13 when compared to OncodriveFML. WITER detected at least 12 more significant genes than ITER in 9 out of the 11 cancers (bladder urothelial carcinoma, breast invasive carcinoma, colorectal adenocarcinoma, uterine corpus endometrial carcinoma, kidney renal clear cell carcinoma, lung adenocarcinoma, melanoma, ovarian carcinoma and stomach adenocarcinoma), suggesting that the prior weights reflecting the frequent mutation potential of variants can substantially improve the statistical power. Note that all the subjects in the testing cancer datasets were excluded from the COSMIC database to avoid circularity when building the prior weights for WITER.
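The FDR<0.1 cutoff can be illustrated with a short sketch. The text does not state which FDR procedure was used; the Benjamini-Hochberg adjustment below is an assumption, and the gene labels are placeholders.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in the
    same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values

def significant_genes(gene_p, fdr=0.1):
    """Genes whose adjusted p-value falls below the FDR cutoff."""
    genes = list(gene_p)
    q_values = benjamini_hochberg([gene_p[g] for g in genes])
    return [g for g, q in zip(genes, q_values) if q < fdr]

# Placeholder genes and p-values, for illustration only.
print(significant_genes({"G1": 0.01, "G2": 0.04, "G3": 0.03, "G4": 0.5}))
```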
Cancer consensus significant genes in the 11 cancers
We further checked which of the significant genes detected by the tools were in the Cancer Gene Census (CGC) list (13). Again, among the 4 methods, WITER always detected the largest number of genes in the CGC list (Figure 3c). It detected at least 10 more CGC genes than ITER for 6 cancers. Nevertheless, ITER was still the second-best method according to the number of CGC genes. It detected more CGC genes than MutSigCV in 10 cancers and more CGC genes than OncodriveFML in all the 11 cancers. OncodriveFML reported the smallest number of CGC genes in 10 cancers. Note that we did not compare the percentage of CGC genes among the total significant genes because OncodriveFML detected too few significant genes in total. Moreover, it should be noted that significant genes beyond the CGC list are not necessarily spurious driver genes, although a higher number of CGC genes is a strong sign of higher power. Take two non-CGC genes as examples. The AJUBA gene (p=8.1E-8 in head and neck cancer) is involved in the regulation of NOTCH/CTNNB1 signaling and is an important driver gene of head and neck cancer (14), (15). TLR4 (p=1.1E-4 in stomach adenocarcinoma) is an important member of the Toll-like receptor (TLR) pathway; mutations in the gene may disrupt innate immune signaling and promote a microenvironment that favors tumorigenesis (16), and the gene was associated with gastric cancer in independent samples (17).
Unique significant genes by individual approaches
We also compared the number of unique significant genes detected by the different tools. WITER detected the largest number of unique significant genes (FDR≤0.1) in the tested cancer types; these genes were insignificant according to, and would be ignored by, MutSigCV and OncodriveFML (Figure 3d). This was also true for the unique significant CGC genes by WITER (Figure 3e). WITER detected in total 267 unique significant genes and 133 unique CGC genes for all 11 cancers. Each cancer had at least 8 unique significant genes (Figure 3d). Colorectal adenocarcinoma (COAD) had the largest number of unique significant genes by WITER, 44, among which 16 genes were CGC genes. For example, CTNNB1 is a well-known driver gene for colorectal adenocarcinoma (18). It had 11 non-synonymous somatic mutant alleles in the colorectal adenocarcinoma samples. WITER calculated a p-value of 1.44E-10 at this gene. The p-values by MutSigCV and OncodriveFML were 0.001 and 0.51 respectively. In contrast, MutSigCV detected no unique significant genes (FDR≤0.1) in 6 cancers and ≤3 unique significant genes in 4 cancers. The only exception was lung adenocarcinoma, for which MutSigCV detected 8 unique significant genes out of the 16 significant genes (FDR≤0.1). Many of the 8 unique significant genes had a long coding region, multiple synonymous mutations or a closed chromatin state. After correcting for these explanatory variables, WITER produced insignificant p-values. For example, FBN2 had 77 non-synonymous or splicing mutant alleles in the lung cancer patients and MutSigCV gave a p-value of 4.28E-07. However, because it had a 9.1 kb coding region, 10 synonymous mutant alleles and a closed chromatin state (scored 9), WITER gave an insignificant p-value of 0.25 for the excess of corrected non-synonymous or splicing mutant alleles. Similarly, OncodriveFML also detected very few unique significant genes in each of the 11 cancers compared to WITER.
In the comparison, we ignored ITER because all significant genes by ITER were also significant by WITER. These results suggest that the WITER has great power to detect many potential driver genes in many cancers, which might be ignored by widely-used alternative methods.
Moreover, we also investigated the enrichment significance of known cancer-related genes in the unique significant genes by WITER. Using the 19198 protein-coding genes as the population size and the 699 CGC genes as the number of success states, we performed enrichment analysis by the hypergeometric distribution test. As shown in Table S2, the unique driver genes by WITER in all the 11 cancers were significantly enriched with the CGC genes (p<5.56E-8). In addition, we also performed a rough in-silico validation for all genes by searching for literature co-mentioning the gene symbols and the specific cancer names in titles and abstracts of papers in the NCBI PubMed database up to July 10, 2018. As it is very time-consuming to check the hit papers for all coding genes, we drew a random gene set of the same size for each cancer and performed Fisher’s exact test. Due to the small random sample size, it was much more conservative than the hypergeometric distribution test. The genes with three or more hit papers were counted. In the random gene sets, most cancers had zero counts. The p-values were <0.01 in 7 cancers, showing a significant enrichment of cancer-related genes in the unique driver genes by WITER compared to the random gene sets. Note that the enrichment tends to be less significant for less-studied cancers. This may be the reason why the significance varied from cancer to cancer. Nonetheless, the two analyses suggested that the unique significant genes by WITER were enriched with many functionally important genes for the corresponding cancers.
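The hypergeometric enrichment test described above can be sketched with exact integer arithmetic. The population size (19198) and number of CGC genes (699) come from the text; applying the test to the 44 unique COAD genes with 16 CGC genes is an illustrative assumption about how the counts enter the test, not a reproduction of Table S2.

```python
from math import comb

def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` genes are sampled without replacement
    from `population` genes, of which `successes` are CGC genes."""
    total = comb(population, draws)
    start = max(observed, 0)
    stop = min(successes, draws)
    hits = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(start, stop + 1))
    return hits / total

# Illustrative call: 16 CGC genes among 44 unique significant genes.
print(hypergeom_upper_tail(19198, 699, 44, 16))
```

With 44 genes drawn at random, about 44 × 699 / 19198 ≈ 1.6 CGC genes would be expected, which is why an observed count of 16 yields a very small tail probability.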
Rescued significant genes in small samples by an alternative tool
We investigated the scenario in which the significant genes missed by a tool in a small sample can be rescued by another tool. We randomly drew 6 sub-samples of half size from the largest dataset, the breast invasive carcinoma dataset, and detected cancer-driver genes by the three tools, MutSigCV, OncodriveFML and WITER. As shown in Table S8, MutSigCV detected 11 significant genes on average in sub-samples with half of the breast invasive carcinoma sample. In the same sub-samples, WITER rescued on average 4 genes which were missed by MutSigCV in the half sample but detected by it in the full sample. Similarly, WITER rescued on average 3 genes which were missed in the half sample but detected in the full sample by OncodriveFML. Using half of the breast invasive carcinoma sample, WITER detected a similar number of significant genes (FDR<0.1) as it did in the full dataset. This number was larger than that detected by MutSigCV and OncodriveFML in the full dataset. Moreover, MutSigCV and OncodriveFML rescued on average fewer than 1 gene which was missed in the half sample but detected in the full sample by WITER. This comparison shows that WITER has enhanced power in small samples to detect driver genes that would be missed by alternative methods due to the small sample sizes.
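The notion of a "rescued" gene used in this comparison reduces to simple set operations. The following sketch uses placeholder gene labels and hypothetical function names; it only illustrates the counting logic, not the actual analysis pipeline.

```python
def rescued_genes(tool_full, tool_half, rescuer_half):
    """Genes that a tool detects in the full sample, misses in the half
    sample, and that a second tool (the rescuer) detects in that half sample."""
    missed_in_half = set(tool_full) - set(tool_half)
    return missed_in_half & set(rescuer_half)

def mean_rescued(tool_full, tool_halves, rescuer_halves):
    """Average number of rescued genes across paired sub-samples."""
    counts = [len(rescued_genes(tool_full, t_half, r_half))
              for t_half, r_half in zip(tool_halves, rescuer_halves)]
    return sum(counts) / len(counts)

# Placeholder gene sets for two sub-samples, for illustration only.
full = {"G1", "G2", "G3", "G4"}
tool_halves = [{"G1"}, {"G1", "G2"}]
rescuer_halves = [{"G1", "G2", "G3"}, {"G3", "G5"}]
print(mean_rescued(full, tool_halves, rescuer_halves))
```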
Performance in 23 cancer datasets with relatively small samples
Another important advantage of WITER is its ability to detect cancer-driver genes in small samples by using reference samples. We applied the approach to 23 cancers with small samples. We deliberately used two reference samples with very low and very high background mutation rates to investigate how sensitive WITER is to the reference samples. The cancer with the low background mutation rate was breast invasive carcinoma, and the one with the high rate was melanoma. Four evaluations were carried out. First, the usage of the reference datasets substantially improved the distribution of p-values, compared to the analysis without reference samples. According to the QQ plots (Figure S1), the p-value distributions of the background genes (FDR>0.1) with reference samples were very close to the uniform distribution. In contrast, the p-values of the background genes without a reference sample were irregular and did not follow the uniform distribution. Second, WITER detected significant genes even for cancers with very small sample sizes (see the results in Table 1). Among the 21 cancers with one or more significant genes (FDR≤0.1), 5 cancers had fewer than 50 subjects, e.g., B-cell lymphomas (n=26), small cell lung carcinoma (n=30), and cervical carcinoma (n=37). Third, the background mutation rate of the reference sample appeared to have a simple influence on the number of significant genes, i.e., on statistical power. As expected, the reference sample with the low background mutation rate led to more significant genes than the one with the high rate. Moreover, we noted that almost all genes that were significant according to the high background mutation rate reference sample were also significant according to the low background mutation rate reference sample. Therefore, false positive findings can be easily controlled in practice by using a reference sample with a high background mutation rate, although this may increase the false negatives. Nevertheless, the percentage of overlapping significant genes was generally high.
The four cancers with at least 15 significant genes (acute myeloid leukemia, prostate adenocarcinoma, pancreatic adenocarcinoma and low-grade glioma) had 100%, 93%, 83% and 67% of their significant genes overlapping between the breast invasive carcinoma and melanoma reference samples, respectively. Finally, WITER again detected many more significant genes than ITER (Table S3). WITER detected 5 to 23 more significant genes than ITER in 9 cancers regardless of the reference sample. These results suggest that WITER is also powerful for datasets with small samples and that the detected significant genes are not very sensitive to the reference datasets.
It should also be noted that the extra significant genes according to the low background mutation rate reference sample are not necessarily false. For instance, MYCN was a significant driver gene of neuroblastoma based on the breast invasive carcinoma reference (p=5.06E-8) but insignificant based on the melanoma reference (p=0.0012). In fact, MYCN is a well-known driver gene of neuroblastoma (19). Nevertheless, to rigorously reduce false positive results, we used the conservative results, i.e. significant genes according to the melanoma reference sample, for the subsequent analysis.
Analysis of explanatory variables for predicting background somatic mutations
We further investigated the contribution of the 6 explanatory variables to the prediction of background mutations in the regression models (see coefficients and p-values in Table 2). The coding region length and the number of mutant alleles at synonymous variants of a gene were the top two explanatory variables in terms of their statistical significance. Their p-values were extremely small in all the tested cancers. As expected, a background gene having a longer coding region and more synonymous variants (20) tended to have a larger number of mutant alleles at non-synonymous and splicing variants, n. Interestingly, the significant p-values at both explanatory variables under the same model implied that they contributed independently, although they were also correlated (Spearman correlation≈0.4-0.5 across cancers). The replication time (measured in HeLa cells) was also positively related with n in most of the cancers. This is consistent with the biological assumption that later replication leads to more somatic mutations (21) (22). The coefficients of the constraint missense Z scores (23) were also positive in most of the cancers, suggesting that a gene with high de novo mutation potential in germline cells tends to have more somatic mutations as well. Two explanatory variables, expression (averaged across 91 cell lines in the Cancer Cell Line Encyclopedia) and HiC (measured from HiC experiments in K562 cells), had negative coefficients. This is consistent with the findings of Lawrence et al. (2013), in which genes with lower expression tended to have more somatic mutations (22). The negative coefficient of HiC implied that a gene with more densely packed DNA also tended to have fewer somatic mutations in cancer cells (24).
The zero-truncated negative binomial model outperforms other models
We also compared the performance of the zero-truncated negative binomial model with three alternative widely-used models for fitting the mutation counts: the Poisson distribution model, the negative binomial distribution model and the zero-truncated Poisson distribution model. The zero-truncated negative binomial model had the smallest Akaike information criterion (AIC) values in all the 11 cancers, suggesting it is the best-fitting model for the counts of somatic mutations among the four models (Table 3). The zero-truncated Poisson distribution was the second-best model, although its averaged AIC value was still 3364 larger than that of the zero-truncated negative binomial distribution. For both the negative binomial distribution and the Poisson distribution, the zero-truncated versions were much better than the original versions. For the negative binomial distribution, the averaged AIC value in the 11 cancers of the zero-truncated version was 3267 smaller than that of the un-truncated version. The averaged AIC value of the zero-truncated Poisson distribution was 819 smaller than that of the un-truncated Poisson distribution. This implies that it is critical to exclude the influence of the zero counts when constructing a regression model. A well-fitted model for mutation counts at background genes leads to more accurate residuals for evaluating the excess of mutations in a gene.
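The effect of zero truncation on AIC (AIC = 2k - 2 ln L, with k parameters and maximised likelihood L) can be illustrated on a toy example. The sketch below is an assumption-laden simplification: it fits intercept-only Poisson and zero-truncated Poisson models to made-up counts rather than the regression models compared in Table 3.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller values indicate a better fit."""
    return 2 * n_params - 2 * log_likelihood

def poisson_loglik(counts, lam):
    return sum(y * math.log(lam) - lam - math.lgamma(y + 1) for y in counts)

def zt_poisson_loglik(counts, lam):
    # Renormalise by P(Y >= 1) = 1 - exp(-lam); expm1 keeps this stable.
    log_norm = math.log(-math.expm1(-lam))
    return poisson_loglik(counts, lam) - len(counts) * log_norm

def fit_zt_poisson(counts, tol=1e-12):
    """Maximum-likelihood rate of a zero-truncated Poisson, found by
    bisection on lam / (1 - exp(-lam)) = mean. Requires mean > 1."""
    mean = sum(counts) / len(counts)
    lo, hi = 1e-9, mean
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid / (-math.expm1(-mid)) < mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Made-up counts with no zeros, as for genes carrying at least one mutation.
counts = [1, 1, 2, 1, 3, 1, 2, 5, 1, 1]
lam_poisson = sum(counts) / len(counts)      # Poisson MLE is the sample mean
lam_truncated = fit_zt_poisson(counts)
print(aic(poisson_loglik(counts, lam_poisson), 1))
print(aic(zt_poisson_loglik(counts, lam_truncated), 1))
```

When the data contain no zeros, the truncated model always attains the lower AIC in this setting, because an untruncated model spends probability mass on a zero count that is never observed.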
The numbers of significant genes are more related with the number of mutations than sample size
We also investigated factors influencing the number of significant genes detected by WITER among the 34 cancers, which indicates the factors affecting power in real data. The number of significant genes was highly related with the number of somatic variants. In a linear prediction model, the number of somatic variants predicted the number of significant genes well, with a coefficient of determination (R²) of 0.36 (Figure S2). According to the prediction model, 57,000 somatic variants were needed to detect 30 significant genes. Because mutation rates differ among cancers, the sample size corresponding to this number of mutations varies from cancer to cancer. Given the ratio of somatic variant number to sample size (Table S6), over 900 samples are needed to accumulate 57,000 variants in breast invasive carcinoma, kidney renal clear cell carcinoma, and ovarian serous cystadenocarcinoma. In contrast, fewer than 250 samples are sufficient for four cancers: lung adenocarcinoma, melanoma, lung squamous cell carcinoma and bladder urothelial carcinoma. Compared to the number of somatic variants, sample size had less influence on the number of significant genes. In a linear regression model, the coefficient of determination of sample size was only 0.17 (Figure S3). These results imply that the power of WITER may be determined by both sample size and somatic mutation rate.
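The sample-size arithmetic above follows directly from the ratio of somatic variants to samples. In the sketch below, the target of 57,000 variants comes from the text, while the per-sample variant rates are hypothetical values chosen for illustration; the actual ratios are those in Table S6.

```python
import math

def samples_needed(target_variants, variants_per_sample):
    """Smallest number of samples expected to accumulate the target
    number of somatic variants at a given average rate per sample."""
    if variants_per_sample <= 0:
        raise ValueError("variants_per_sample must be positive")
    return math.ceil(target_variants / variants_per_sample)

# Hypothetical rates: a low-mutation-rate and a high-mutation-rate cancer.
print(samples_needed(57000, 60))
print(samples_needed(57000, 250))
```

Equivalently, needing more than 900 samples corresponds to an average of fewer than about 63 variants per sample (57,000 / 900), whereas needing fewer than 250 samples corresponds to more than 228 variants per sample (57,000 / 250).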
The comprehensive landscape of driver-genes at 32 different cancers
WITER detected one or more significant genes in 32 cancers according to FDR<0.1. The total number of unique genes was 247 (see details in the Supplementary Excel File 1). Seventy-six genes occurred in two or more cancers. As expected, TP53 was the most common significant gene (in 27 cancer types), followed by PIK3CA, KRAS, FBXW7, NRAS, CTNNB1 and BRAF, each of which was associated with 10 or more cancer types. Four cancers had over 40 significant genes: colon adenocarcinoma (COAD), uterine corpus endometrial carcinoma (UCEC), melanoma (MEL) and stomach adenocarcinoma (STAD). Most of the predicted driver genes have been previously reported for the corresponding cancers. Interestingly, multiple PCDHA genes were significant in five cancers. Although the significance at multiple genes was probably caused by their highly overlapped coding regions, it at least suggested that the PCDHA gene family is associated with these cancers. PCDHA genes encode a family of cadherin-like cell surface proteins for cell-cell adhesion. There have been no studies showing that their somatic mutations contribute to tumorigenesis. However, DNA hypermethylation of PCDHAs has been detected in multiple cancers including prostate cancer (25) and small-cell lung cancer (26). In the in-silico validation in NCBI PubMed, 21 cancers had 70% of significant genes with hit papers (summarized in Table S4).
Cancer clusters according to overlapped significant genes
According to the overlapped significant genes (Table S7), cancers were clustered into groups (Figure 4). Consistent with a recent study (2), some cancers in a group had either a similar tissue or a similar cell of origin. One group contained 4 blood cell related cancers: multiple myeloma (MM), diffuse large B-cell lymphoma (DLBCL), chronic lymphocytic leukemia (CLL) and acute myeloid leukemia (LAML). DLBCL and LAML had a uniquely overlapped gene, EZH2, which has been widely studied for both diseases (27), (28). In another group, two nervous system related cancers, low grade glioma (LGG) and glioblastoma multiforme (GBM), had 7 overlapped significant genes and formed a sub-group. The two female cancers, uterine corpus endometrial carcinoma (UCEC) and breast invasive carcinoma (BRCA), had 14 overlapped significant genes and formed a sub-group. Moreover, there were also multiple sub-groups which did not appear so closely related biologically. For example, in one group, lung squamous cell carcinoma (LUSC) and head and neck squamous cell carcinoma (HNSC) had 9 overlapped genes and formed a sub-group. Multiple studies have suggested that the two types of tumors have similar pathological features (29), (30). In another group, ovarian serous cystadenocarcinoma (OV) and bladder urothelial carcinoma (BLCA) had 8 overlapped genes. Prostate adenocarcinoma (PRAD) and pancreatic adenocarcinoma (PAAD) had as many as 15 overlapped genes and formed a sub-group. These high overlapping patterns imply a pathogenic connection between different cancers, although larger samples and more experiments are needed to investigate the possible mechanistic link.
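Grouping cancers by shared significant genes starts from a pairwise overlap table. The clustering method behind Figure 4 is not specified in this section, so the sketch below only shows the overlap counting and picks the most similar pair; cancer and gene labels are placeholders.

```python
from itertools import combinations

def pairwise_overlap(sig_genes):
    """Number of shared significant genes for every pair of cancers.
    `sig_genes` maps a cancer label to its set of significant genes."""
    return {(a, b): len(set(sig_genes[a]) & set(sig_genes[b]))
            for a, b in combinations(sorted(sig_genes), 2)}

def closest_pair(sig_genes):
    """The pair of cancers sharing the most significant genes, i.e. the
    first pair an agglomerative clustering on overlap would merge."""
    overlaps = pairwise_overlap(sig_genes)
    return max(overlaps, key=overlaps.get)

# Placeholder data, for illustration only.
toy = {
    "CancerA": {"G1", "G2", "G3"},
    "CancerB": {"G2", "G3", "G4"},
    "CancerC": {"G5"},
}
print(pairwise_overlap(toy))
print(closest_pair(toy))
```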
Genes significant only in an individual cancer
Besides the overlapped genomic signatures for clustering cancers, it is also interesting to find out the unique significant genes of a cancer for the characterization. Among the 34 cancers, we found 23 cancers having one or more unique significant genes (See details in Table 4 and Table S5). Cancers with more significant genes tended to have more unique significant genes, implying their high heterogeneity. For instance, for the 7 cancers with over 30 significant genes, each had over 10 unique significant genes (See the cancer names in Table 4). The numbers of hit papers in the NCBI PubMed database are summarized in Table 4 and the detailed PubMed IDs of hit papers are listed in Table S5.
Take colon adenocarcinoma as an example; it had 20 unique significant genes. Three genes (CXCR4, TCF7L2 and GNAS) had over 10 hit papers, suggesting that they are well-studied genes for colon cancer. For instance, there are at least 100 papers mentioning the relation of CXCR4 with colon adenocarcinoma. CXCR4 encodes a CXC chemokine receptor specific for stromal cell-derived factor-1. A very recent study suggested that the level of CXCR4 can determine the effects of ALDH1A3 on in vitro proliferation and invasion in colon cancer (31). Zheng et al suggested CXCR4 may play a key role in colorectal adenocarcinoma progression via the mediation of tumor cell adhesion (32). In the literature, CXCR4 has also been associated with lymphoplasmacytic lymphoma (33). However, among the 34 collected cancers, 31 had clearly insignificant p-values (p>0.18); apart from colon adenocarcinoma, the only exceptions were multiple myeloma (p=0.0025) and lung squamous cell carcinoma (p=0.046). These results suggest that mutant CXCR4 may confer relatively larger susceptibility to colon cancer than to most other cancers. The gene TCF7L2 encodes transcription factor 7-like 2/transcription factor 4, which plays a key role in the Wnt/β-catenin signaling pathway (34) and was reported to be associated with colon adenocarcinoma (35). Similarly, except for a suggestively significant p-value in stomach cancer (p=8.47E-4), it had clearly insignificant p-values in the other 32 cancers (p>0.42), although it was also reported to be associated with other cancers, such as breast cancer (36). The gene GNAS encodes the guanine nucleotide binding protein (G protein) alpha stimulating activity polypeptide complex. In the Human Protein Atlas (HPA, http://www.proteinatlas.org/ENSG00000087460-GNAS/tissue) database, this gene has been categorized as a cancer-related gene (see the PubMed IDs of related papers in Supplementary Table 4). In addition, 7 genes had one or several hit papers related to colon adenocarcinoma.
For example, the unique significant gene PCBP1 (p=6.17e-07) had two hit papers. One paper suggested that PCBP1 was a molecular marker of resistance to oxaliplatin (a standard treatment for colorectal adenocarcinoma) and a promising target for colorectal adenocarcinoma therapy (37). The other paper suggested that PCBP1 was responsible for stabilizing gastrin mRNA, which was highly expressed in colorectal adenocarcinoma (38). PCBP1 also represses autophagy-mediated cell survival, and inhibition of tumor cell autophagy together with PCBP1 upregulation may be an effective therapeutic strategy for colon tumors with low PCBP1 expression (39). LIFR (p=4.90e-04) had 4 hit papers and encodes a protein that belongs to the type I cytokine receptor family. One of the studies used a meta-analysis of public cancer methylome data to verify the colon cancer specificity of LIFR promoter methylation (40). Kim et al suggested that a missense mutation of LIFR, rs3729740, may be useful as a biomarker for predicting whether metastatic colorectal adenocarcinoma patients are sensitive to relevant target regimens (41).
Five cancers had only one unique significant gene: chronic lymphocytic leukemia (CLL), cervical carcinoma (CESC), multiple myeloma (MM), rhabdoid tumor (RHAB) and thyroid carcinoma (THCA). The genes of two cancers (RHAB and CLL) had multiple hit papers. The unique significant gene of RHAB, SMARCB1, had as many as 100 hit papers. SMARCB1 encodes part of a complex that relieves repressive chromatin structures to allow the transcriptional machinery to access its targets effectively. It is a known tumor suppressor gene, and its mutations have been associated with malignant RHAB (42). After first being discovered in RHAB, mutant SMARCB1 was subsequently found in multiple cancers (e.g., renal medullary carcinoma) (43). Almost all the cancers with mutant SMARCB1 were characterized by the presence of ‘rhabdoid cells’ featuring large vesicular nuclei and large paranuclear filamentous cytoplasmic inclusions (44). The gene FGFR1 for astrocytoma had 16 hit papers. FGFR1 encodes a fibroblast growth factor receptor. Studies suggested genomic alterations in FGFR1 can account for most pathogenic alterations in low-grade neuroepithelial tumors, including pilocytic astrocytomas (45). The unique significant gene of CLL, MYD88 (p=1.34E-09), had 40 hit papers. MYD88 encodes a cytosolic adapter protein, an essential signal transducer in the interleukin-1 and Toll-like receptor signaling pathways (46). Except for a suggestively significant p-value in diffuse large B-cell lymphoma (DLBCL) (p=8.47E-4), it had non-significant p-values in the other 32 cancers (p>0.42). In fact, many studies have suggested MYD88 as a driver gene for the two cancers (47), (48). The single unique genes of the three other cancers have had no hit papers so far in PubMed and are subject to validation in the future.
Pathway analysis of driver genes among multiple cancers
We performed pathway enrichment analysis by DAVID 6.7 (https://david.ncifcrf.gov/, Figure S7) among the 8 cancers that had more than 10 significant driver genes. Two pathways, the ErbB signaling pathway and Neurotrophin/Trk signaling, were enriched by the predicted driver-genes in most cancers. The ErbB signaling pathway was significant in all the 8 cancers. The ErbB family of receptor tyrosine kinases (RTKs) is involved in intracellular signaling pathways that regulate diverse biologic responses, including proliferation, differentiation, cell motility and survival (49). Several well-known cancer pathways, such as the MAPK pathway and the PI-3K pathway, are downstream of the ErbB receptors (50). This result suggests that the ErbB signaling pathway may have a common driver role in the genesis of many tumors. The neurotrophin signaling pathway was significant in 6 cancers. Neurotrophin/Trk signaling is regulated by connecting a variety of intracellular signaling cascades, which include the MAPK pathway, the PI-3K pathway, and the PLC pathway, transmitting positive signals like enhanced survival and growth (51). Therefore, Neurotrophin/Trk signaling may be commonly involved in the development of multiple tumors. Another interesting pattern was that MEL (melanoma) and STAD (stomach adenocarcinoma) had many shared pathways although they only had 10 shared predicted driver genes. Quite a few of the shared pathways are related to immune response, such as the Chemokine signaling pathway, the Fc epsilon RI signaling pathway, and Natural killer cell mediated cytotoxicity (52). Besides, another shared pathway, focal adhesion, plays essential roles in important biological processes including cell motility, proliferation and differentiation (53). The shared pathways provide interesting clues to the common pathogenesis of cancers, which remain to be investigated by more experiments.
Discussion
Accurately modeling counts of somatic mutations at background genes in small samples has long been a fundamental technical challenge in the genomic characterization of cancer-driver genes (2, 11). The proposed approach, WITER, has four unique advantages to address this issue. First, it has an advanced model, zero-truncated negative binomial regression, to fit the number of somatic mutations at background genes. In small samples, one often sees an inflation of zero-mutation genes and overdispersion of mutation counts. In particular, the inflated zero values make it difficult to fit the distribution of genomic counts by conventional distributions. The zero-truncated negative binomial distribution circumvents both the zero inflation and the overdispersion issues. This is also the reason why the zero-truncated negative binomial model always achieved the minimal AIC among four alternative models. Moreover, the deviance residuals in the regression model lead to statistically valid p-values for rapid analysis. This solves the common problem of alternative methods, in which time-consuming simulation or permutation is needed to obtain valid p-values (Figure 3b and c) for hypothesis tests. Second, the iteration of the regression diminishes the influence of driver genes on the background mutation models. The progressive exclusion of likely driver genes results in a “purer” background mutation model, in which the contribution of somatic mutations from driver genes becomes less prominent. Third, the method has the advantage of using an independent sample as reference to boost statistical power. When the sample size is small, there will be a limited number of mutations and the resulting model for background genes will be unstable. This may be a common problem of existing cancer-driver gene tests. The usage of a reference sample alleviates the problem of small samples. 
More importantly, we found the number of significant genes detected by WITER was generally not sensitive to the reference samples in most cancers (Table 1). Finally, it can impose prior weights to treat potential driver mutations and passenger mutations differently. Due to the iterative design, the resulting model will be fitted mainly by the passenger mutations.
The weighting scheme contributed much to the finding of extra significant genes. In the real data analysis of 34 cancers, the weighted version (WITER) always detected more significant genes and cancer-consensus genes than the unweighted version (ITER) and two other widely-used methods (MutSigCV and OncodriveFML). Note that OncodriveFML also integrated functional impact scores (e.g., CADD). We have also demonstrated that WITER and ITER had similar and valid p-value distributions (Figure 3a), implying that the imposed weights do not statistically invalidate the p-values. In the present study, we simply used the predicted highly frequent (n>15) mutation potential in the COSMIC database as prior weights, with the assumption that highly frequent somatic mutations in cancer cells are more likely to be driver-mutations. Although it is hard to say that the assumption works for every somatic mutation, the prior weights substantially enhanced the power in all cancers (Figure 3b and c). Theoretically, this property should be applicable to other types of prior weights. The more accurately the weights reflect the probability of being a cancer driver-mutation, the more power WITER will gain.
We compared the proposed method with two widely-used and well-performing approaches(11), both of which belong to the unsupervised category. Another category of methods is the supervised approaches for detecting cancer driver genes. According to Tokheim et al (2016)(11), the supervised method 20/20plus outperformed the unsupervised methods (including MutSigCV and OncodriveFML) in terms of p-value distributions and the number of significant genes. However, a supervised strategy has three limitations. First, it is by nature biased toward the training samples(54). If the training sample is not representative of all samples, the trained model will have low power for new samples. This would be particularly true for cancers because of their high genetic heterogeneity(5). Second, 20/20plus also used many common genomic features of a gene (e.g., evolutionary conservation, predicted functional impact of variants, and gene interaction network connectivity) in the prediction (11). Although the usage of common genomic features adds information to prioritize common cancer-driver genes, it also runs the risk of diluting the information in the local sample for identifying unique cancer driver genes, which would be important for precision diagnosis and treatment of the tested cancers. Finally, 20/20plus resorted to a time-consuming permutation procedure to generate p-values for statistical tests. In contrast, WITER and ITER are much faster than 20/20plus because they calculate p-values directly. Nevertheless, we also made additional comparisons between WITER and the 20/20plus approach in the 11 cancers. In 4 cancer datasets, the p-value distributions of background genes produced by WITER were slightly closer to the uniform distribution than those produced by 20/20plus (see QQ plots in Figure S4). WITER also detected more significant and cancer-consensus genes in 6 out of the 11 cancers (see details in Figure S5) and rescued more genes missed by other tools (see details in Table S8). 
These results suggest WITER may have slightly better performance than 20/20plus generally.
Applying the powerful approach, WITER, we generated a landscape of driver genes in 32 cancers. Although it would be more informative if samples were larger, the landscape has already shown some common and unique patterns of cancers. According to the overlapping significant genes, we saw many cancer subgroups, for example UCEC and BRCA. Although the underlying mechanism of common driver genes between the different cancers remains elusive, highly overlapping genes in these subgroups are unlikely to occur by chance. Identifying the common causes of a subgroup of cancers may help find the pathogenic and metastatic relationship of the cancers and facilitate the development of common treatments. On the other hand, the unique significant genes in the landscape have the potential to characterize individual cancers. There are 24 cancers with one or more unique significant genes. Although some significant genes of a cancer may no longer be unique after the sample size is increased, uniqueness may at least imply a relatively high susceptibility of the gene in the reported cancer, for example SMARCB1 for rhabdoid tumor. Clearly, some of these unique significant genes will be very helpful for characterizing the tumor types, for example MYD88 for lymphoma (55), which is important for precision diagnosis and treatment of the tumors.
Methods and Materials
The unified statistical framework
The unified statistical framework has a three-tier structure to examine driver genes by using somatic mutations in cancer cells (See the diagram in Figure 1). The first tier is an iterative zero-truncated negative-binomial regression which estimates expected non-synonymous and splicing mutation counts of a gene under background mutation model. The second tier is a weighting scheme to generate and integrate prior weights for prioritizing variants of high somatic mutation potential in cancer samples. The third tier is a schedule of adopting independent reference samples to stabilize the regression model in small samples. These methods work from different angles to improve the model of background mutations in passenger genes for a more powerful evaluation of driver genes.
Tier I: The iterative zero-truncated negative-binomial regression
We proposed an approach, ITER, to estimate somatic mutation counts of each gene in the genome. The difference between the observed and the estimated mutation counts of a gene measures the excess of somatic mutations at that gene in a cancer. The mutation types of interest are non-synonymous mutations and splicing mutations, under the assumption that a gene with a significant excess of these types of mutations may confer selective growth advantage in cancer as a driver gene(6). Denote the mutant allele count at a non-synonymous or splicing variant j in a background gene i as ci,j, and the total mutant allele count over the mi variants in this gene as yi:
$$y_i = \sum_{j=1}^{m_i} c_{i,j}.$$
We assume yi follows a negative binomial (NB) distribution(56):
$$y_i \sim \mathrm{NB}(\mu_i, \theta),$$
where μi is the expected number of mutations and θ is a dispersion parameter. The probability mass function (PMF) is
$$f_{\mathrm{NB}}(x; \mu_i, \theta) = \frac{\Gamma(x+\theta)}{\Gamma(\theta)\, x!} \cdot \frac{\mu_i^{x}\, \theta^{\theta}}{(\mu_i+\theta)^{x+\theta}},$$
where Γ( ) is the gamma function and x=0,1,2,….
As somatic mutation is a rare event, many genes have no somatic mutations in a sample of typical size. While the negative binomial model includes a probability mass at x=0, this is often much less than the proportion of genes with no somatic mutations in real data. This inflation of zeros makes it very difficult to fit the negative binomial distribution to the counts of somatic mutations. Therefore, we proposed to use a zero-truncated negative binomial (TNB) distribution to model the mutant allele counts of background gene i. The PMF of TNB is:
$$f_{\mathrm{TNB}}(x; \mu_i, \theta) = \frac{f_{\mathrm{NB}}(x; \mu_i, \theta)}{1 - f_{\mathrm{NB}}(0; \mu_i, \theta)}, \qquad x = 1, 2, \ldots,$$
where $f_{\mathrm{NB}}(0; \mu_i, \theta) = \left(\theta / (\mu_i + \theta)\right)^{\theta}$.
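The effect of the truncation can be checked numerically. The following is a minimal Python sketch, assuming the standard mean-dispersion parameterization of the NB distribution (mean μ, dispersion θ); the function names and the values of `mu` and `theta` are illustrative and are not taken from the WITER implementation.

```python
import math

def nb_log_pmf(x, mu, theta):
    """Log PMF of the negative binomial with mean mu and dispersion theta."""
    return (math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
            + x * math.log(mu / (mu + theta))
            + theta * math.log(theta / (mu + theta)))

def nb_prob_zero(mu, theta):
    """P(X = 0) under the negative binomial: (theta / (mu + theta)) ** theta."""
    return (theta / (mu + theta)) ** theta

def tnb_pmf(x, mu, theta):
    """PMF of the zero-truncated negative binomial, defined for x = 1, 2, ..."""
    if x < 1:
        return 0.0
    return math.exp(nb_log_pmf(x, mu, theta)) / (1.0 - nb_prob_zero(mu, theta))

def tnb_mean(mu, theta):
    """Mean of the zero-truncated distribution; always larger than mu."""
    return mu / (1.0 - nb_prob_zero(mu, theta))

# Illustrative values only (not estimates from the paper).
mu, theta = 2.0, 0.8
total = sum(tnb_pmf(x, mu, theta) for x in range(1, 2000))
print(f"sum of TNB probabilities over x >= 1: {total:.6f}")
print(f"NB mean: {mu:.3f}, TNB mean: {tnb_mean(mu, theta):.3f}")
```

Renormalizing by 1 − P(X = 0) makes the probabilities over x = 1, 2, … sum to one and raises the mean from μ to μ/(1 − P(X = 0)), which is why the excess of zero-count genes no longer affects the fit.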
Based on the TNB, we constructed a generalized linear regression model to estimate the mutant allele count at non-synonymous or splicing variants in a gene i from 6 covariates:
η = log(μi) = β0 + β1 × [x1, number of mutant alleles at synonymous variants]
+β2 × [x2, length of unique coding region]
+β3 × [x3, constraint score for de novo mutation potential]
+β4 × [x4, expression in cell lines in the Cancer Cell Line Encyclopedia]
+β5 × [x5, DNA replication timing in HeLa cells]
+β6 × [x6, long-range chromatin interactions by HiC in K562 cells],
where log(μi) is the link function and β0, …, β6 are the coefficients.
The number of mutant alleles at synonymous variants was counted in the local samples. The length of the unique coding region was calculated from the gene model defined by a reference gene model database, RefGene. The gene’s constraint scores were from Samocha et al (2014) (23). The last three covariates were adopted from MutSigCV (6). The expression values were averaged across 91 cell lines in the Cancer Cell Line Encyclopedia (CCLE). The replication time of a gene was measured in HeLa cells, ranging from 100 (very early) to 1000 (very late). The chromatin state of a gene was measured from HiC experiments in K562 cells, ranging approximately from −50 (very closed) to +50 (very open). Because some covariates had missing values, a widely-used nonparametric missing value imputation method based on Random Forest, implemented in the R package missForest, was used to impute them. The model is also open to other covariates as long as they can improve the prediction accuracy.
The parameters can be estimated by maximum likelihood with a quasi-Newton method. In our study, we called the maximum likelihood method in the R package countreg (https://r-forge.r-project.org/R/?group_id=522) to estimate the coefficients. The dispersion parameter θ is estimated jointly with the regression coefficients, β0, …, β6. The model is fitted only to genes with non-zero counts.
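For reference, the objective maximized in this step can be written out explicitly. The expression below is a sketch under the standard mean-dispersion parameterization of the NB distribution; the symbols $G^{+}$ (the set of genes with non-zero counts) and $x_{i,k}$ (the value of covariate $x_k$ at gene i) are notation introduced here for illustration.

```latex
\ell(\beta, \theta) = \sum_{i \in G^{+}} \Big[
    \log\Gamma(y_i + \theta) - \log\Gamma(\theta) - \log(y_i!)
    + y_i \log\frac{\mu_i}{\mu_i + \theta}
    + \theta \log\frac{\theta}{\mu_i + \theta}
    - \log\Big(1 - \Big(\frac{\theta}{\mu_i + \theta}\Big)^{\theta}\Big)
\Big],
\qquad
\mu_i = \exp\Big(\beta_0 + \sum_{k=1}^{6} \beta_k x_{i,k}\Big)
```

The last term inside the brackets is the only difference from an ordinary NB regression: it renormalizes each gene's likelihood over the positive counts.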
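With the established model, the logarithm of the expected mutation counts, $\log(\hat{\mu}_i)$, at non-synonymous or splicing variants in a gene i can be calculated by:
$$\log(\hat{\mu}_i) = \hat{\beta}_0 + \sum_{k=1}^{6} \hat{\beta}_k x_{i,k},$$
where $\hat{\beta}_0, \ldots, \hat{\beta}_6$ are the fitted coefficients and $x_{i,k}$ is the value of covariate $x_k$ at gene i.
Given the fitted parameters, the probability of zero mutation at gene i is:
$$\hat{p}_{0,i} = \left(\frac{\hat{\theta}}{\hat{\mu}_i + \hat{\theta}}\right)^{\hat{\theta}}.$$
Under the zero-truncated model, the raw residual at gene i is:
$$r_i = y_i - \frac{\hat{\mu}_i}{1 - \hat{p}_{0,i}}.$$
The deviance residual of the model at gene i is:
$$d_i = \mathrm{sign}(r_i)\sqrt{2\left[ll(\tilde{\mu}_i, \hat{\theta}) - ll(\hat{\mu}_i, \hat{\theta})\right]},$$
where sign(x) is the standard sign function, ll(μ, θ) is the natural logarithm of the likelihood function of the zero-truncated negative binomial distribution, and $\tilde{\mu}_i$ is the estimated mean given the observed count yi and the estimated $\hat{\theta}$ of a saturated model, obtained by solving the following equation:
$$y_i = \frac{\tilde{\mu}_i}{1 - \left(\hat{\theta} / (\tilde{\mu}_i + \hat{\theta})\right)^{\hat{\theta}}}.$$
The deviance residuals are further standardized by the estimated mean, $\bar{d}$, and standard deviation, $s_d$, of the deviance residuals,
$$\acute{e}_i = \frac{d_i - \bar{d}}{s_d}.$$
In real data analysis [Figure 2, 3a and S1], we demonstrated the standard normal distribution can be used to approximate the corresponding p-values of the standardized deviance residual éi:
$$p_i = 1 - \Phi(\acute{e}_i),$$
where Φ(x) is the cumulative distribution function of the standard normal distribution.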
The assumption is that the vast majority of genes are background passenger genes. So, ITER models the expected mutant alleles at somatic non-synonymous or splicing variants under the null hypothesis. A large éi means the observed number of somatic mutations is much larger than the number of mutations expected under the null hypothesis.
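The residual-to-p-value step can be sketched in Python as follows. This is an illustrative sketch only: it assumes the mean-dispersion NB parameterization, solves for the saturated-model mean by bisection (the text does not state which solver was used), and uses made-up counts and fitted means. All function names are hypothetical.

```python
import math
import statistics

def one_minus_p0(mu, theta):
    """1 - P(X = 0) under NB(mu, theta), computed stably for small mu."""
    return -math.expm1(-theta * math.log1p(mu / theta))

def truncated_mean(mu, theta):
    """Mean of the zero-truncated NB distribution."""
    return mu / one_minus_p0(mu, theta)

def tnb_loglik(y, mu, theta):
    """Log-likelihood of one count y >= 1 under the zero-truncated NB."""
    log_nb = (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
              + y * math.log(mu / (mu + theta))
              + theta * math.log(theta / (mu + theta)))
    return log_nb - math.log(one_minus_p0(mu, theta))

def saturated_mu(y, theta):
    """Solve truncated_mean(mu, theta) = y for mu by bisection (needs y >= 2)."""
    lo, hi = 0.0, float(y)  # the truncated mean exceeds mu, so the root is below y
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, theta) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def deviance_residual(y, mu_hat, theta):
    # For y = 1 the saturated mean tends to 0 and its log-likelihood tends to 0.
    ll_sat = 0.0 if y == 1 else tnb_loglik(y, saturated_mu(y, theta), theta)
    ll_fit = tnb_loglik(y, mu_hat, theta)
    raw = y - truncated_mean(mu_hat, theta)
    return math.copysign(math.sqrt(max(0.0, 2.0 * (ll_sat - ll_fit))), raw)

def p_values(counts, mu_hats, theta):
    """Standardize the deviance residuals and convert to upper-tail normal p-values."""
    d = [deviance_residual(y, m, theta) for y, m in zip(counts, mu_hats)]
    d_mean, d_sd = statistics.mean(d), statistics.stdev(d)
    return [0.5 * math.erfc(((x - d_mean) / d_sd) / math.sqrt(2.0)) for x in d]

# Hypothetical observed counts and fitted means; the last gene carries a clear excess.
counts = [1, 1, 2, 1, 3, 2, 1, 4, 2, 40]
mu_hats = [1.0, 1.2, 1.5, 0.8, 2.0, 1.5, 1.0, 2.5, 1.8, 3.0]
theta = 1.5
pvals = p_values(counts, mu_hats, theta)
for y, p in zip(counts, pvals):
    print(f"count={y:3d}  p={p:.4g}")
```

Only the upper tail is used, so a gene with fewer mutations than expected receives a large p-value rather than a small one.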
In order to reduce the distortion caused by driver genes in the null-hypothesis regression model, we proposed to perform the regression under an iterative procedure:
Step 1: perform ITER to calculate p-values for all genes.
Step 2: exclude significant genes by a cutoff corresponding to false discovery rate (FDR)≤0.1.
Step 3: perform ITER to calculate p-values for the retained genes.
Step 4: repeat Steps 2 and 3 until there are no extra significant genes according to FDR≤0.1.
The fitted ITER model in the last iteration is closest to the null hypothesis model and is then used to re-calculate deviance residuals and p-values of all genes (including the ones excluded during iteration).
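The control flow of the iteration can be sketched in Python as below. To keep the example short and self-contained, the regression is replaced by a stand-in scorer (a z-score of log counts against the current background genes); this stand-in is an assumption made purely for illustration and is not the zero-truncated negative-binomial regression. The function names, the `max_iter` guard and the toy counts are likewise hypothetical.

```python
import math
import statistics

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):           # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def iterate_background(n_genes, score_genes, fdr=0.1, max_iter=50):
    """score_genes(background) fits a model on the background gene indices
    and returns p-values for ALL genes."""
    background = list(range(n_genes))
    for _ in range(max_iter):
        pvals = score_genes(background)                          # Steps 1 and 3
        q = bh_adjust([pvals[g] for g in background])
        kept = [g for g, qv in zip(background, q) if qv > fdr]   # Step 2
        if len(kept) == len(background) or len(kept) < 2:        # Step 4: converged
            return pvals, background
        background = kept
    return score_genes(background), background

# Toy data: 40 background-like genes plus two genes with a clear excess.
counts = [3, 4, 5, 4, 3, 5, 6, 4, 3, 4] * 4 + [60, 25]
log_counts = [math.log(c) for c in counts]

def score_genes(background):
    """Stand-in for the regression: upper-tail normal p-value of each log count."""
    values = [log_counts[g] for g in background]
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0))) for x in log_counts]

final_p, background = iterate_background(len(counts), score_genes)
print("genes kept as background:", len(background))
print("p-values of the two excluded genes:", final_p[40], final_p[41])
```

Because the final model is fitted without the excluded genes, their p-values are computed against a background that they no longer inflate, which is the purpose of the iteration.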
Tier II: The weighted iterative zero-truncated negative-binomial regression
We further extend ITER to WITER, which integrates prior weights at variants to boost power. Assume a variant j of gene i has a score, si,j ∈ [0,1], indicating its cancer driver potential. We bin si,j into an integer score, wi,j, by the ceiling function of si,j/0.1, i.e., wi,j = ⌈si,j/0.1⌉. The integer score is then used as the prior weight for the variant. ITER is a special case of WITER in which wi,j = 1 for all variants. The weighted mutant allele count is:
$$\grave{y}_i = \sum_{j=1}^{m_i} w_{i,j}\, c_{i,j}.$$
We now assume the weighted counts ỳi follow a negative binomial (NB) distribution:
$$\grave{y}_i \sim \mathrm{NB}(\grave{\mu}_i, \grave{\theta}),$$
where $\grave{\mu}_i$ is the expected weighted count of mutations and $\grave{\theta}$ is a dispersion parameter of the NB distribution. After replacement of the original counts (yi) with the weighted counts (ỳi), the same iterative zero-truncated negative-binomial regression procedure is carried out to test whether a gene has an excess of weighted mutant alleles at non-synonymous or splicing variants.
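As a small worked example of the binning and weighting, the Python sketch below computes the weighted count for one hypothetical gene. The scores and allele counts are made up, the function names are illustrative, and the rounding step is an implementation detail added here to guard against floating-point error in the division.

```python
import math

def prior_weight(score, bin_width=0.1):
    """Integer weight: ceiling of score / bin_width."""
    # Round first so that values such as 0.3 / 0.1 = 2.9999999999999996
    # or a quotient slightly above an integer do not land in the wrong bin.
    return math.ceil(round(score / bin_width, 9))

def weighted_count(variants):
    """variants: list of (driver potential score, mutant allele count) pairs."""
    return sum(prior_weight(s) * c for s, c in variants)

# One hypothetical gene with three variants.
variants = [(0.05, 3), (0.31, 1), (0.95, 2)]
print([prior_weight(s) for s, _ in variants])   # weights 1, 4 and 10
print(weighted_count(variants))                 # 1*3 + 4*1 + 10*2 = 27
print(sum(c for _, c in variants))              # unweighted count used by ITER: 6
```

In this example the two alleles at the high-scoring variant contribute 20 of the 27 weighted counts, which is how the weighting emphasizes variants with high predicted driver potential.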
In the present study, we built a model to predict high-frequency cancer driver potential to use as prior weights, in the form of a random forest (an ensemble of 500 decision trees) trained on a large cancer somatic mutation database, COSMIC (V83). To avoid circular bias, all subjects (n=7,916) in our collected testing samples of the 34 cancers were excluded from the COSMIC database. We collected 4,320 somatic mutation variants occurring over 15 times in primary cancer tissues to constitute a positive variant set in COSMIC (V83). A negative control variant set containing 258,846 somatic mutation variants was randomly sampled from COSMIC as well. Each of the control variants occurred only once in primary cancer tissues. The predictors at each variant include 19 deleterious or conservation scores from the database dbNSFP v3.5 (57) (e.g., MutationTaster2 (58) and FATHMM (59); see the names of all tools in Supplementary Figure S6). The area under the receiver operating characteristic curve of the random forest model was 79%, which was much better than that of a multivariate logistic regression model and of individual predictors (Figure S6). The random forest prediction scores, s, ranged from 0 to 1. For variants without prediction scores due to missing values, the average score in the gene was used.
Tier III: ITER or WITER with reference samples in analysis for small cancer samples
When the number of somatic variants is small (say <28,000), it is difficult to build a stable regression model. However, note that the key idea of ITER and WITER is to build a prediction model for background passenger genes. When the mutation rates of passenger genes of two cancers are similar, it may be workable to integrate background genes of one cancer into the analysis of the other cancer. We proposed a reference sample strategy for building a stable ITER or WITER model in a small sample dataset. This is carried out in two stages.
At the first stage, the above ITER or WITER is used to produce p-values for excess of somatic mutations at genes in a reference sample that has a sufficient number of variants. Genes with p-values less than a very loose cutoff, say FDR 0.8, are excluded.
At the second stage, the somatic mutations of retained genes are integrated with the local small sample and input into ITER or WITER to build a new regression model. The excess of somatic mutations and corresponding p-values at genes are calculated based on the new model.
Performance comparison with alternative tools
There have been multiple tools for detecting cancer-driver genes(2). According to an evaluation study(11), three tools with relatively better performance, MutSigCV(6), OncodriveFML(9) and 20/20plus(11), were chosen for comparisons in the present study. We compared their p-value distributions and numbers of significant genes with those of ITER and WITER. MutSigCV was developed based on the background mutation rate, while OncodriveFML and 20/20plus were developed based on the ratiometric strategy. According to another classification, MutSigCV and OncodriveFML used an unsupervised strategy to predict cancer driver genes while 20/20plus used a supervised strategy. So, the unsupervised methods were chosen as the main targets for the performance comparison. MutSigCV is a powerful method for detecting genes mutated more often than expected by chance. It used a local regression model to estimate the expected mutant alleles by multiple genomic features of a gene in cancer cells, including its expression level, replication time and 3D chromatin interaction capture (HiC). The online MutSigCV version (1.2) was used through the Broad website (http://genepattern.broadinstitute.org/gp/pages/index.jsf?lsid=MutSigCV). The recommended exome coverage file (https://genepattern.broadinstitute.org/gp/data/xchip/gpprod/shared_data/example_files/MutSigCV_1.3/exome_full192.coverage.txt) and gene covariates file (https://genepattern.broadinstitute.org/gp/data//xchip/gpprod/shared_data/example_files/MutSigCV_1.3/gene.covariates.txt) were used. OncodriveFML is a method designed to estimate the accumulated functional impact bias of tumor somatic mutations in both coding and non-coding genomic regions, based on a simulation process. It used CADD scores to predict mutational impacts. The results were produced according to coding DNA sequence (CDS) regions. The genome reference and CDS files were downloaded from the website (https://bitbucket.org/bbglab/oncodrivefml) as the authors recommended. 
The default parameters of OncodriveFML were used to produce the results. 20/20plus is a machine-learning-based method integrating multiple features to predict driver genes, including sample mutational clustering, evolutionary conservation, predicted functional impact of variants, mutation consequence types, gene interaction network connectivity, etc. It used computer simulation to generate p-values for statistical significance. 20/20plus v1.1.3 was downloaded and installed according to the website tutorial (http://2020plus.readthedocs.io/en/latest/index.html). The necessary files were also collected as the authors suggested (http://probabilistic2020.readthedocs.io/en/latest/tutorial.html#gene-bed-file and http://probabilistic2020.readthedocs.io/en/latest/tutorial.html#pre-computed-scores-optional). The data were analyzed by a pipeline to predict the cancer drivers under the default parameters. 20/20plus took 1.5 hours on average to analyze a dataset on a computer with 12 CPU (1.70GHz) cores and 64G RAM. The number of simulations was 10000.
Evaluation metrics in the performance comparison
We adopted four evaluation metrics for performance comparison: the number of significant genes predicted, the overlap with the Cancer Gene Census (CGC) (13), observed vs. theoretical p-values, and the unique significant genes detected by a tool. The first three were also major metrics in an evaluation framework of cancer driver gene prediction methods(11). The CGC dataset contained 699 manually curated cancer genes by Dec. 16, 2017. The departure of p-values from the uniform distribution was measured by the mean absolute log2 fold change (MLFC) (11). The widely-used cutoff, Benjamini and Hochberg FDR 0.1, was used to report significant genes. A valid statistical test should lead to an MLFC close to zero in background (or passenger) genes. We also used quantile-quantile (QQ) plots to examine the distribution of p-values at the tail of small p-values.
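The MLFC metric can be sketched in a few lines of Python. This sketch assumes the usual definition in which the i-th smallest of n observed p-values is compared with the theoretical uniform quantile i/n; the function name is illustrative and p-values are assumed to be strictly positive.

```python
import math

def mlfc(pvalues):
    """Mean absolute log2 fold change between sorted observed p-values
    and the theoretical uniform quantiles i/n (i = 1, ..., n)."""
    p_sorted = sorted(pvalues)
    n = len(p_sorted)
    return sum(abs(math.log2(p / ((i + 1) / n)))
               for i, p in enumerate(p_sorted)) / n

# Perfectly uniform p-values have no departure from the theoretical quantiles.
uniform = [k / 100 for k in range(1, 101)]
print(mlfc(uniform))

# P-values that are uniformly two-fold too small are off by one log2 unit.
inflated = [p / 2 for p in uniform]
print(mlfc(inflated))
```

Since the fold change is taken on the log2 scale, p-values that are too small and p-values that are too large by the same factor are penalized equally.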
Dataset of somatic mutations
We partitioned a curated full somatic mutation dataset by Tokheim and colleagues (11) into 34 sub-datasets according to the cancer types (see Table S6). Eleven sub-datasets contain 2,800 or more variants and are called relatively larger cancer datasets throughout the paper. Their sample sizes ranged from 142 to 1093. The ratios of variant number to sample size in the 11 cancers ranged from 50 to 327. The 23 other cancers with fewer variants are called relatively smaller cancer datasets. The names, variant numbers and sample sizes of all cancers can be seen in Table S6.
In silico validation by PubMed search
We used the PubMed search function to coarsely validate the relation between significant genes and a specific cancer. The underlying assumption is that papers co-mentioning the gene and the cancer name in the title or abstract are likely to implicate a relation between the gene and the cancer. The more hit papers, the more likely the gene is related to the cancer. This is a quick in-silico validation, although it may be rough. We employed the web application programming interfaces (APIs) of PubMed to execute the search. The search link was http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=&“DiseaseNames(including homonymies)”[tiab]%29+AND+“GeneSymbol (including RefSeq mRNA IDs)” [tiab]. The search returned the PubMed IDs and relevant data of the papers, if available, in Extensible Markup Language (XML).
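A query of this form can be assembled and its response parsed with the Python standard library, as sketched below. The helper names are hypothetical, the example uses the https form of the E-utilities endpoint quoted above, and the XML string is a hand-written stand-in for a real response so that the example runs without network access.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_query_url(disease_names, gene_names):
    """Build an esearch URL requiring a disease name AND a gene name
    in the title or abstract ([tiab]) of a paper."""
    disease = " OR ".join(f'"{d}"[tiab]' for d in disease_names)
    gene = " OR ".join(f'"{g}"[tiab]' for g in gene_names)
    term = f"({disease}) AND ({gene})"
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term})

def parse_hits(xml_text):
    """Return (hit count, list of PubMed IDs) from an esearch XML response."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count", default="0"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, ids

url = build_query_url(["colon adenocarcinoma", "colon cancer"], ["CXCR4"])
print(url)

# Hand-written stand-in for a response; the IDs are placeholders, not real papers.
sample_xml = ("<eSearchResult><Count>2</Count>"
              "<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>")
print(parse_hits(sample_xml))
```

Alternative disease names and gene identifiers are joined with OR inside each group, so a paper counts as a hit when it mentions any name of the cancer together with any identifier of the gene.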
Tool availability
The statistical framework has been implemented into a Java standalone application and is available at http://grass.cgs.hku.hk/limx/witer/.
Contributions
M.L., J.K., L.J., Y.Z. and P.S. conceived the study. M.L. oversaw all aspects of the study. L.J. M.L. and J.K. developed the models. J.Z., S.D. and C.L. performed extensive computational analyses for performance comparison. Y.Z. and K.T. analyzed landscape of cancer-driver genes. M.L. and L.J. wrote the manuscript with input from J.Z. and S.D. All authors edited and approved of the final manuscript.
Competing interests
The authors declare no competing interests.
Supplementary Figures and Tables
Acknowledgements
This work was funded by the National Natural Science Foundation of China (31771401), the Science and Technology Program of Guangzhou (201803010116), the Hong Kong Health and Medical Research Fund (02132236), and the Hong Kong General Research Fund 17124017, 17121414 and TRS T12C-714/14-R. We thank Tokheim and colleagues for sharing the high-quality curated somatic mutations in 32 cancers from multiple resources.
Footnotes
# The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.