Abstract
Localised variation of somatic mutation rates affects diverse functional sequence elements in cancer genomes through poorly understood mutational processes. Here, we characterise the mutational landscape of 640,000 gene regulatory and chromatin architectural elements in 2,421 whole cancer genomes using our new statistical model RM2. This method quantifies differential mutation rates and signatures in classes of genomic elements via genetic, trinucleotide and megabase-scale effects. We report a detailed map of localised mutational processes affecting CTCF binding sites, transcription start sites (TSS) and cancer-specific open-chromatin regions. This includes a pan-cancer indel depletion in open-chromatin sites, a TSS-specific mutational process correlated with mRNA abundance in core cellular and cancer-associated processes, a subset of hypermutated, constitutively active CTCF binding sites involved in chromatin architectural interactions, and an enrichment of signature SBS17b in CTCF sites in gastrointestinal cancers. We also detect genetic driver alterations potentially underlying localised mutation rates, including RAD21 amplifications and BRAF mutations associating with mutagenesis of CTCF binding sites, and SDHA amplifications indicative of frequent lung cancer mutations in open-chromatin sites. Our framework and the catalogue of localised mutational processes provide novel perspectives to cancer genome evolution and its implications for oncogenesis, tumor heterogeneity and cancer driver gene discovery.
Introduction
Genomes accumulate somatic mutations through exposure to exogenous and endogenous mutagens. Subsets of these mutations confer selective proliferative advantages on cells and drive oncogenesis, while most mutations are functionally neutral passengers 1,2. The discovery and validation of driver mutations is a major focus of cancer genomics research 3-5; however, the genome-wide landscape of passenger mutations is also instrumental to our understanding of oncogenesis and tumor evolution 6,7. Somatic mutation rates show complex genomic variation at multiple resolutions 8. In megabase-scale genomic windows, variations in mutation rates are associated with transcriptional activity, chromatin state and DNA replication, as late-replicating and untranscribed regions are often more mutated than regions of early replication and highly expressed genes 9-12. At single base-pair resolution, certain trinucleotides are preferentially mutated through carcinogen exposure, defective DNA repair pathways and aberrant DNA replication 13-15. For example, mutational signatures detected in metastatic tumors are informative of the treatment history of patients 16,17. In concert, these large-scale and nucleotide-level variations contribute to tumor heterogeneity and leave a footprint of tumor evolution and its cell of origin 12,18,19.
Complex variation in mutation rates is also apparent across intermediate genomic resolutions spanning hundreds to thousands of nucleotides. This encapsulates diverse functional genomic elements such as exons, transcription factor binding sites (TFBS) and chromatin architectural elements 8,20. DNA bound by nucleosomes and transcription factors (TFs) shows increased mutation rates in cancer genomes 21-23. Active promoters in melanoma are enriched in UV-induced C>T somatic mutations resulting from differential activity of nucleotide excision repair influenced by DNA-binding of regulatory proteins 24,25. Likewise, DNA-binding sites of the master transcriptional regulator and chromatin architectural protein CTCF (CCCTC-binding factor) are enriched in somatic mutations in multiple cancer types 21,26-28. In contrast, certain genomic elements such as chromatin-accessible regulatory regions 29 and protein-coding exons 30 have been shown to carry relatively fewer mutations due to increased DNA repair activity. While most such non-coding mutations are likely functionally neutral passengers induced by localised mutagenesis, some regulatory elements at the high end of the mutation frequency spectrum may undergo positive selection due to their effects on cancer phenotypes. For example, the mutation hotspot in the TERT promoter creates a TFBS of the ETS TF family that leads to constitutive activation of TERT and enables replicative immortality of cancer cells 31-33. Recent studies have catalogued candidate non-coding driver elements in gene regulatory and chromatin architectural regions of the cancer genome with functional validations of novel elements 34-36 and highlighted the convergence of non-coding mutations on molecular pathways and regulatory networks 37,38.
Thus we need to characterise localised mutational processes to understand the evolution of the somatic genome and the effects of carcinogens and endogenous mutational processes, but also to evaluate the effects of positive selection in the non-coding genome. However, few dedicated computational methods exist to analyse mutation rates at the local resolution of the genome. As a result, there is a lack of large-scale analyses of the local mutation landscape in pan-cancer WGS datasets, leaving the genetic and environmental determinants poorly understood.
Here we developed a new statistical framework that quantifies the activity of mutational processes and signatures on specific classes of non-coding elements of the cancer genome. Our model considers local sequence context, megabase-level somatic mutation rates and genetic covariates to control for variation at the trinucleotide and megabase resolution while isolating site-level effects. We performed a systematic analysis of local mutation rate variation in three classes of gene-regulatory and chromatin architectural genomic elements across approximately 2,500 whole cancer genomes of the ICGC/TCGA Pan-cancer Analysis of Whole Genomes (PCAWG) project 3. We found a pervasive mutation enrichment at these functional non-coding elements that was characterised by specific mutational signatures and by transcriptional and pathway-level activities of these elements in select cancer types. We detected statistical interactions of local mutagenesis and recurrent genomic alterations that suggest potential genetic mechanisms driving the underlying mutational processes. Our computational framework and systematic analysis reveal the diversity of mutational processes in functional non-coding elements of the cancer genome and their roles in somatic genome evolution, drivers of cancer phenotypes and molecular heterogeneity.
Results
A statistical framework for quantifying localised mutagenesis in cancer genomes
We implemented a statistical model, Regression Models for Regionalised Mutations (RM2), to quantify the local activity of mutational processes in whole cancer genomes in elements each spanning dozens to hundreds of nucleotides (Supplementary Figure 1). The model considers a genome-wide set of genomic elements such as TFBSs detected in thousands to hundreds of thousands of loci using chromatin immunoprecipitation with DNA sequencing (ChIP-seq) and similar techniques. The model uses negative binomial regression to evaluate whether the genomic elements of interest are collectively subject to a different mutation rate compared to control sequences upstream and downstream of these elements. Somatic single nucleotide variants (SNVs) and small insertions-deletions (indels) are analysed, however, the model can be extended to somatic structural variant breakpoints and germline variation. The model considers four types of information to evaluate local mutation rates: a) nucleotide sequence content of genomic elements and control sequences representing the potential space for mutagenesis, grouped by 96 trinucleotide signatures and one indel signature (nPosits), b) the counts of observed somatic mutations in the cohort of tumors (nMut) in genomic elements and control sequences also grouped by 96 trinucleotide signatures and one indel signature required to derive mutation rates (triNucMut), c) megabase-scale background mutation rates of elements computed across the cohort of tumors (MbpRate) to account for large-scale mutation correlates such as transcription and chromatin state, and d) an optional binary cofactor (coFac) to stratify tumors based on their genetic makeup (e.g., presence of a driver mutation) or clinical information (e.g., tumor subtype or stage). Genomic elements and upstream and downstream control regions are pooled into a user-defined number of bins of equal size based on their megabase-scale mutation rates (ten bins by default). 
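The pooling step described above can be illustrated with a minimal Python sketch. It is not the RM2 implementation: the record layout (the keys mbp_rate, is_element and counts) and both function names are hypothetical, and the sketch assumes that bins of equal size are formed by ranking regions on their megabase-scale rate. Each region is tagged with a flag separating elements from flanking controls, and nMut and nPosits are summed per bin, mutation class and flag.

```python
from collections import defaultdict

def assign_equal_bins(rates, n_bins=10):
    """Rank regions by megabase-scale rate and split them into n_bins groups
    of (near-)equal size; returns a bin index (0 = lowest rate) per region."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    bins = [0] * len(rates)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(rates)
    return bins

def pool_counts(regions, n_bins=10):
    """Sum nMut and nPosits per (bin, mutation class, is_element) stratum.

    Each region is a dict with hypothetical keys: mbp_rate, is_element,
    and counts = {mutation_class: (n_mut, n_posits)}.
    """
    bins = assign_equal_bins([r["mbp_rate"] for r in regions], n_bins)
    pooled = defaultdict(lambda: [0, 0])
    for region, b in zip(regions, bins):
        for mut_class, (n_mut, n_posits) in region["counts"].items():
            cell = pooled[(b, mut_class, region["is_element"])]
            cell[0] += n_mut      # observed mutations in the stratum
            cell[1] += n_posits   # positions available for that class
    return {key: tuple(value) for key, value in pooled.items()}
```

Each pooled stratum then corresponds to one row of the regression input, with the summed nPosits serving as the offset.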
Elements and flanking control sequences are distinguished using the binary cofactor isElement. The full model is written as follows:
nMut ∼ NegBin(offset(log(nPosits)) + triNucMut + log1p(MbpRate) + coFac + isElement).
To determine whether the mutation rates of the genomic elements differ from the rates of the flanking sequences given trinucleotide-level and megabase-scale covariates, we evaluate the significance of the cofactor isElement using a likelihood-ratio test. Significant and positive coefficients of this cofactor indicate increased mutation rates in genomic elements relative to flanking controls, while negative coefficients indicate a depletion of mutations. Similarly, we can discover potential genetic or clinical interactions with localized activity of mutational processes. Given a binary subgroup classification of tumors (coFac), we evaluate the significance of its interaction with local mutation rates (isElement:coFac). Positive coefficients of the interaction indicate that the mutation rates in a clinical or genetic tumor subgroup are elevated when accounting for the overall differences of the subgroups. Lastly, we extend the mutation rate analysis to subclasses of mutations by allowing only specific classes of mutations to be included in the mutation counts (nMut), for example those of specific DNA strand, transcriptional direction, or COSMIC mutational signatures. We evaluated the performance of our method using simulated datasets, power analysis and parameter variations as described below (Supplementary Figure 2).
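As an illustration of the likelihood-ratio test, the sketch below compares two nested models that differ by a single term, for example the model with and without isElement, or with and without the isElement:coFac interaction. It assumes the log-likelihoods of both fitted models are already available and that the test statistic follows a chi-square distribution with one degree of freedom; the function name is hypothetical and the fitting of the negative binomial models is not shown.

```python
import math

def lrt_pvalue_1df(loglik_null, loglik_full):
    """Likelihood-ratio test for one extra parameter.

    The statistic 2 * (loglik_full - loglik_null) is referred to a
    chi-square distribution with 1 degree of freedom, whose upper tail
    probability equals erfc(sqrt(stat / 2)).
    Returns (statistic, p-value).
    """
    stat = max(0.0, 2.0 * (loglik_full - loglik_null))
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Usage with made-up log-likelihoods of the reduced and full models.
stat, p = lrt_pvalue_1df(loglik_null=-100.0, loglik_full=-98.0)
```

The sign of the fitted coefficient, which this sketch does not compute, then distinguishes enrichment from depletion.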
Localised mutation rates in gene-regulatory and chromatin architectural elements of cancer genomes
To study localised mutation rates in gene-regulatory and chromatin architectural elements of the genome, we analysed the pan-cancer dataset of 2,514 whole cancer genomes of the PCAWG project 3. We individually analysed 25/35 cancer types with at least 25 samples as well as the pan-cancer set containing all tumors of the 35 types (Supplementary Figure 3A). 69 hypermutated tumors were excluded to avoid confounding effects. Three classes of genomic elements were analysed: 119,464 CTCF binding sites conserved in at least two cell lines in the ENCODE project 39, 37,309 transcription start sites (TSS) of protein-coding genes from the Ensembl database (GRCh37), and a pan-cancer set of 561,057 open-chromatin sites detected across 410 primary tumors of The Cancer Genome Atlas (TCGA) project (i.e., ATAC-seq sites) 40 (Supplementary Table 1A). In total, the analysis included 640,023 unique loci representing 3.9% (120.5 Mbps) of the human genome. In addition to total mutations, we grouped mutations by DNA strand (Watson or Crick), transcription status (forward, reverse, bidirectional or absent), reference and alternative nucleotide pairs, and COSMIC mutational signatures of single base substitutions (SBS) inferred in the PCAWG study 14. Indel mutations were pooled with SNVs and also analysed separately. To obtain a conservative analysis, we excluded a small fraction of tumors as outliers (31 or 1.3%) where even single-sample RM2 analysis revealed highly significant differences in local mutation rates in any of the three classes of sites (FDR < 0.001) (Supplementary Figure 3B). The final pan-cancer analysis studied 23.0 million mutations including 1.61 million indels detected in 2,421 genomes of 35 cancer types.
We first focused on the mutational profiles of CTCF DNA-binding sites. Overall mutation rates in CTCF binding sites were significantly higher in liver hepatocellular carcinoma (Liver-HCC) (RM2 FDR = 4.3 × 10−14, fold-change (FC) = 1.10), esophageal adenocarcinoma (FDR = 1.1 × 10−20, FC = 1.19) and stomach adenocarcinoma (FDR = 8.3 × 10−14, FC = 1.19). The pan-cancer cohort also showed a significant enrichment, likely due to pooled effects of certain cancer types (FDR = 8.1 × 10−27, FC = 1.07). These initial results confirm earlier reports of elevated mutation rates in CTCF DNA-binding sites 21,27,28 and validate our computational model. Smaller increases in mutation rates were also detected in melanoma, pancreatic and breast cancer (FDR ≤ 0.02). Additional signals were observed in specific subgroups of mutations. Grouping the mutations by reference and alternative nucleotides revealed a strong enrichment of T>G mutations (e.g., Liver-HCC, FDR = 1.4 × 10−33, FC = 1.56), T>C and T>A mutations. Interestingly, intergenic CTCF binding sites were particularly enriched in mutations in several cohorts. We then asked whether CTCF binding sites were characterised by specific COSMIC SBS mutational signatures. Esophageal and stomach cancers showed a strong enrichment of SBS17 mutations (SBS17b: FDR = 5.6 × 10−23, FC = 1.73; and FDR = 5.2 × 10−8, FC = 1.66) (Figure 1B), while this was not observed in Liver-HCC and other cancer types with frequent CTCF binding site mutations. The etiology of SBS17b is unknown, however it has been linked to acid reflux and oxidative damage to DNA in gastro-esophageal cancers 41,42, and a similar mutational signature found in metastatic tumors has been associated with the effects of nucleoside metabolic inhibitor chemotherapies capecitabine and 5-FU 16. Our analysis suggests that effects of these mutagens may be especially active at insulator and chromatin architectural elements bound by CTCF in tissues of the digestive system. 
This analysis refines the annotation of a mutational process acting on the DNA-binding sites of CTCF in a large dataset of whole cancer genomes.
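The FDR values reported here result from a multiple-testing correction across cancer types and mutation classes. The correction procedure is not named in this section, so the following sketch assumes the Benjamini-Hochberg method purely for illustration; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A test would be called significant when its adjusted value falls below the chosen threshold, such as the 0.05 cutoff used for most analyses above.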
Transcription start sites (TSS) of protein-coding genes were significantly enriched in mutations in the pan-cancer cohort (FDR = 1.2 × 10−35, FC = 1.07) and in cohorts of 13/25 cancer types, most prominently in melanoma (FDR = 1.6 × 10−17, FC = 1.18), breast, head, lung, ovary and pancreatic cancers (FDR ≤ 10−4). Stronger enrichments were detected among C>G and C>T mutations. Mutational signature analysis highlighted an elevated rate of the aging-associated signature SBS5 in the pan-cancer cohort (FDR = 1.5 × 10−11, FC = 1.06) and cohorts of four cancer types. The signature SBS3, associated with defects of homologous recombination-based DNA damage repair, was observed in the pan-cancer cohort (FDR = 5.7 × 10−19, FC = 1.14) (Figure 1C) as well as breast and ovarian cancers (FDR ≤ 10−3, FC ≥ 1.12). Spontaneous formation of endogenous double strand breaks at promoters has been associated with the pause and release of RNA polymerase II and linked to chromosomal translocations in cancer 43, suggesting a mechanism of TSS-specific mutagenesis in cancer genomes. Further mutational signature enrichments were identified in specific cancer types, such as the ultraviolet light signatures in melanoma (e.g., SBS7a: FDR = 3.6 × 10−16, FC = 1.24) and the tobacco-associated signature SBS4 in the two cohorts of lung cancers (FDR ≤ 10−3, FC ≥ 1.10). The enriched mutational signatures at TSSs match the major mutagens and exposures of these cancer types, indicating an overall increased vulnerability of TSSs to mutational processes. This analysis confirms previous reports of increased mutation rates in promoters in melanoma 24,25 and indicates that TSS-specific mutational processes are widely active in the pan-cancer context.
Pan-cancer open chromatin regions defined as ATAC-seq profiles of primary tumors were also enriched in mutations in 11/25 cancer types and the pan-cancer cohort, especially among C>G mutations (e.g., pan-cancer, FDR = 1.0 × 10−27, FC = 1.08) (Figure 1D). However, the effect sizes of mutational enrichments were more modest compared to CTCF binding sites and TSSs, potentially due to mixed effects of SNVs and indels. Mutation enrichments in open-chromatin sites were primarily driven by SNVs, whereas indel mutations were significantly depleted in the pan-cancer cohort and in 9/25 cancer types, indicating that the open-chromatin environment or the binding of regulatory elements may protect against the mutational processes responsible for generating indels. For example, in uterine adenocarcinoma, 1,762 indel mutations were observed in open-chromatin sites while 2,172 were expected according to RM2 (FDR = 5.3 × 10−9, FC = 0.82) (Figure 1E). This indel depletion appeared relatively stronger in intergenic sites of open chromatin. Further, mutational signature analysis indicated reduced activity of the aging-related signature SBS1 in open-chromatin sites, apparent in eight cancer types and the pan-cancer cohort (e.g., colorectal adenocarcinoma, FDR = 3.8 × 10−5, FC = 0.90). However, analysis of these pan-cancer open chromatin regions is better powered compared to the analysis of TSS loci due to a larger number of sites, thus smaller deviations in local mutation rates were detectable. Our findings contrast with an earlier report that indicated broadly decreased mutation rates at chromatin-accessible regulatory elements derived from cell lines 29. Comparison of localised mutation rates at TSS loci and open-chromatin sites indicates distinct properties of localised mutagenesis acting on proximal and distal regulatory elements of the genome.
We extended this analysis to benchmark our model and evaluate statistical power. First, to assess model calibration and false positive rates, we analysed a PCAWG dataset of simulated variant calls designed to approximate neutral genome evolution 4. As expected, analysis of simulated data did not reveal any significant differences in mutation rates in the three classes of elements (all FDR > 0.05). Quantile-quantile analysis of P-values confirmed that the model is well calibrated for true and simulated mutations (Supplementary Figure 2A). Second, we evaluated the statistical power of our model by analyzing down-sampled subsets of liver cancer genomes and CTCF sites (Supplementary Figure 2B). For example, the mutation enrichment in CTCF sites was detectable 80% of the time when sampling 75 genomes and 75,000 CTCF binding sites. Third, we varied the parameter corresponding to the normalised width of genomic elements for the three classes (Supplementary Figure 2C). Local differences in mutation rates were robustly detected for multiple element widths. However, mutation enrichments in chromatin architectural elements bound by CTCF were generally focused at narrower regions (50 bps) compared to gene-regulatory elements at TSS and open-chromatin regions (200 bps). This is reflective of their different element widths and indicative of differences in underlying mutational processes. In summary, our method provides a versatile and well-calibrated framework for analyzing localised mutation rates and mutational processes in cancer genomes.
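The down-sampling power analysis can be sketched as follows. This is a generic illustration rather than the procedure used for Supplementary Figure 2B: the function name, the number of repetitions and the seed are assumptions, and `detect` stands in for a full RM2 run on the sampled genomes and sites that returns True when the enrichment is called significant.

```python
import random

def estimate_power(genomes, sites, n_genomes, n_sites, detect,
                   n_rep=100, seed=0):
    """Fraction of random down-samples in which `detect` reports a signal.

    `detect(genome_subset, site_subset)` is a user-supplied callable
    (hypothetical here) that returns True for a significant result.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        # Sample genomes and sites without replacement for this replicate.
        genome_subset = rng.sample(genomes, n_genomes)
        site_subset = rng.sample(sites, n_sites)
        if detect(genome_subset, site_subset):
            hits += 1
    return hits / n_rep
```

Repeating the estimate over a grid of n_genomes and n_sites values would give a power surface of the kind summarised by the statement that the enrichment was detectable 80% of the time with 75 genomes and 75,000 sites.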
Functional associations of gene-regulatory and chromatin architectural mutations
We asked whether the increased mutational load at TSSs and open-chromatin sites correlated with transcriptional activity of target genes. To precisely quantify mRNA abundance in different cancer types, we used matching RNA-seq data in the PCAWG project 44. TSS-specific mutation rates were studied in six bins of genes grouped by median mRNA abundance in 19/25 cancer types. TSS-specific mutation rates were significantly increased in highly expressed genes in eleven cancer types (FDR < 0.05) and the pan-cancer cohort (FDR = 2.2 × 10−50, FC = 1.16) (Figure 2A) (Supplementary Table 1B). Association of transcriptional activity and TSS-specific mutation rates was the strongest in melanoma and ovarian, lung, pancreatic, and breast adenocarcinoma where genes with the highest mRNA abundance were strongly enriched in TSS-specific mutations (FDR ≤ 10−4, FC ≥ 1.15) (Figure 2B). Of note, the top gene bin was highly variable in mRNA abundance (e.g., 11-7,100 RPKM-UQ in the pan-cancer cohort). In silenced genes, TSS loci consistently showed no significant differences in mutation rates relative to flanking controls (FDR > 0.05), however the overall mutation rate in sites and controls was higher, potentially a result of lower rates of transcription coupled repair in closed chromatin. This analysis highlights a localised mutational process in gene promoters apparent in multiple cancer types that is potentially driven by transcriptional initiation or TF binding at core gene-regulatory promoter elements.
Compared to TSSs, mutation rates in open-chromatin sites showed only limited associations with mRNA abundance (Figure 2B). The open-chromatin sites of the most highly expressed genes were significantly associated with higher mutation rates, however the effect sizes of fold-changes were consistently lower (e.g., pan-cancer, FDR = 9.3 × 10−17, FC = 1.04). We performed a down-sampling analysis of open-chromatin sites and found no significant associations of highly expressed genes and localised mutation rates when considering random subsets of sites comparable with the analysis of TSSs. Since the number of open-chromatin sites is considerably larger, the observed localised increase in mutation rates can be partly attributed to the improved statistical power that allows detection of smaller effects (Supplementary Figure 4). Thus, our observed transcription-dependent mutational process appears to be more active at promoters while the broader spectrum of proximal and distal pan-cancer regulatory elements is less affected.
To understand the functional associations of TSS mutations, we performed a pathway enrichment analysis by adapting RM2 to gene sets of GO biological processes and Reactome molecular pathways 45. This allowed us to test our hypothesis that promoters enriched for non-coding mutations are concentrated in specific biological processes. We found 546 unique pathways and processes exhibiting elevated TSS-specific mutation rates (FDR < 0.1) (Figure 2C) (Supplementary Table 1C), the majority of which were found in the melanoma cohort (70%) while 28% of pathways were found in more than one cancer type. Translation, ribosome biogenesis and RNA processing were among the largest groups of pathways with increased TSS-specific mutation rates. This is expected as the translational machinery is ubiquitously active in proliferating cells and includes many highly expressed genes. Cancer-related processes and pathways were also enriched in TSS mutations, for example mitotic cell cycle, apoptosis, DNA repair, angiogenesis, developmental and immune response processes as well as druggable signalling pathways (e.g., MAPK, Wnt, Notch) were identified in multiple cancer types. The pathway analysis augments our observation of frequent TSS mutations associating with increased transcription and highlights a variety of core cellular processes with high baseline transcription and associated localised mutagenesis active in many cell types. Furthermore, the significant enrichment of TSS-specific mutations in genes of cancer-related processes suggests that some more frequent non-coding mutations at individual promoters are functional and may contribute to cancer driver mechanisms and tumor heterogeneity by altering gene-regulatory circuits in molecular pathways and interaction networks 37.
We asked whether the elevated mutation rates in CTCF binding sites were also associated with their functional characteristics. We used the extent of conservation of DNA-binding across cell types as a proxy of site functionality across an extended set of 162,209 CTCF binding sites catalogued in 70 cell lines in ENCODE. We grouped CTCF binding sites into five equal bins based on the number of cell lines where the site was detected, with entirely tissue-specific CTCF binding sites in bin one and the constitutively bound subset of sites detected in most or all cell types in bin five (median conservation in 67 cell lines; range 52-70) (Figure 3A). Strikingly, the localised elevations of mutation rates were exclusively detected in the subset of constitutively bound CTCF binding sites, as observed in eleven cancer types and the pan-cancer cohort (FDR = 3.3 × 10−89, FC = 1.23) (Figure 3B) (Supplementary Table 1D). In contrast, all other bins showed no significant enrichment of mutations (FDR > 0.05). Additional cancer types with frequent CTCF binding site mutations were uncovered in this more focused analysis, including colorectal and pancreatic cancer and lymphoma (BNHL) (FDR ≤ 10−6, FC ≥ 1.18). This extends the established pattern of CTCF mutation enrichment to a subset of pan-tissue genomic elements and highlights the utility of RM2 in detecting mutational patterns. Since CTCF is known as the master regulator of genome architecture, we incorporated a HiC dataset of chromatin long-range interactions 46 to interpret the mutation enrichments at highly conserved CTCF binding sites. We found that these constitutively bound CTCF binding sites were strongly enriched in chromatin loop anchor elements: 30% (9,393/32,442) of the sites of bin five were located within 1 kbps of anchor midpoints (3,973 or 12% expected, Fisher’s exact P = 0). 
In contrast, the majority of CTCF binding sites that were only detected in a few cell types showed no deviation from expected mutation rates and an expected distribution with respect to chromatin loop anchors. This indicates that the mutational process primarily affects the subset of CTCF sites that are constitutively bound in most cell types and participate in chromatin architectural and gene regulatory interactions 46,47. Conservation is a property of functionally integral CTCF binding sites, whose disruption can lead to changes in underlying chromatin architecture and gene regulation 48 and is associated with activation of proto-oncogenes 26.
Localised mutation rates associate with recurrent mutations of BRAF, RAD21 and SDHA
To find potential genetic mechanisms of localised mutagenesis, we asked whether any recurrent mutations in tumor genomes could explain the observed mutation rate increases in the three classes of genomic elements. We considered 14 cancer types with 15 driver genes with frequent SNVs and indels detected using the ActiveDriverWGS method 34, and 60 recurrent copy-number amplifications detected in the PCAWG project using the GISTIC2 method 49 (Supplementary Figure 5). We found 27 driver alterations that positively interacted with local mutation rates, including 26 recurrent copy-number amplifications and one driver gene with SNVs and indels (RM2 FDR < 0.05, interaction P < 0.05) (Figure 4A) (Supplementary Table 1E-F). Significant interactions were detected in seven cancer types, mostly in breast adenocarcinoma (17) and one to three interactions each in stomach, pancreatic, ovarian, lung and liver cancers and melanoma. Six genomic amplifications were identified twice (3q26.2, 6q21, 8q22.2, 8q23.3, 8q24.21, 9p24.1). The majority of interactions (20/27) were found for CTCF binding site mutations. The large number of statistical interactions seen in breast cancer suggests that overall chromosomal instability may contribute to mutational processes at CTCF binding sites 28 in this cancer type. However, several specific examples of potential genetic factors contributing to local mutagenesis were also found.
Driver mutations in BRAF in melanoma were associated with an increased mutation rate at CTCF binding sites (FDR = 5.5 × 10−5, FC = 1.07, interaction P = 0.032). Comparison of driver-mutated and wildtype tumors confirmed the interaction: 30 tumors with BRAF mutations showed a significant increase in mutation burden at CTCF sites (FDR = 1.8 × 10−5, FC = 1.10) while the remaining 33 BRAF-wildtype tumors showed no significant change (FDR = 0.34, FC = 1.02) (Figure 4B). The activating V600E amino acid substitutions in the BRAF serine/threonine kinase and proto-oncogene define a druggable subtype of melanoma 50,51. Ectopic expression of V600E-mutant BRAF in epithelial cell lines was shown to induce DNA double strand breaks and reactive oxygen species 52. In this cohort, 23 melanomas carried V600E substitutions and four additional tumors had V600K substitutions. Therefore, the melanomas defined by BRAF driver mutations may have an increased activity of the mutational process acting on CTCF binding sites.
To further map the potential genetic mechanisms of local mutagenesis, we determined the genes in the highlighted CNA regions that responded transcriptionally to genomic amplifications. We found 260 unique genes in 19 recurrently amplified regions that were up-regulated in tumors with amplifications (Wilcoxon test, FDR < 0.05) (Figure 4C) (Supplementary Table 1G). These associations were identified for all three categories of genomic elements and six cancer types were represented (breast, liver, lung squamous, ovary, stomach and pancreatic). Eleven known cancer genes were found in single cancer types (RECQL4, CCND1, SDHA, DEK, CDK12, ERBB2, JAK2, PDCD1LG2, BRAF, EXT1, PTK6) according to the Cancer Gene Census database. Notably, the cohesin subunit RAD21 was the only gene associated with localized mutagenesis in two cancer types. Thus amplification-driven activation of hallmark cancer genes may directly or indirectly affect the activity of localised mutational processes.
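The comparison of mRNA abundance between amplified and non-amplified tumors relies on a Wilcoxon test. A minimal sketch of a two-sided Wilcoxon rank-sum (Mann-Whitney U) test is given below. It assumes the normal approximation with tie correction and no continuity correction, which may differ from the exact variant used in the analysis; the function name is hypothetical, and the multiple-testing correction applied across genes is not included.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test, normal approximation.

    Ties receive average ranks and the variance includes the tie correction.
    Returns (U statistic for x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values tied, no evidence of a shift
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))
```

In this setting, x would hold the expression values of one gene in tumors carrying the amplification and y the values in tumors without it.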
Amplification and transcriptional up-regulation of RAD21 located at 8q23.3 was associated with increased mutation rates in CTCF binding sites in stomach and breast cancers. RAD21 encodes a subunit of the cohesin complex that co-binds DNA with CTCF to orchestrate transcriptional insulation and chromatin architectural interactions 46,47. Stomach cancer genomes with amplifications of 8q23.3 showed a strong increase in mutations in CTCF binding sites (FDR = 3.1 × 10−12, FC = 1.24) while the increase was less significant in tumors with no amplification (FDR = 6.6 × 10−5, FC = 1.13; interaction P = 0.020) (Figure 4D). mRNA abundance of RAD21 in 8q23.3-amplified stomach cancers was significantly higher compared to non-amplified samples (FDR = 2.6 × 10−4, 46 vs. 27 median FPKM-UQ), suggesting that RAD21 expression is driven by the genomic amplification. This association was also observed in breast cancer where the genomes with 8q23.3 amplifications showed an increased mutation rate in CTCF binding sites (FDR = 5.3 × 10−5, FC = 1.08) that was not apparent in tumors lacking 8q23.3 amplifications (FDR = 0.36, FC = 0.98; interaction P = 5.0 × 10−4). RAD21 amplification was also associated with increased mRNA abundance in breast cancer (FDR = 1.4 × 10−7, 108 vs. 50 median FPKM-UQ), which is also indicative of poor prognosis 53. In the ENCODE dataset, 51% of CTCF binding sites were also co-bound by cohesin, representing 94% of the high-confidence RAD21 binding sites in ENCODE (60,636 / 64,483), significantly more than expected by chance (666 expected, binomial P = 0). Mutations in another cohesin subunit, STAG2, have been associated with specific mutational signatures and altered transcription at double strand break sites 54. In addition to RAD21, components of the general transcriptional machinery were amplified in 8q23.3 and showed increased mRNA abundance in breast and stomach cancers, potentially contributing to mutation rates in CTCF binding sites. 
These included MED30, which encodes a subunit of the Mediator transcriptional coactivation complex that interacts with cohesin to connect promoters and enhancers 55, as well as the transcription initiation factor subunit encoded by TAF2. Alternative explanations to CTCF binding site mutations were also apparent in our data. For example, interactions with the amplification of 8q24.21 encoding the MYC oncogene were identified in both breast and stomach cancer, however these amplification events did not associate with MYC up-regulation. This analysis suggests that the elevated mutagenesis at CTCF binding sites may be driven by the genomic amplification and up-regulation of core transcriptional and genome architectural machinery interacting with CTCF.
As another example, amplification of 5p15.33 was associated with increased mutation rates at open-chromatin sites in lung squamous cell carcinoma (Lung-SCC) (Figure 4E). 27 tumors with this amplification showed an elevated mutation rate in open-chromatin sites (FDR = 1.2 × 10−6, FC = 1.03) while 18 non-amplified tumors showed no significant difference (FDR = 0.10, FC = 1.01; interaction P = 0.037). The cancer gene SDHA and nine other genes in the region showed significant up-regulation in tumors with this amplification (FDR = 7.0 × 10−5, 19 vs. 13 FPKM-UQ). SDHA encodes a subunit of the mitochondrial succinate dehydrogenase (SDH) complex involved in cellular energy metabolism through the citric acid cycle and the electron transport chain. Germline mutations of the tumor suppressor SDHA and the genes encoding other SDH subunits predispose individuals to the neuroendocrine tumors pheochromocytomas and paragangliomas 56,57. Mutations and inhibition of SDH subunits are associated with increased oxidative stress, production of reactive oxygen species (ROS) and activation of the hypoxia-inducible factor HIF 58,59. These data suggest that the increased mutation rate in open-chromatin sites in Lung-SCC is associated with the genomic amplification and up-regulation of SDHA that leads to the destabilization of the SDH complex and results in increased oxidative damage in open-chromatin sites.
In summary, our analysis provides a catalogue of potential genetic mechanisms underlying localised mutation rate variations in cancer genomes. While a subset of these driver mutations and recurrent copy-number amplifications may be directly involved in processes of mutagenesis and DNA repair, others may represent tumor subtypes with specific exposures or endogenous factors.
Discussion
The cancer genome is molded by diverse mutational processes that continuously shape its broad megabase-scale features and the fine context of nucleotide signatures. Here we focused on the mutational processes of an intermediate scale that affect thousands of functional genomic elements each spanning dozens to hundreds of nucleotides. Using a novel computational framework, we mapped widespread enrichments and some depletions of mutations in gene-regulatory and chromatin architectural elements. These non-coding mutations associated with transcriptional and pathway activity, trinucleotide mutational signatures, and conservation of site activity across cell types. We found interactions with recurrent driver mutations and copy-number amplifications that provide hypotheses regarding the mechanisms of the underlying mutational processes; for example, copy-number alterations and driver mutations in RAD21, SDHA and BRAF were indicative of increased mutation rates in CTCF binding sites and open-chromatin sites. In particular, the finding of RAD21 amplifications associating with CTCF binding site mutations fits with our observation of the constitutively active, hypermutated subset of CTCF binding sites enriched at chromatin loop anchors, as cohesin and CTCF co-bind DNA to facilitate chromatin architectural interactions. Overall, we speculate that the localised mutations represent a functional continuum of passengers and drivers. On the one hand, the mutation rates of these genomic elements likely deviate from the background rates due to focal carcinogen exposures or interactions with transcriptional, DNA replication or repair machinery that make these sites more or less vulnerable to mutations. On the other hand, some of these functional elements may control transcription regulatory interactions or epigenetic states of genes and pathways involved in cancer and their excess mutations reflect positive selection. 
While it is unlikely that all functional elements of a specific class would positively impact oncogenic processes when mutated, specific subsets of elements may contribute to hallmarks of cancer as suggested by our pathway analysis. Our computational framework and the detailed catalogue of localised mutational processes and genetic interactions detected in a large pan-cancer dataset provide specific hypotheses for further study.
Our analysis has certain caveats and limitations. We analysed a generic catalogue of genomic elements that only provides limited representation of the primary tumors in our cohort. To address this limitation, we used matching RNA-seq data to stratify regulatory elements based on their activity in specific cancer types and also considered open-chromatin profiles of primary tumors in the first analysis of this kind. Future analyses will benefit from detailed multi-omics cohorts with matching genomic, transcriptomic and epigenomic profiles of individual tumors. Also, the current framework is designed for genomic elements of uniform width, and analysis of elements of variable width, such as exons or non-coding RNAs, may lead to statistical biases. Further, our analysis suggests that different classes of gene-regulatory and architectural elements of the genome may be subject to localised mutational processes that have footprints of different sizes. We therefore recommend evaluating the site width parameter of the method when analysing new genomic elements. Our method is designed to quantify localised differences of mutation rates acting on an entire class of genomic elements with thousands to hundreds of thousands of genomic loci. It is not powered to evaluate a single genomic element as a potential cancer driver and alternative methods should be used for this purpose. However, we have adapted our method to evaluate TSS-specific mutation rates in gene sets representing biological processes and pathways with hundreds to thousands of genes.
Our study enables a number of future developments. Integrative analysis of whole cancer genome sequences and rich clinical and pathological profiles of tumors 17 may highlight associations of clinical variables and localised mutagenesis and lead to the discovery of novel WGS-based biomarkers. Considering patient lifestyle information, environmental exposures and germline variation in the analysis may elucidate the impact of carcinogens and endogenous DNA repair deficiencies. Our catalogue of genetic associations provides hypotheses on the mutational mechanisms that can be tested experimentally using genome editing and mutagenesis assays. Rare germline variants in the human population 60, de novo variants detected in genetic disorders 21 and the widespread somatic genome variation found in healthy tissues 61 provide further avenues to study mutational processes acting at functional non-coding elements. Our study enables a detailed annotation of localised mutational processes in whole genomes to decipher cancer driver mechanisms, molecular heterogeneity and genome evolution.
Methods
Regression models for Regionalised Mutations (RM2)
Local differences in mutation rates in functional genomic elements (i.e., sites) were evaluated using a negative binomial regression model we refer to as RM2. Single nucleotide variants (SNVs) and small insertions-deletions (indels) were analysed. The model simultaneously considers a collection of non-coding sites, such as regulatory elements that are commonly ∼10–1,000 bps in length and measured in ChIP-seq and related experimental assays in thousands to hundreds of thousands of genomic loci. Sites were uniformly redefined using their median coordinate and added sequences of fixed width upstream and downstream of the sites (e.g., ±25 bps or 50 bps around the midpoints of CTCF binding sites). Upstream and downstream flanking sequences of these sites were used as control regions to estimate expected mutation rates. Control regions were of equal width to sites such that the upstream and downstream regions combined were twice as wide as the sites. To account for megabase-scale variation in mutation rates, we computed the total log-transformed mutation count for each site within its one-megabase window (i.e., ±0.5 Mbps around site midpoint). Based on this estimate, all sites were distributed into ten equal bins (MbpRate). The value of ten bins worked well in our benchmarks and captured variation in smaller and larger cohorts of individual cancer types. However, custom values of this parameter can be used. Mutation rates for sites and flanking sequences for each bin were defined separately and sites and flanking sequences were distinguished by a binary cofactor (isSite). To avoid inflated counts, mutations affecting more than one site were counted once. Likewise, mutations affecting the flanking sequence of more than one site were also counted once. 
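The site and control-region definitions above can be illustrated with a minimal Python sketch. It is not the RM2 implementation: the function names `site_and_flanks` and `assign_equal_bins`, the half-open interval convention and the rank-based binning rule are assumptions made for illustration only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); this convention is an assumption


def site_and_flanks(midpoint: int, half_width: int) -> Tuple[Interval, Interval, Interval]:
    """Return (site, upstream flank, downstream flank) around a site midpoint.

    The site spans midpoint +/- half_width. Each flank is as wide as the site,
    so the two flanks combined are twice as wide as the site.
    """
    width = 2 * half_width
    site = (midpoint - half_width, midpoint + half_width)
    upstream = (site[0] - width, site[0])
    downstream = (site[1], site[1] + width)
    return site, upstream, downstream


def assign_equal_bins(values: List[float], n_bins: int = 10) -> List[int]:
    """Assign each value a bin index in 1..n_bins so that bins have near-equal size.

    Values stand in for the megabase-scale mutation estimates of the sites.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values) + 1
    return bins


# Hypothetical CTCF-like site: 50 bps site (+/- 25 bps) with two 50 bps flanks.
site, up, down = site_and_flanks(midpoint=1000, half_width=25)
```

With `half_width=25`, the site covers 50 bps and the two flanks cover 100 bps in total, matching the 1:2 ratio of site to control sequence described above.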
Sequence positions were counted separately by their trinucleotide context and expanded to three alternative nucleotides to account for the potential sequence space where such single nucleotide variants could occur (nPosits). The observed mutations in these contexts were also counted (nMut) and a cofactor was used to add separate weights to different trinucleotide classes (i.e., reference trinucleotide and alternative nucleotide; triNucMutClass). Indels were counted under another entry in triNucMutClass such that all mutation counts were summed and the entire genomic space was accounted for. An optional binary cofactor (coFac) was included to allow the consideration of genetic or clinical covariates of localised mutation rates. To evaluate the significance of localised mutation rates in sites compared to flanking control regions, we first constructed a null model by excluding the term isSite:
Hnull : NegBin(nMut ∼ offset(log(nPosits)) + triNucMutClass + log1p(MbpRate) + coFac).
The main model representing the alternative hypothesis of a site-specific mutation rate was constructed as follows:
Halt : NegBin(nMut ∼ offset(log(nPosits)) + triNucMutClass + log1p(MbpRate) + coFac + isSite).
We extended our model to evaluate whether localised mutation rates differed between two subtypes of tumors, such as those defined by clinical annotations or genetic features using the term coFac. Trinucleotide sequence content, trinucleotide mutational signatures and megabase-scale covariations of mutation rates were computed separately for the two sets of tumors. To establish the associations of localised mutation rates and driver mutations, we added to the initial model the term isSite:coFac mapping the interaction of the tumor subtype and the cofactor distinguishing sites and flanking sequences:
Hcof : NegBin(nMut ∼ offset(log(nPosits)) + triNucMutClass + log1p(MbpRate) + coFac + isSite + isSite:coFac).
We used likelihood ratio tests to compare the models and evaluate the significance of localised mutation rates (Halt vs. Hnull to evaluate the term isSite). Chi-square P-values from the likelihood ratio tests were reported for each analysis. We also reported coefficient values of the term isSite to characterise enrichment or depletion of mutations at sites relative to flanking controls for positive and negative values, respectively. The interactions of driver mutations and mutation rates were evaluated using likelihood ratio tests that compared the models Halt and Hcof. Only the models with significant positive coefficients were reported. The expected mutation counts were derived from each model by 1000-fold sampling of mutation counts from the negative binomial distribution informed by the fitted probabilities and theta values derived from the regression models. Fold-change values were derived by dividing median observed and expected mutation counts, and confidence intervals were derived using the 2.5th and 97.5th percentiles of sampled values. Chi-square P-values from the models were corrected for multiple testing using the Benjamini-Hochberg procedure where appropriate. Besides modelling total mutations in sites and flanking sequences, we evaluated mutations of multiple sub-classes, such as mutations stratified by transcriptional activity, COSMIC mutational signatures or DNA strands. Mutation subclass analysis was conducted as described above. The same megabase-scale mutation rates estimated for all mutations were used rather than those of the specific subclasses. The method is available at https://github.com/reimandlab/RM2.
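The likelihood ratio test that compares two nested models can be sketched as follows. This is an illustrative Python fragment, not code from RM2: the function name `lrt_pvalue_1df` and the example log-likelihood values are hypothetical. It assumes that the nested models differ by exactly one parameter (isSite for Halt vs. Hnull, or isSite:coFac for Hcof vs. Halt), so that the test statistic is compared against a chi-square distribution with one degree of freedom.

```python
import math


def lrt_pvalue_1df(loglik_null: float, loglik_alt: float) -> float:
    """Chi-square P-value of a likelihood ratio test with one degree of freedom.

    loglik_null and loglik_alt are the maximised log-likelihoods of the
    simpler and the extended model, respectively.
    """
    # Test statistic: twice the gain in log-likelihood (clipped at zero
    # to guard against small negative values from numerical noise).
    stat = max(2.0 * (loglik_alt - loglik_null), 0.0)
    # For one degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods of a null model and an alternative model.
p_value = lrt_pvalue_1df(loglik_null=-1250.0, loglik_alt=-1240.0)
```

The closed form with `math.erfc` avoids a dependency on a statistics library; with more than one additional parameter, a general chi-square survival function would be needed instead.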
Somatic mutations in whole cancer genomes
Somatic single nucleotide variants (SNVs) and short insertions-deletions (indels) in the genomes of 2,583 primary tumors were retrieved from the uniformly processed dataset of the Pan-cancer Analysis of Whole Genomes (PCAWG) project of the ICGC and TCGA 3. We used consensus variant calls mapped to the human genome version GRCh37 (hg19). We removed 69 hypermutated tumors with at least 30 mutations per Mbps, resulting in 2,514 tumors and 24.7 million mutations. We also removed tumors for which mutational signature predictions were not available in PCAWG. We analysed tumor genomes of the pooled pan-cancer cohort of multiple cancer types, and also 25 cohorts of specific cancer types with at least 25 samples in the PCAWG cohort. We excluded a small subset of tumors (31 or 1.3%) where localised mutation rates were exceptionally strong even when analysing one tumor genome at a time (FDR < 0.001, RM2). Based on our initial analyses, we found that the individual contribution of these tumors to the overall analysis would have caused overestimates of mutation rates. To enable this filtering, we performed tumor specific analyses for the three classes of sites (open-chromatin sites, binding sites of CTCF, and TSSs). We analysed each cohort of a cancer type separately and grouped the mutations according to tumor ID, allowing the model to learn an expected background mutation rate in the respective cohort and then test each tumor genome separately for localised mutation rates. To perform this single-tumor analysis in smaller cohorts within the PCAWG dataset (<25 tumors of a given type), we created a meta-cohort by pooling these smaller cohorts. After filtering hypermutated tumors, tumors without PCAWG signatures, and tumors with exceptionally strong signals of localised mutations, we derived a conservative final set of 2,421 genomes of 35 cancer types with 23 million mutations including 1.61 million indels. 
To evaluate the performance of our model, we also processed a dataset of simulated variant calls for the same set of tumors derived from the PCAWG project (i.e., the Broad dataset) 4.
Mutation features and signatures
In addition to evaluating total mutations, several classes of mutations were analysed separately. Mutations were mapped to C and T nucleotides and grouped by reference and alternative nucleotides (C>[A,G,T], T>[A,C,G]). Four additional classifications were developed. First, mutations were classified as located either on the Watson (w) strand if the original reference nucleotide was C or T, or the Crick (c) strand if the original reference nucleotide was A or G. Second, transcriptional activity and orientation of the mutated nucleotides was mapped based on the coordinates of protein-coding genes defined in the Ensembl database (GRCh37) using 500 bps flanking sequence at both ends of genes to account for transcriptional initiation and termination. We then classified mutations as forward-transcribed (F), reverse-transcribed (R), bidirectionally transcribed (B), or not transcribed (O). This initial classification did not include information on tissue-specific transcription and was augmented using matching tumor-specific mRNA abundance data, as described below. Third, mutation strand and transcription status were combined into eight categories (w_[F,R,B,O] and c_[F,R,B,O]). Fourth, we classified mutations by the trinucleotide signatures of single base substitutions (SBS) that were derived earlier using the SigProfiler software in the PCAWG project 14. We assigned each mutation to its most probable signature in the given patient tumor based on its trinucleotide context. For model evaluation, these five major categories of mutations were also derived for the dataset of simulated variant calls.
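The mapping of mutations to pyrimidine reference nucleotides and strand labels can be sketched in Python as follows. The function names `classify_snv` and `strand_transcription_class` are hypothetical and the fragment only illustrates the classification rules stated above; it does not reproduce the original pipeline.

```python
from typing import Tuple

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_snv(ref: str, alt: str) -> Tuple[str, str]:
    """Return (substitution class, strand label) for a single nucleotide variant.

    Mutations with reference C or T are kept as they are and labelled 'w';
    mutations with reference A or G are complemented and labelled 'c'.
    """
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a valid SNV: {ref}>{alt}")
    if ref in ("C", "T"):
        return f"{ref}>{alt}", "w"
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}", "c"


def strand_transcription_class(strand: str, transcription: str) -> str:
    """Combine strand (w/c) and transcription status (F/R/B/O), e.g. 'w_F'."""
    if strand not in ("w", "c") or transcription not in ("F", "R", "B", "O"):
        raise ValueError("unknown strand or transcription label")
    return f"{strand}_{transcription}"


# A G>A mutation is reported as C>T on the Crick strand.
sub_class, strand = classify_snv("G", "A")
```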
Chromatin architectural and gene-regulatory genomic elements
We performed a systematic analysis of three classes of genomic elements: DNA-binding sites of CTCF (CCCTC-binding factor) detected in multiple human cell lines, transcription start sites (TSS) of protein-coding genes, and open-chromatin sites (ATAC-seq sites) detected in human primary tumors. CTCF binding sites were retrieved from the ENCODE project 39. Sites observed in only one cell line were removed, resulting in 119,464 sites across 70 cell lines. TSS loci of protein-coding genes were retrieved from Ensembl Biomart (GRCh37) and filtered to retain those located on standard chromosomes (1-22, X, Y), resulting in 37,309 TSSs of 18,710 genes. Open-chromatin sites of 410 primary tumors defined by ATAC-seq were retrieved from the TCGA study 40. We used the pan-cancer set of sites in the GRCh37 genome as defined in the study and removed sites on non-standard chromosomes and those lacking defined coordinates in GRCh37. This resulted in 561,057 open-chromatin sites. For the analysis of mRNA abundance described below, we further filtered open-chromatin sites based on their target genes as defined in the original study. We selected the subset of open-chromatin sites where predicted target genes were available, mapped the gene symbols to ENSG identifiers using Ensembl Biomart (GRCh37, release 100), and removed open-chromatin sites with missing or ambiguous gene symbols, resulting in 438,948 sites annotated to 17,116 genes. Throughout the study, the three classes of sites were normalised to uniform width based on median coordinates. CTCF binding sites were defined using 50 bps (±25 bps) windows around the midpoint of sites. Midpoints of TSS loci were defined in the Ensembl database and we used a 200 bps (±100 bps) window around the TSSs. Open-chromatin sites were also defined using a 200 bps (±100 bps) window around site midpoints. 
A systematic analysis was used to explore various values of the site width parameters and the final selection was based on the strength of signal and consistency (Supplementary Figure 2). For CTCF binding site analysis, we also retrieved DNA-binding sites of the cohesin protein RAD21. These sites were also retrieved from the ENCODE dataset and those only observed in single cell lines were filtered, resulting in 64,483 high-confidence sites. The majority (94% or 60,636) of high-confidence RAD21 sites overlapped with high-confidence CTCF sites (i.e., those observed in at least two cell lines in ENCODE). We evaluated the enrichment of RAD21 binding sites in CTCF binding sites with a binomial test, using RAD21-bound fraction of the human genome (kbps) as the expected probability, and total sequence coverage of CTCF sites (kbps) and RAD21-cobound CTCF sites (kbps) as the numbers of tries and successes, respectively.
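The binomial enrichment test described above can be sketched in Python as an upper-tail probability. The function name `binom_upper_tail` and the example numbers are hypothetical; the sequence coverage values used in the study are not restated here, and a one-sided test is assumed.

```python
import math


def binom_upper_tail(successes: int, trials: int, prob: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, prob).

    In the setting above, trials would be the total coverage of CTCF sites,
    successes the coverage of RAD21-cobound CTCF sites, and prob the
    RAD21-bound fraction of the genome.
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    total = 0.0
    for k in range(successes, trials + 1):
        # Log of the binomial probability mass, computed via log-gamma
        # to avoid overflow of the binomial coefficient.
        log_pmf = (
            math.lgamma(trials + 1)
            - math.lgamma(k + 1)
            - math.lgamma(trials - k + 1)
            + k * math.log(prob)
            + (trials - k) * math.log1p(-prob)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Hypothetical example: 30 successes in 100 tries at an expected rate of 10%.
p_value = binom_upper_tail(successes=30, trials=100, prob=0.1)
```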
Grouping gene-regulatory sites by mRNA abundance
TSS and open-chromatin sites were analysed in groups based on the mRNA abundance of associated genes in matching tumors. TSS target genes were retrieved from the Ensembl database and target genes of open-chromatin sites were retrieved from the original TCGA study. This analysis was carried out in 19 cohorts of cancer types with at least 20 tumor samples with mRNA and WGS data, as well as the pan-cancer cohort of all cancer types. We used the uniformly processed PCAWG RNA-seq dataset 44 (FPKM-UQ), applied the same filtering of tumor samples described previously and excluded non-coding genes. Additionally, we discarded a subset of genes with duplicated HGNC symbols as well as the genes for which TSS or open-chromatin sites were not mapped. This resulted in mRNA measurements for 20,042 protein-coding genes in 1,267 tumor transcriptomes. Next we derived the gene lists grouped by median mRNA abundance. Six exclusive lists of genes were compiled for each cancer type based on mRNA abundance values in the matching samples, including silent genes (median zero FPKM-UQ) and five lists of non-silent genes of equal size grouped into 20% bins. For the pan-cancer analysis, we binned genes using median mRNA abundance in the entire RNA-seq dataset.
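The grouping of genes into one silent list and five equally sized lists of non-silent genes can be sketched as follows. The function name `group_genes_by_expression`, the input layout (gene symbol mapped to per-tumor abundance values) and the tie-breaking by gene name are assumptions for illustration.

```python
from statistics import median
from typing import Dict, List


def group_genes_by_expression(expr: Dict[str, List[float]], n_bins: int = 5) -> Dict[str, int]:
    """Map each gene to a group: 0 for silent genes, 1..n_bins for non-silent genes.

    Silent genes have a median abundance of zero across tumors. Non-silent
    genes are ranked by median abundance and split into n_bins groups of
    near-equal size (20% bins for n_bins = 5).
    """
    medians = {gene: median(values) for gene, values in expr.items()}
    groups = {gene: 0 for gene, m in medians.items() if m == 0}
    expressed = sorted(
        (gene for gene, m in medians.items() if m > 0),
        key=lambda gene: (medians[gene], gene),  # gene name breaks ties
    )
    for rank, gene in enumerate(expressed):
        groups[gene] = rank * n_bins // len(expressed) + 1
    return groups


# Hypothetical abundance values for three tumors per gene.
example = {"G0": [0, 0, 2], "G1": [1, 1, 1], "G2": [9, 8, 7]}
groups = group_genes_by_expression(example, n_bins=2)
```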
Grouping CTCF binding sites by cell type specificity
To analyse CTCF binding sites by their tissue and cell type specificity, we grouped all 162,209 CTCF binding sites of the ENCODE dataset into five equally sized bins based on the number of cell lines where the sites were detected. To interpret these CTCF sites, we retrieved chromatin loops in eight cell lines from a Hi-C study 46, used a ±1,000 bps window around loop anchor midpoints to define narrower versions of loop anchors, and counted the number of CTCF binding sites in each bin overlapping these loop anchors. We used a Fisher’s exact test to evaluate the enrichment of CTCF binding sites at loop anchors among the CTCF binding sites with constitutive activity across cell types (i.e., the 5th bin of CTCF sites).
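The enrichment of loop-anchor overlaps among constitutively active CTCF binding sites can be evaluated with a Fisher's exact test, sketched below in Python. The function name `fisher_exact_greater`, the one-sided alternative and the counts in the example are assumptions; the text above does not specify the sidedness of the test.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for enrichment in the 2x2 table [[a, b], [c, d]].

    Example layout: a = constitutive sites at loop anchors, b = constitutive
    sites elsewhere, c = other sites at loop anchors, d = other sites elsewhere.
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1 = a + b          # constitutive sites
    col1 = a + c          # sites at loop anchors
    n = a + b + c + d     # all sites
    upper = min(row1, col1)
    numerator = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return numerator / comb(n, col1)


# Hypothetical counts for illustration only.
p_value = fisher_exact_greater(a=40, b=60, c=30, d=370)
```

Exact integer arithmetic keeps the result accurate for small tables; for tables with hundreds of thousands of sites, a library implementation would be considerably faster.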
Analysis of localised mutation rates in gene-regulatory and chromatin architectural elements
First, we evaluated the localised mutation rates in CTCF binding sites, TSSs and cancer-specific open-chromatin sites (i.e., TCGA ATAC-seq sites) for the pan-cancer cohort and all cohorts of selected cancer types. Total mutations and mutations grouped by COSMIC signatures, mutation and transcription strand, and reference/alternative allele were analysed. Indel mutations were analysed as part of total mutations and also as a separate group. Results were adjusted for multiple testing using the Benjamini-Hochberg false discovery rate procedure and filtered (FDR < 0.05). We also analysed the simulated variant call set using the same pipeline and found no significant results, as expected (FDR < 0.05). Results of the systematic analysis were visualised as a dot plot. FDR values in the main dot plot were capped at 10−32 for visualisation purposes. To visualise localised mutation rates, all sites were pooled, aligned using median coordinates and trimmed to uniform lengths. Coordinates were transformed relative to site midpoint. Upstream and downstream flanking sequences of equal length were also considered. Local regression (loess) curves with the span parameter of 33% were used to visualise a smoothened mutation frequency in sites relative to flanking sequences.
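The Benjamini-Hochberg adjustment used throughout the analysis can be written compactly. The following Python sketch shows the standard step-up procedure; the function name `benjamini_hochberg` is chosen for illustration and the input P-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values (FDR) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value and enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P-values; results with FDR < 0.05 would be retained.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in fdr]
```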
Transcriptomic and functional associations of localised mutation rates
We evaluated the localised mutation rates in TSSs and open-chromatin sites grouped by mRNA abundance of genes in matching tumor types. Again, the results were adjusted for multiple testing using the Benjamini-Hochberg false discovery rate procedure and filtered (FDR < 0.05). To compare different cohorts and subsets of sites, we normalised per-nucleotide mutation counts by dividing them by the number of sites in each gene bin and by the number of tumors in each cohort. Normalised counts were multiplied by 1e6 to quantify a per-tumor, per-megabase average mutation rate. To study the functional associations of localised mutation rates at CTCF binding sites, we asked whether the extent of conservation of CTCF binding sites in cell types, as observed in ENCODE ChIP-seq experiments, was indicative of the rate of localised mutagenesis at these sites. CTCF binding sites were grouped into five mutually exclusive bins of equal size based on the number of cell types where the sites were observed. Analysis of localised mutation rates in these sites was conducted as described above; findings were corrected for multiple testing and filtered to select significant findings (FDR < 0.05).
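The normalisation of per-nucleotide mutation counts described in this section amounts to a simple rescaling, shown here as a Python sketch. The function name `per_tumor_per_mb_rate` and the example numbers are hypothetical.

```python
def per_tumor_per_mb_rate(mutation_count: int, n_sites: int, n_tumors: int) -> float:
    """Average mutation rate per tumor and per megabase at one aligned position.

    mutation_count is the number of mutations observed at a given position
    relative to the site midpoint, pooled over all sites and tumors.
    """
    if n_sites <= 0 or n_tumors <= 0:
        raise ValueError("n_sites and n_tumors must be positive")
    return mutation_count * 1e6 / (n_sites * n_tumors)


# Hypothetical example: 50 mutations pooled over 1,000 sites in 100 tumors.
rate = per_tumor_per_mb_rate(50, n_sites=1000, n_tumors=100)
```

Dividing by both the number of sites and the number of tumors makes bins and cohorts of different sizes directly comparable.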
Down-sampling of open-chromatin sites to evaluate mRNA associations
To evaluate the mRNA associations of mutation rates in open-chromatin sites compared to TSSs, we performed a down-sampling analysis. The analysis was designed to check whether the statistical significance of mRNA associations in open-chromatin sites was systematically amplified due to the larger set of open-chromatin sites available for analysis. To this end, RM2 was used to evaluate randomly sampled subsets of open-chromatin sites in all the bins of sites grouped by mRNA abundance. For each bin, we sampled the number of open-chromatin sites that were observed in the equivalent bin of TSSs. The analysis was repeated for 100 random subsets of open-chromatin sites for each bin and median P-values and corresponding fold-changes of localised mutation rates were reported. A lenient cut-off was used to filter and visualise results (unadjusted P < 0.2).
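The down-sampling scheme can be sketched in Python with a placeholder for the model evaluation. The function `downsample_median_pvalue` and the callback `evaluate` are hypothetical: `evaluate` stands in for an RM2 run on a subset of sites and is assumed to return a single P-value.

```python
import random
from statistics import median
from typing import Callable, List, Sequence


def downsample_median_pvalue(
    sites: Sequence[int],
    n_target: int,
    evaluate: Callable[[List[int]], float],
    n_repeats: int = 100,
    seed: int = 0,
) -> float:
    """Median P-value over repeated random subsets of sites.

    Each repeat draws n_target sites without replacement (e.g., the number
    of TSSs in the equivalent mRNA abundance bin) and evaluates the subset.
    """
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    pvalues = [evaluate(rng.sample(list(sites), n_target)) for _ in range(n_repeats)]
    return median(pvalues)


# Toy stand-in for the model: the P-value shrinks with the subset size.
toy_evaluate = lambda subset: 1.0 / len(subset)
median_p = downsample_median_pvalue(range(10000), n_target=500, evaluate=toy_evaluate)
```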
Identifying pathways with regional mutation rates
We asked whether the localised mutation rates of TSSs significantly affected genes in specific biological processes and pathways. We repurposed the RM2 model to analyse TSSs of gene sets corresponding to biological processes of Gene Ontology 62 and molecular pathways of the Reactome database 63. Gene sets were derived from the g:Profiler 64 web server (March 3rd 2020) and subsequently filtered to include 1,871 gene sets with 100 to 1,000 genes. Pathway analysis of localised mutation rates was conducted separately for each cancer type, results were corrected for multiple testing using the Benjamini-Hochberg FDR procedure separately for every cancer type and filtered for statistical significance (FDR < 0.1). We chose the less stringent significance filter since the mutation rate analysis of functional gene sets was relatively less powered given that fewer sites were considered. The pathways with significant TSS-specific mutation rates were visualised as an enrichment map 65 in Cytoscape and major biological themes were manually curated as described earlier 45. Nodes in the enrichment map were painted to reflect cancer types where these pathway enrichments were detected following the custom color scheme of the PCAWG project.
Associating regional mutation rates with driver mutations and recurrent copy-number alterations
We asked whether the localised mutation rates in CTCF binding sites, TSSs and open-chromatin sites were associated with driver mutations (i.e., SNVs, indels) or recurrent copy-number alterations (CNAs) in cancer genomes. First we collected a high-confidence set of driver mutations and CNAs in the PCAWG cohort. Driver mutations in exons of protein-coding genes were predicted for each selected cancer type using the ActiveDriverWGS method 34. We used the PCAWG variant calls after filtering tumors as described above, corrected the results for multiple testing using the Benjamini-Hochberg FDR procedure and selected significant driver genes (FDR < 0.05). FDR correction was conducted separately for each cancer type across the pooled set of protein-coding and non-coding genes. Tumors with and without SNVs or indels in predicted driver genes were used for localised mutation rate analysis. Predictions of recurrent CNAs were derived from the pan-cancer dataset of GISTIC2 calls of the PCAWG project 49. All CNA lesions at 95% confidence scores were considered and amplifications and deletions were analysed separately. High-confidence CNA events were used (GISTIC2 score = 2). Tumors with and without CNAs in the recurrently altered regions as defined by GISTIC2 were used for localised mutation rate analysis. Each cancer type was considered separately. Next we filtered very frequent and infrequent drivers and CNA events to improve the power of the RM2 analysis. We selected the driver genes and CNA regions with at least 25 tumors in the mutated (or copy-number altered) group of tumors and filtered very frequent drivers and CNAs affecting more than 2/3 of the cohort. Each driver gene and recurrent CNA locus in each cancer type was then analysed for associations with localised mutation rates in the three categories of genomic elements (open-chromatin sites, CTCF binding sites, TSSs). 
The binary cofactor in RM2 was used to indicate the mutated or wildtype status of the tumor with respect to the given recurrent genetic event. We first computed the significance of site-specific localised mutation rates given the presence or absence of driver gene mutations or recurrent CNAs. All combined RM2 results of driver gene mutations, cancer types and genomic sites were adjusted for multiple testing using the Benjamini-Hochberg FDR procedure and significant results were selected (FDR < 0.05). We then conducted an additional likelihood ratio test to evaluate the significance of the interaction between localised mutation rates and the presence of driver mutations and filtered the results to only include positive and significant interactions (unadjusted P < 0.05, main and interaction coefficients > 0). To validate and visualise the detected interactions, we separately analysed individual groups of tumors defined by the presence or absence of driver mutations and CNAs using RM2, and compared the resulting FDR values and fold-changes.
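The two filtering rules of this analysis (event frequency and interaction significance) can be expressed as simple predicates. The Python sketch below uses hypothetical function names and treats the thresholds stated above as default arguments.

```python
def keep_event(n_altered: int, n_cohort: int, min_altered: int = 25) -> bool:
    """Keep a driver gene or CNA region with enough altered tumors that
    does not affect more than 2/3 of the cohort."""
    # Integer comparison avoids rounding issues with the fraction 2/3.
    return n_altered >= min_altered and 3 * n_altered <= 2 * n_cohort


def keep_interaction(
    site_fdr: float,
    interaction_p: float,
    site_coef: float,
    interaction_coef: float,
) -> bool:
    """Keep results with a significant site effect (FDR < 0.05) and a
    positive, significant interaction (unadjusted P < 0.05)."""
    return (
        site_fdr < 0.05
        and interaction_p < 0.05
        and site_coef > 0
        and interaction_coef > 0
    )


# Hypothetical example: an event found in 40 of 100 tumors is retained.
retained = keep_event(n_altered=40, n_cohort=100)
```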
Associating CNAs with mRNA abundance
To evaluate the functional role of CNAs associated with localised mutation rates, we compared mRNA abundance levels of genes in the CNA loci in groups of tumors defined by the presence or absence of the CNA events, using matching RNA-seq data available in PCAWG 44. Genes in CNA loci were retrieved from the PCAWG GISTIC2 dataset and genes with low mRNA abundance were removed (median FPKM-UQ < 1). mRNA abundance levels of genes in CN amplified and non-amplified (i.e., balanced and deleted) tumors were compared using the nonparametric Wilcoxon test. One-sided tests were used, assuming that change in mRNA abundance would match the underlying copy-number change (i.e., copy number amplifications were tested for increase in mRNA abundance of matching genes). Results were adjusted for multiple testing using the Benjamini-Hochberg FDR procedure and significant results were selected (FDR < 0.05). To confirm the CNAs, we retrieved the consensus dataset of CNA calls in each tumor from the PCAWG study 49 and visualised the detected CNA segments normalised by tumor ploidy predictions in subsets of tumors defined by the presence or absence of these CNA events. Known cancer genes of the COSMIC Cancer Gene Census database 66 (v91, downloaded 14.05.2020) were identified among the genes with mRNA/CNA associations.
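A one-sided Wilcoxon rank-sum comparison of mRNA abundance between amplified and non-amplified tumors can be sketched as follows. This Python fragment uses the normal approximation with a tie correction and no continuity correction, which is an assumption; it may differ slightly from the exact test variant used in the study. The function name `rank_sum_greater` and the example values are hypothetical.

```python
import math
from typing import Sequence


def rank_sum_greater(x: Sequence[float], y: Sequence[float]) -> float:
    """One-sided P-value for values in x tending to be larger than values in y."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term = 0.0, 0.0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0      # mean of ranks i+1 .. j
        t = j - i                          # size of the tie group
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0  # Mann-Whitney U statistic of x
    mean = n1 * n2 / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (u - mean) / math.sqrt(var)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal


# Hypothetical FPKM-UQ values of one gene in amplified vs. non-amplified tumors.
p_value = rank_sum_greater([46, 52, 39, 61, 48], [27, 30, 22, 35, 25])
```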
Method benchmarking and power analysis
We evaluated the performance of our method and statistical power using simulated variant calls, different parametrizations and down-sampling of input datasets. First, to evaluate method calibration and false-positive rates, we performed a systematic analysis of open-chromatin sites, TSSs, and CTCF binding sites in a comparable set of simulated variant calls from PCAWG. This simulated variant set was derived earlier from the same set of tumor genomes using trinucleotide-informed shuffling of mutations 4. Simulated variant calls were analysed similarly to true variant calls for total mutation counts, reference and alternative nucleotide combinations, predicted mutational signatures and transcription and mutation strand properties. Results from RM2 were adjusted for multiple testing using the Benjamini-Hochberg FDR procedure separately for results derived from true and simulated variant calls. As expected, simulated variant calls revealed no statistically significant results of localised mutation rates in any cancer type, site type or mutation subset (FDR < 0.05). We then visualised the distribution of log-transformed P-values derived from true and simulated variant calls using quantile-quantile (QQ) plots and found that highly significant P-values were detected in true datasets while the P-values derived from simulated variant calls were uniformly distributed as expected. These analyses show that our model is well-calibrated and is not subject to inflated false-positive findings. Second, to evaluate the statistical power of RM2, we performed systematic down-sampling by randomly selecting subsets of sites and tumors for localised mutation rate analysis. We focused on the PCAWG liver hepatocellular carcinoma (Liver-HCC) cohort of 300 samples and CTCF sites. A series of down-sampling configurations were used (2000, 5000, …, 100,000 sites sampled; 25, 50, …, 300 genomes sampled). Each configuration was tested 100 times in different subsets of data. 
For a power analysis, we evaluated the fraction of runs that revealed a significant enrichment of somatic mutations at CTCF sites (P < 0.05) and the median P-value of these 100 runs. Third, we evaluated the parameter values of RM2 that determine the genomic width of each site and the control regions of upstream and downstream flanking sequences. As expected, site-specific mutation rates were consistently identified at multiple values of the width parameter for each class of site (open-chromatin sites, CTCF binding sites, TSSs), indicating robustness of our analysis to different parameter values. However, different site classes showed preferences towards shorter sites (CTCF binding sites: 20-100 bps) or longer sites (open-chromatin sites and TSS: 200-800 bps), likely due to different underlying mutational processes. For the entire study, the optimal genomic size of every site class was selected based on the strongest effect size and significance across multiple cancer types. The value of 50 bps (±25 bps) was selected for CTCF sites. For open-chromatin sites and TSSs, we selected the common site width of 200 bps (±100 bps) that showed strong effects in both TSSs and open-chromatin sites, to increase comparability of the two classes.
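The summary statistics of the power analysis can be computed per down-sampling configuration as sketched below. The function name `power_summary` and the example P-values are hypothetical; each list of P-values is assumed to come from the repeated runs of one configuration of sites and genomes.

```python
from statistics import median
from typing import List, Tuple


def power_summary(pvalues: List[float], alpha: float = 0.05) -> Tuple[float, float]:
    """Return (fraction of runs with P < alpha, median P-value) for one configuration."""
    if not pvalues:
        raise ValueError("at least one P-value is required")
    fraction_significant = sum(p < alpha for p in pvalues) / len(pvalues)
    return fraction_significant, median(pvalues)


# Hypothetical P-values from four runs of one down-sampling configuration.
power, median_p = power_summary([0.01, 0.2, 0.03, 0.5])
```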
Author contributions
J.R., C.A.L. and D.A.R. designed and implemented the method. J.R. and C.A.L. analysed the data. J.R. wrote the manuscript with significant input from C.A.L. and D.A.R. J.R. conceived and supervised the project. All authors reviewed and edited the manuscript and approved the final version.
Acknowledgments
This work was supported by the Canadian Institutes of Health Research (CIHR) Project Grant to J.R. and the Investigator Award to J.R. from the Ontario Institute for Cancer Research (OICR). C.A.L. was partially supported by a Graduate Student Fellowship from the Department of Medical Biophysics, University of Toronto. Funding to OICR is provided by the Government of Ontario.