Chromatin accessibility of primary human cancers ties regional mutational processes with tissues of origin

Regional mutagenesis in cancer genomes associates with DNA replication timing (RT) and chromatin accessibility (CA) of normal cells; however, human cancer epigenomes remain uncharacterized in this context. Here we model megabase-scale mutation frequencies in 2,517 cancer genomes with 773 CA and RT profiles of cancers and normal cells. We find that CA profiles of matching cancers, rather than those of normal cells, predict regional mutagenesis and mutational signatures, indicating that most passenger mutations follow the epigenetic landscapes of transformed cells. Carcinogen-induced and unannotated signatures show the strongest associations with epigenomes. Associations with normal cells in melanomas, lymphomas and SBS1 signatures suggest earlier occurrence of mutations in cancer evolution. Frequently mutated regions unexplained by CA and RT are enriched in cancer genes and developmental pathways, reflecting contributions of localized mutagenesis and positive selection. These results underline the complex interplay of mutational processes, genome function and evolution in cancer and tissues of origin.


INTRODUCTION
The cancer genome is a footprint of its evolution and molecular environment that is shaped by somatic mutations, such as single nucleotide variants (SNVs) and structural alterations (1,2). While a minority of mutations called drivers confer selective advantages on cells (3-5), most mutations are considered functionally neutral passengers that are caused by diverse mutational processes (6-8). Somatic mutagenesis and positive selection of known cancer genes also affect normal tissues (9,10). Characterizing the landscape of somatic mutations helps to understand the underlying mutational processes and to better evaluate the functional consequences of mutations and their roles in cancer etiology and evolution.

Processes of somatic mutagenesis act at different scales of the genome (11,12). At the trinucleotide resolution, mutational signatures of SNVs are associated with endogenous and exogenous processes related to aging, carcinogen exposures, DNA repair deficiencies, and cancer therapies (6,13). At the local resolution of 100-1,000 bp, non-coding genomic elements, such as transcription start sites and binding sites of CTCF, are enriched in mutations (14-16). However, the precise molecular mechanisms driving these mutational processes remain uncharacterized. At the regional, megabase-scale resolution of the genome, variation in mutation frequencies shows a complex interplay of DNA replication timing (RT), chromatin accessibility (CA) and transcriptional activity (17-19). Early-replicating, transcriptionally active regions of open chromatin have fewer mutations than late-replicating, passive regions of heterochromatin, potentially due to increased error rates and decreased mismatch repair later in DNA replication (20-23). Mutational signatures are distributed asymmetrically with respect to DNA replication origins and timing (24). Regional mutagenesis has been associated with epigenetic information of related normal cells, providing evidence of cells of cancer origin contributing to somatic variation (25) and allowing classification of cancers of unknown origin (26). However, only CA and RT profiles of common cell lines and normal tissues have been used to characterize regional mutational processes, while the epigenetic landscapes of primary human cancers remain unexplored.

To decipher regional mutational processes in the context of cancer epigenomes, we analyzed a large and diverse collection of CA and RT profiles of cancers, normal tissues, and cell lines as predictors of regional mutagenesis in thousands of whole cancer genomes using machine learning. CA profiles of matching cancer types, rather than normal tissues, appear as determinants of regional mutagenesis and mutational signatures. We found tissue-of-origin effects of CA and RT in most predictions, bespoke deviations in specific cancer types and mutational signatures, and a convergence of excess mutations to developmental and cancer pathways. Together, these results underline the spatial and temporal complexity of regional mutagenesis in cancer genomes.

RESULTS
Chromatin accessibility of primary cancers is a major determinant of regional mutagenesis
To evaluate the associations of CA and RT with regional mutagenesis in cancer genomes, we analyzed somatic variant calls of whole cancer genomes, 677 CA profiles of primary human cancers, normal tissues and cell lines, and RT profiles of 16 cell lines in 6 cell cycle phases using the random forest framework (Figure 1) [...] Figures 1-2). Focusing on 2,465 mappable one-megabase regions, we derived somatic SNV counts for the pan-cancer dataset and 25 cancer types of the largest cohorts. We processed 773 CA and RT profiles of primary cancers, normal tissues, and cell lines as mean genomic signals per megabase (Figure 1B). To map the complex non-linear associations of CA and RT with regional mutagenesis, random forest regression models were trained with megabase-scale mutation frequencies as outcomes and CA and RT profiles as predictors (i.e., features) (Figure 1C). The most informative predictors were quantified statistically and using local prioritization methods (31) (Figure 1D). As expected, late RT profiles inversely correlated with CA profiles, and regional mutagenesis profiles clustered according to cancer types (Supplementary Figures 3-4).

Given the diverse collection of epigenomic profiles, we asked whether CA profiles of cancers (n = 383) or normal cells and tissues (n = 244) were more informative of regional mutagenesis in cancer genomes. We predicted regional mutation frequencies using random forests in two Monte-Carlo cross-validation experiments with matched data splits, where the predictors included either cancer CA profiles or normal CA profiles, respectively. RT profiles were also included in both models to focus on the relative contributions of CA profiles. We found that CA profiles of cancers were more accurate predictors of regional mutagenesis in 19/25 cancer types (empirical P < 0.01) (Figure 2A). The strongest signal was observed in breast cancer, where the predictions informed by cancer CA profiles were nearly twice as accurate as the models informed by normal tissue CA (median adj.R² 0.69 vs. 0.36; P < 0.001) (Figure 2B). Stronger associations of cancer CA profiles and regional mutagenesis were also found in cancers of the prostate, ovary, uterus, kidney, and pancreas (∆adj.R² > 0.1; P < 0.001) and in the pooled pan-cancer set of 37 cancer types (adj.R² 0.90 vs. 0.87; P < 0.001).

A few exceptions were also apparent. In melanoma, models utilizing CA profiles of normal tissues were significantly more accurate (adj.R² 0.69 for normal CA vs. 0.65 for cancer CA; P = 0.004). The high somatic mutation burden of normal skin cells due to long-term ultraviolet light exposure (9) appears consistent with the model prediction that a fraction of mutations in melanomas is distributed according to CA profiles of normal tissues. In medulloblastoma, CA profiles of normal tissues were also more predictive of regional mutagenesis (adj.R² = 0.45 for normal CA vs. adj.R² = 0.36 for cancer CA; P = 0.001), potentially explained by the developmental origin of this pediatric brain cancer (32). CA profiles of normal tissues also improved prediction accuracy in B-cell non-Hodgkin's lymphoma (BNHL) and chronic lymphocytic leukemia (CLL) (P < 0.01).

Overall model accuracy was partially explained by genome-wide mutation burden of cancer types (Spearman rho = 0.61, P = 0.0011) but not by cohort size (rho = 0.22, P = 0.28) (Supplementary Figure 5). In summary, regional mutagenesis is more strongly associated with CA of primary human cancers than with CA of normal tissues and cell lines in most cancer types, indicating that most somatic mutations occur after the cells have acquired the epigenetic characteristics of cancer cells.
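
The comparison above rests on two ingredients that can be sketched compactly: an accuracy metric that penalizes model size, and data splits that are shared by the two competing models. The following Python sketch is illustrative only and is not the analysis code of this study. It assumes the standard definition of adjusted R² and uses hypothetical helper names (`adjusted_r2`, `matched_splits`); fitting the random forests themselves is left out.

```python
import random


def adjusted_r2(observed, predicted, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Assumption: this is the textbook definition; the exact variant used
    in the study is not stated in the text.
    """
    n = len(observed)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    mean_obs = sum(observed) / n
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant")
    ss_res = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


def matched_splits(n_regions, n_iter=1000, train_frac=0.8, seed=0):
    """Yield (train_idx, validation_idx) pairs of region indices.

    The same sequence of pairs is meant to be reused for both models
    (cancer CA vs. normal CA) so that their accuracies are paired.
    """
    rng = random.Random(seed)
    n_train = int(round(train_frac * n_regions))
    for _ in range(n_iter):
        idx = list(range(n_regions))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]
```

Reusing each (training, validation) pair for both models means that the per-split difference in adj.R² reflects the choice of predictors and not the choice of genomic regions.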

Top predictors of regional mutagenesis match cancer types and sites of origin
To interpret regional mutagenesis through cancer tissues of origin, we asked which CA and RT profiles contributed the most to the predictive models. We included all 773 profiles as predictors and analyzed 14 cancer types for which profiles of primary cancers and relevant normal tissues were available. We selected the five most significant predictors for each cancer type (P < 0.001) and quantified these using Shapley Additive exPlanation (SHAP) scores (31) that reflect associations with mutation burden accumulated across genomic regions. SHAP scores were negatively correlated with CA of cancers and normal tissues (rho = -0.75; P < 10⁻¹⁶) (Figure 3A). Late-replicating regions were positively correlated with regional mutagenesis (rho = 0.77, P < 10⁻¹⁶) while early-replicating regions showed a less-variable negative correlation (rho = -0.78, P < 10⁻¹⁶). This inverse relationship of CA and RT with respect to regional mutagenesis is consistent with previous studies (17-23); however, it is extended here to a diverse collection of CA and RT profiles of primary cancers, normal tissues, and cell lines. Non-linear associations of regional mutagenesis and its epigenomic predictors are apparent in individual cancer types (Supplementary Figure 6). This analysis underlines the complex interactions of regional mutagenesis with CA and RT in this pan-cancer cohort and warrants detailed analysis of individual predictors.

We examined the top predictors of regional mutagenesis. CA profiles of matching cancers were among the strongest predictors in eight of 14 cancer types including liver, breast, kidney, stomach, and colorectal cancers (P < 0.001), emphasizing tissue-of-origin associations (Figure 3B). For example, regional mutagenesis in breast cancer showed positive and negative associations with four CA profiles of breast cancers of the TCGA dataset (Figure 3C). Additional associations appeared at the level of organ systems, as CA profiles of stomach and colorectal cancers were the top predictors of regional mutagenesis in colorectal, stomach, biliary and esophageal cancers, suggesting similarities of mutational processes or epigenomes of the gastrointestinal tract. Interestingly, regional mutagenesis in lung adenocarcinomas was also explained by CA profiles of stomach and lung adenocarcinomas. Overall, matched cancer-specific CA profiles showed stronger associations with regional mutagenesis than profiles of normal cells.

CA profiles of matching normal tissues associated with regional mutagenesis in five cancer types. Mutations in melanoma were predicted by two CA profiles of normal melanocytes and three profiles of melanomas, and a three-fold higher feature importance score was assigned to the normal tissue (incMSE 1.2 × 10⁵ vs. 3.4 × 10⁴) (Figure 3B). Accessible chromatin of cancers and normal melanocytes was relatively depleted in mutations according to SHAP analysis (Figure 3D). This is consistent with the shaping of melanoma genomes through the chromatin landscape of normal melanocytes (9) earlier in cancer evolution. CA profiles of normal B-cells were found as predictors in the lymphoid cancers BNHL and CLL. Somatic hypermutation (SHM) of immunoglobulin genes in normal B-cells and aberrant SHM in lymphomas (33) potentially explain this association. In glioblastoma (GBM), CA profiles of neuronal tissues (hippocampus, astrocytes, spinal cord) as well as lower-grade gliomas were selected as top features (Figure 3E). The mixture of CA of cancers and normal neural cells predictive of regional mutagenesis of GBM may reflect its extensive intratumoral heterogeneity and proposed origin in stem-like cells (34).

In most cases, CA negatively correlated with regional mutagenesis according to SHAP scores, both in CA profiles of primary cancers, such as liver and breast cancer, and in related normal tissues, such as melanocytes in melanoma and astrocytes in glioma (Figure 3D-E). However, associations of high CA and increased mutation burden were also apparent. In breast cancer, the two most predictive CA profiles showed positive SHAP scores in highly accessible genomic regions, indicating the activity of a mutational process targeting open chromatin (Figure 3C, Supplementary Figure 6). Thus, deconvoluting the bulk profiles of megabase-scale mutation burden helps map interactions of regional mutagenesis with CA and RT.

RT profiles were the major predictors of regional mutagenesis in six cancer types. Mutations in lung and head squamous cell carcinomas (SCC) associated with RT profiles of the squamous cell line of normal human epidermal keratinocytes (NHEK) (Figure 3B,F). The squamous cell association indicates cell-of-origin patterns of regional mutagenesis, while the association with normal cells may reflect mutagenesis earlier in cancer evolution, potentially through the tobacco signature SBS4 that represents 44% and 12% of SNVs in the Lung-SCC and Head-SCC cohorts, respectively. Similarly, RT profiles of lymphoblastoid cell lines were among the top predictors in CLL and BNHL. Tissue-specific RT profiles of the cancer cell lines MCF-7 and HepG2 were found in breast and liver cancers, respectively. Most RT predictors (13/16) represented the late-replicating cell cycle phases G2 and S4. Individual RT profiles positively associated with mutagenesis in late-replicating regions (e.g., phase G2 of MCF-7 in breast cancer) and negatively in early-replicating regions (e.g., phase S1 of NHEK in head SCC) (Figure 3C,F), consistent with earlier observations that elevated regional mutagenesis is an effect of increased DNA damage and decreased repair in late replication (20). However, RT profiles were generally underrepresented among top predictors compared to CA profiles. This is likely because fewer and less-diverse RT profiles of cell lines offer only a limited representation of mutational processes in diverse cancer genomes, while CA profiles of primary human cancers provide complementary information. Together, this analysis extends our findings of tissue-specific CA and RT profiles as the principal predictors of regional mutagenesis and underlines cell-of-origin effects and cancer heterogeneity.
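
The rank correlations reported in this section (for example, between SHAP scores and CA signal across genomic regions) can be illustrated with a small, dependency-free Python sketch of Spearman's rho. This is a generic implementation with average ranks for ties, not the code used in the study, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Because the statistic depends only on ranks, it captures monotone but non-linear relationships, which suits the non-linear associations between epigenomic signals and mutation burden described above.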

Associations of mutational signatures with chromatin accessibility and replication timing
We asked whether the associations of regional mutagenesis with CA and RT can be further explained by mutational signatures. We assigned each SNV to its most probable single base substitution (SBS) signature (6) and predicted the regional distributions of signatures using 773 CA and RT profiles. First, we compared the accuracy values of random forest models in predicting six classes of mutational signatures based on etiology: two age-related classes (SBS1 and SBS5 [...]

Signature SBS1 deviated from broad CA-driven patterns of regional mutagenesis in several cancer types. In kidney cancer, SBS1 mutations associated with CA profiles of normal tissues such as renal cortex epithelium and kidney glomerulus as well as extra-adrenal pheochromocytoma (PCPG), a rare endocrine cancer. In contrast, SBS5 mutations and others predominantly associated with CA of kidney cancers (Figure 4C). Similar effects were observed in breast, colorectal and stomach cancers: SBS1 mutations associated with CA profiles of normal tissues and unrelated cancers, while other signatures associated with CA profiles of matching cancers. However, SBS1 mutations were predicted less accurately than other signatures, potentially due to their overall lower frequency (Supplementary Figure 7). Interestingly, an inverse relationship was observed in GBM that may reflect its intratumoral heterogeneity and stem cell origins: the top predictors of SBS1 included CA profiles of four lower-grade gliomas (LGG), while the normal tissue profiles of hippocampus, astrocytes, and spinal cord primarily associated with the clock-like signatures SBS5 and SBS40 (Figure 4D). The clock-like SBS1 signature of 5-methylcytosine deamination is associated with cancer patient age and stem cell division rate and affects the somatic genomes of normal tissues and adult stem cells (10,35,36). The association of SBS1 with the epigenomes of normal tissues suggests that SBS1 mutations in cancer genomes represent a footprint of earlier cancer evolution or somatic mutagenesis in normal cells of cancer origin.

We asked whether the mutations of specific signatures were enriched or under-represented in regions of open chromatin. While mutational signatures were generally negatively associated with CA in accordance with bulk mutations, positive associations were also apparent. In breast cancer, SBS13 mutations of APOBEC/AID activity positively associated with high CA scores (Figure 4E), in agreement with the observations that AID targeting of epigenetically active elements results in kataegis and clustered mutational signatures (6,37,38). As another example, SBS1 mutations in kidney cancer positively associated with the CA profile of the PCPG cancer, while negative associations with CA were apparent in other signatures and CA profiles (Figure 4C). In summary, this analysis highlights the complex interactions of CA and RT with regional mutagenesis and cancer heterogeneity and helps characterize the mechanisms of mutational processes.
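
The signature-level analysis starts from two simple bookkeeping steps: assigning each SNV to its most probable signature, and keeping only signatures that are frequent enough to model (at least 10,000 mutations and at least 5% of the total burden per cancer type, as stated in Methods). A minimal Python sketch of these steps follows. The data layout (one dictionary of signature probabilities per SNV) and the function names are assumptions made for illustration.

```python
from collections import Counter


def assign_top_signature(probabilities):
    """Return the signature with the highest probability for one SNV.

    `probabilities` maps signature names to probabilities. On exact ties,
    the first signature in the dictionary's iteration order is returned.
    """
    return max(probabilities, key=probabilities.get)


def select_signatures(assigned, min_count=10_000, min_fraction=0.05):
    """Keep signatures meeting both the count and the fraction threshold.

    `assigned` is a list with one signature label per SNV of a cancer type.
    """
    counts = Counter(assigned)
    total = sum(counts.values())
    return sorted(
        sig
        for sig, count in counts.items()
        if count >= min_count and count / total >= min_fraction
    )


# Toy example with invented labels and a lowered count threshold.
toy_labels = ["SBS5"] * 6 + ["SBS1"] * 3 + ["SBS13"] * 1
kept = select_signatures(toy_labels, min_count=2, min_fraction=0.05)
```

Assigning each SNV to a single top signature discards the uncertainty of the assignment, which may matter most for low-frequency signatures.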

Excess mutations unexplained by epigenomes converge to cancer genes and developmental pathways
To quantify the regional mutagenesis unexplained by CA and RT, we investigated the genomic regions that were enriched in mutations above the levels expected from epigenomes. To enable a gene-level functional analysis, we repeated the predictions of regional mutagenesis at a finer genomic resolution (100 kbp) and selected 1,330 regions in 14 cancer types that were significantly enriched in mutations based on the CA- and RT-informed model residuals (FDR < 0.05) (Figure 5A). While the mutation-enriched regions were largely tissue-specific with 86% detected in only one cancer type, hierarchical clustering of the regions by residuals was consistent with cancer types (Figure 5B). For example, lymphoid cancers, lung cancers and gastrointestinal cancers made up the three most distinct clusters. Thus, the regional mutational processes independent of CA and RT affect similar genomic regions in related cancer types.

We performed a functional analysis of the frequently mutated genomic regions, hypothesizing that these could be characterized by pathways and genes involved in cancer. The regions encoded 730 protein-coding genes including 61 known cancer genes (39), significantly more than expected by chance (27 expected, Fisher's exact P = 1.9 × 10⁻⁹) (Figure 5B). Most driver genes were only found in single cancer types and represented key disease-specific drivers such as EGFR and TERT in glioma, MYC in BNHL, PIK3CA in breast cancer and APC in colorectal cancer (Supplementary Figure 8). As an exception, one genomic window recurrently mutated in 11 cancer types includes the interferon regulatory factor and oncogene IRF4 (40), the phosphatase DUSP22 recently suggested as a network-implicated driver gene due to non-coding mutations (41), and super-enhancers of immune cells (42), indicating a potential pan-cancer region of interest (Supplementary Figure 9).

We then asked whether the frequently mutated regions were associated with common biological functions by prioritizing pan-cancer signals of mutation enrichment using the integrative ActivePathways method (43). The analysis revealed 220 significantly enriched pathways (FDR < 0.05), of which 162 (74%) were detected in more than one cancer type (Figure 5C). Developmental processes including the nervous system, heart and kidney, stem cell development and morphogenesis were prominently represented together with cancer hallmark processes such as cell cycle, apoptosis, cell adhesion, hypoxia response, and MAPK, EGF and FGF signalling pathways. Processes of the immune system, stress response, reproduction and hormone regulation were also apparent. Enriched mutations converged to similar pathways and processes across multiple cancer types although most genomic regions were only detected in one or a few cancer types. Convergence of these excess mutations to developmental and cancer pathways is potentially explained by further mutational processes targeting active regions of the genome, while the enrichment of known cancer driver genes suggests that positive selection of functional mutations may also contribute to this additional mutation burden. This analysis exemplifies the complex interplay of multi-scale mutational processes and genome function.
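
The enrichment of known cancer genes among the genes in mutation-enriched regions was evaluated with a Fisher's exact test. The Python sketch below shows a one-sided version of that test as a hypergeometric upper tail. It is a generic illustration: the text does not state whether the reported test was one- or two-sided or which background gene set was used, so the function name, the argument layout, and the toy counts in the usage line are all assumptions and do not reproduce the reported P-value.

```python
from math import comb


def fisher_enrichment_p(hits_in_set, set_size, hits_in_background, background_size):
    """One-sided Fisher's exact P-value for over-representation.

    Probability of drawing at least `hits_in_set` annotated genes when
    `set_size` genes are sampled without replacement from a background of
    `background_size` genes, of which `hits_in_background` are annotated.
    """
    max_hits = min(set_size, hits_in_background)
    denom = comb(background_size, set_size)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, set_size - k)
        for k in range(hits_in_set, max_hits + 1)
    )
    return tail / denom


# Toy 2x2 example with invented counts: 3 of 4 selected genes are
# annotated, out of 4 annotated genes in a background of 8.
p_toy = fisher_enrichment_p(3, 4, 4, 8)
```

Exact integer arithmetic with `math.comb` avoids rounding issues in the tail sum, at the cost of speed for very large gene sets.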

DISCUSSION
Our analysis highlights chromatin accessibility of primary human cancers as a major covariate of regional mutational processes, supported by tissue-of-origin associations between whole cancer genomes and epigenomes of matching cancer types. These observations are apparent in several common cancer types of the largest global burden. Cancers such as melanoma and lymphoma, where normal tissue epigenomes are highly predictive of regional mutagenesis, have an etiology consistent with early somatic mutagenesis in normal tissues of origin. Combined associations with normal and cancer epigenomes as observed in GBM may also reflect intratumoral heterogeneity of cell populations and regional mutagenesis. These findings extend earlier studies that used the epigenetic profiles of cell lines and normal tissues to characterize mutational processes. Overall, this analysis suggests that in most cancer types, the megabase-scale landscape of passenger mutations is primarily shaped later in cancer evolution, following the epigenetic transformation to cancer cells.

Replication timing information also associated with regional mutagenesis and confirmed strong effects with cell types related to cancer origin. However, CA profiles of primary human cancers evidently captured a larger fraction of variation of regional mutagenesis compared to RT profiles, apart from squamous cell cancers that strongly associated with relevant cell lines. RT profiles make up a smaller subset of epigenomic predictors in our dataset and include mitotic cell lines that offer only limited representation of the diverse disease types in the pan-cancer cohort. Interestingly, DNA replication has been shown to determine chromatin state (44). Thus, the informative CA profiles of human cancers may represent a proxy of cancer-specific replication dynamics.

Mutational signature analysis revealed interactions of mutational processes with CA and tissues of origin. Carcinogen signatures, as well as signatures of unknown etiology, were overall better predicted by CA and RT, in contrast to signatures of aging and DNA damage where the genome-wide predictions were less accurate. The stronger association of carcinogen signatures suggests that the chromatin environment interacts with DNA damage or repair processes of carcinogen exposure, for example through elevated mutational processes targeting active genes that are otherwise protected from mutations through error-free mismatch repair (38). Early-replicating regions in cells exposed to tobacco mutagens show elevated mutagenesis in transcribed strands due to differential nucleotide excision repair activity (45). Based on their stronger interactions with RT and CA profiles, we extrapolate that some mutational signatures of currently unknown etiology may relate to carcinogens. SBS17a/b mutations show some of the strongest interactions with CA and RT in stomach and esophageal cancers in our analysis. This signature is currently of unknown cause; however, it has been linked to gastric acid reflux and reactive oxygen species (24). Further integrative analysis of clinical and lifestyle information with patterns of regional mutagenesis may shed light on these mutational processes.

Mutations of signature SBS1 associated with CA profiles of relevant normal tissues in multiple cancer types, in contrast to other signatures that were often associated with cancer epigenomes. SBS1 mutations follow a clock-like pattern whose frequency in cancer genomes correlates with patient age and stem cell replication rates (35). SBS1 mutations contribute to the somatic variation landscapes of normal adult tissues (46) and of embryonic development (47). Based on our data, we speculate that SBS1 mutations in cancer genomes represent a footprint of early cancer evolution or somatic variation of normal cells that occurs prior to the acquisition of cancer-specific epigenetic profiles.

We observed a functional convergence to developmental processes and cancer-related pathways in the genomic regions where CA and RT profiles insufficiently captured elevated mutation burden. These data suggest that additional mutational processes affect lineage-specific developmental genes and open-chromatin regions that are distinct in individual cancer types, yet map to the same molecular pathways across cancer types. For example, transcription start sites of highly expressed genes and constitutively bound binding sites of CTCF are subject to elevated local mutagenesis in multiple cancer types (16) [...] and pathways in our data suggests that some mutations unexplained by CA and RT are functional in cancer and their frequent occurrence at specific genes, non-coding elements and molecular pathways is explained by positive selection (3-5,41). Further study of these regions may deepen our understanding of mutational processes and refine the catalogues of driver mutations.

This approach enables future studies to decipher the mechanisms and phenotypic associations of mutational processes. Clinical, genetic, and epigenetic profiles of cancer patients can be integrated to understand how regional mutational processes and the chromatin landscape are modulated by clinical variables such as stage, grade or the therapies applied, genetic features such as somatic driver mutations or inherited cancer risk variants, or lifestyle choices such as tobacco or alcohol consumption. Complementary insights from sub-clonal reconstruction analysis of cancer genomes (2,49), as well as single-cell sequencing of genomes and epigenomes, will allow mapping of regional mutagenesis at the level of distinct cell populations contributing to temporal and spatial variation in mutational processes. As such multimodal datasets grow, we can learn about early cancer evolution by comparing regional mutagenesis in the genomes of cancers and normal cells. Understanding the molecular and genetic determinants of regional mutagenesis and signatures in cancer genomes may help characterize carcinogen exposures and genetic predisposition, ultimately enhancing early cancer detection and prevention in the future.

METHODS
Somatic mutations in whole cancer genomes. Somatic single nucleotide variants (SNVs) of 2,583 whole cancer genomes were derived from the Pan-cancer Analysis of Whole Genomes (PCAWG) project (1) and hypermutated tumors (n = 66) were removed, resulting in a dataset of 23.2 million SNVs in 2,517 cancer genomes. Indel mutations and all variants in sex chromosomes were excluded. We analyzed 25 cancer types with at least 30 genomes per cohort and the pan-cancer cohort of 37 cancer types (Supplementary Figure 1). Mutations were mapped to GRCh38 coordinates using LiftOver (50).

Chromatin accessibility (CA) and replication timing (RT [...]

Random forest regression. Regional mutation burden was modeled as a function of CA and RT profiles with random forest regression (52). Monte-Carlo cross-validation was used to evaluate model performance over 1,000 80/20% data splits for training and validation. We used the adjusted R² (adj.R²) metric of accuracy that measures the complexity-adjusted fraction of variance explained by the model.

CA of normal cells and cancers in regional mutagenesis. Two sets of random forest regression models with CA profiles of cancers and normal cells, respectively, were run in 1,000 Monte-Carlo 80/20% cross-validations using matched genomic regions for training and validation. Differences in adj.R² values of models informed by cancer CA profiles relative to models informed by normal tissue CA profiles were computed (∆adj.R²) with 95% confidence intervals. Scatterplots of mutation counts were derived from models trained with full data.

Feature analysis of individual CA and RT profiles. The increase in mean-squared-error (incMSE) metric was used to evaluate CA and RT profiles as predictors of regional mutagenesis. Significance and variation of incMSE were evaluated. Permutation tests were used to detect the profiles where the incMSE values significantly exceeded those derived from randomly shuffled model responses. Empirical P-values were computed for every profile and cancer type across 1,000 iterations and significant profiles were selected (P < 0.001). Bootstrap analysis of genomic regions over 1,000 iterations revealed the variation in incMSE values.

Local effects of CA and RT profiles on regional mutagenesis. The Shapley Additive exPlanation (SHAP) method (31) was used to evaluate the effects of profiles on mutagenesis in specific cancers and genomic regions. SHAP scores represent the importance and direction of each feature (i.e., a CA or RT profile) in predicting an observation (i.e., a genomic window). SHAP scores were computed separately for cancer types on models trained on full datasets.

Mutational signature analysis. SNVs were annotated to single base substitution (SBS) signatures from PCAWG (6) using top probabilities. In each cancer type, signatures with at least 10,000 mutations and at least 5% of total mutation burden were selected (Supplementary Figure 7). Random forest regression was conducted with evaluation of accuracy and feature analysis as described above. Signatures were grouped based on etiology according to the COSMIC database (v 3.2, downloaded March 2021). Adj.R² values for different groups of signatures were compared using ANOVA and F-tests where the average signature exposures in cancer types were used as covariates.

Selecting genomic regions with excess mutations beyond CA and RT predictions. Regional mutagenesis was predicted for 100-kbp genomic regions to enable more granular functional interpretation. Genomic regions were selected based on model residuals, i.e., where the observed mutation counts significantly exceeded model predictions. Residuals were Z-transformed and the one-tailed P-values were adjusted for multiple testing (FDR < 0.05). Regions with excess mutations were visualized as a heatmap with hierarchical clustering and correlation distance. Known cancer genes were derived from the Cancer Gene Census database (39) (downloaded November 26, 2020). Enrichment of cancer genes was evaluated with a Fisher's exact test.

Pathway enrichment analysis of highly mutated genomic regions. Integrative pathway enrichment analysis with the ActivePathways method (43) was used to find overrepresented pathways and prioritize genes. Genes were assigned the P-values of their genomic regions and were scored in ActivePathways such that genes in regions with mutation enrichments in multiple cancer types ranked higher. Significantly over-represented pathways (FDR < 0.05) were visualized as an enrichment map using standard protocols (53).

A detailed description of methods is available in Supplementary Materials.
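
The residual-based selection of mutation-enriched regions described above (Z-transformed residuals, one-tailed P-values, FDR correction) can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the study's code. It assumes that P-values come from the upper tail of a standard normal distribution, that the population standard deviation is used for the Z-transform, and that the FDR adjustment is the Benjamini-Hochberg procedure; the function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import mean, pstdev


def upper_tail_p(residuals):
    """Z-transform residuals and return one-tailed (upper) normal P-values.

    Residuals are observed minus predicted mutation counts per region, so
    small P-values flag regions with more mutations than predicted.
    """
    mu, sd = mean(residuals), pstdev(residuals)
    if sd == 0:
        raise ValueError("residuals are constant")
    return [0.5 * erfc(((r - mu) / sd) / sqrt(2.0)) for r in residuals]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enriched_regions(residuals, fdr=0.05):
    """Indices of regions whose adjusted one-tailed P-value is below `fdr`."""
    adjusted = benjamini_hochberg(upper_tail_p(residuals))
    return [i for i, q in enumerate(adjusted) if q < fdr]
```

A one-tailed test is used because only regions with an excess of mutations over the CA- and RT-informed prediction are of interest here.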