Universality of distribution of tumor mutation burden - a biomarker for the tumor evolution and disease risk

Cancers, resulting in uncontrolled cell proliferation, are driven by accumulation of somatic mutations. Genome-wide sequencing has produced a catalogue of millions of somatic mutations, which contain the evolutionary history of the cancers. However, the connection between mutation accumulation and disease development and risk is poorly understood. Here, we analyzed more than 1,200,000 mutations from 5,000 cancer patients with whole-exome sequencing, and discovered two novel signatures for 16 cancer types in The Cancer Genome Atlas (TCGA) database. A clock-like mutational process, a strong correlation between Tumor Mutation Burden (TMB) and the Patient Age at Diagnosis (PAD), is observed for cancers with low TMB (mean value less than 3 mutations per million base pairs) but is absent in cancers with high TMB. We also validate this finding using whole-genome sequencing data from more than 2,000 patients for 24 cancer types. Surprisingly, we discovered that the distribution of TMB is universal. At low TMB it exhibits a Gaussian distribution and transitions to a power law at high TMB. The differences in cancer risk between the sexes are also mainly driven by the disparity in mutation burden. The TMB variations, imprinted at the chromosome level, also reflect accumulation of mutation clusters within small chromosome segments in high TMB cancers. By analyzing the characteristics of mutations based on multi-region sequencing, we found that a combination of TMB and intratumor heterogeneity could be a potential biomarker for predicting patient survival and response to treatment.


Cancer is a complex disease caused by a gradual accumulation of somatic mutations, which occur through several distinct biological processes. Mutations could arise spontaneously during DNA replication [1], through exogenous sources, such as ultraviolet radiation, smoking, virus inflammation and carcinogens [2], or could be inherited [3]. During the past two decades, a vast amount of genomic data has been accumulated from patients with many cancers, thanks to developments in sequencing technology [4]. The distinct mutations contain the fingerprints of the cancer evolutionary history. The wealth of data has led to the development of mathematical models [5-8], which provide insights into biomarkers that are reasonable predictors of the efficacy of cancer therapies [9]. Our study is motivated by the

instance, it has been suggested that half or more of the genetic mutations in cancer cells are acquired before tumor initiation [10,11]. In contrast, analyses of colorectal cancer tumors [12] suggest that most of the mutations accumulate during the late stages of clonal expansion. In the cancer phylogenetic tree, representing tumor evolution, most mutations would appear in the trunk in the former case [10] while the branches would contain most of the mutations if

fore they acquired the initiating mutation. We also observe in Fig. 1 that the correlation between the TMB and patient age becomes weaker as the overall TMB increases. As noted above, the ρ value decreases from 0.9 to 0.7 as the overall mean value (averaged over all patients for different ages) of TMB increases from 0.4 mutations/Mb to 3.0 mutations/Mb.

High TMB and PAD are weakly correlated: As the overall TMB exceeds 3 mutations/Mb, there is no meaningful correlation between TMB and PAD (Fig. 2 and Fig. S2). Interestingly, the results in Fig. 2 show that the TMB even decreases in certain cancer types

is low (Fig. 1).
However, the mutation rate increases in high TMB cancers (Fig. 2). Then, the number of mutations $N_t$ is given by

$$N_t = \mu_1 T + \Delta\mu_1 (T - T_I), \tag{1}$$

where the first term is the AM in the normal replication process, and the second term arises due to the accelerated mutations generated during the tumor formation stage. Note that $T_I$ corresponds to the time at tumor initiation. Because the latency period $T - T_I$ for tumors from the initiation till diagnosis is likely to be similar for patients with the same type of cancer [10], the second term in Eq. (1) is likely to be a constant subject to only minor variations.

If $\Delta\mu_1 \gg \mu_1$, then the first term in Eq. (1) is negligible, which leads to the weak dependence of TMB on the patient age as found in Fig. 2. Another potential mechanism for the finding in Fig. 2 could be that catastrophic events (such as chromoplexy and chromothripsis) can lead to a burst of mutations that accumulate in tumors during a short period of time, as observed in some cancers [39,40].
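The argument above can be illustrated with a minimal numerical sketch, assuming Eq. (1) takes the form $N_t = \mu_1 T + \Delta\mu_1 (T - T_I)$ implied by the description of its two terms. Only the value $\mu_1 = 50$ mutations/year is taken from the text (it is quoted for LIHC); the accelerated rate of 5000 mutations/year, the 10-year latency, and the function names are hypothetical choices made purely for illustration.

```python
def total_mutations(age, mu1, delta_mu1, latency):
    """Mutation count at diagnosis under the assumed form of Eq. (1).

    age       : patient age at diagnosis, T (years)
    mu1       : background mutation rate (mutations/year)
    delta_mu1 : extra rate after tumor initiation (mutations/year)
    latency   : T - T_I, time from initiation to diagnosis (years)
    """
    return mu1 * age + delta_mu1 * latency


def relative_age_effect(age_young, age_old, mu1, delta_mu1, latency):
    """Fractional increase in mutation count between two ages at fixed latency."""
    n_young = total_mutations(age_young, mu1, delta_mu1, latency)
    n_old = total_mutations(age_old, mu1, delta_mu1, latency)
    return (n_old - n_young) / n_young


# Hypothetical parameters; only mu1 = 50 mutations/year appears in the text.
no_acceleration = relative_age_effect(40, 80, mu1=50, delta_mu1=0, latency=10)
strong_acceleration = relative_age_effect(40, 80, mu1=50, delta_mu1=5000, latency=10)

print(f"no acceleration:     {no_acceleration:.2f}")
print(f"strong acceleration: {strong_acceleration:.3f}")
```

With these illustrative numbers, doubling the age doubles the mutation count when there is no acceleration, but changes it by less than 4% when the accelerated term dominates, which is the weak age dependence described for Fig. 2.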

It is instructive to calculate the fraction of accumulated mutations before tumor initiation for a cancer with high TMB. For hepatocellular carcinoma (LIHC) the median age of patients at diagnosis is 61, so the number $N_I = \mu_1 T_I$ of AM before tumor initiation is less than $\mu_1 T \approx 3000$ (with rate $\mu_1 \approx 50$ mutations/year [12,25]). From Fig. 2B, which shows that the TMB is ≈4 mutations/Mb at the same age, and taking the genome length to be 3,000 Mb, the total number of mutations accumulated at age 61 is about 12,000 (4 × 3000). Thus, the fraction of AM before tumor initiation should be less than about 25%. We surmise that in cancers with high TMB (Fig. 2) most of the somatic mutations occur during the late stages of clonal expansion (see Table III). This conclusion agrees with the observations in colorectal cancers [12], which allows us to conclude that colorectal cancer must be a cancer type with high TMB (> 3 mutations/Mb).
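The LIHC estimate can be reproduced in a few lines. This is only a sketch of the arithmetic in the paragraph above; the function name is hypothetical, and all four inputs (50 mutations/year, age 61, 4 mutations/Mb, 3,000 Mb) are the values quoted in the text.

```python
def max_fraction_before_initiation(mu1, age, tmb_per_mb, genome_mb):
    """Upper bound on the fraction of mutations acquired before initiation.

    Since T_I < T, the pre-initiation count N_I = mu1 * T_I is bounded
    above by mu1 * T; the total count is TMB times the genome length.
    """
    n_initiation_max = mu1 * age          # upper bound on N_I
    n_total = tmb_per_mb * genome_mb      # total mutations at diagnosis
    return n_initiation_max / n_total


bound = max_fraction_before_initiation(mu1=50, age=61, tmb_per_mb=4, genome_mb=3000)
print(f"upper bound on fraction before initiation: {bound:.1%}")
```

Without rounding 50 × 61 down to 3000, the bound comes out at about 25.4%, consistent with the roughly 25% stated above.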

Extrinsic factors dominate mutation load in certain cancers: So far we have focused on somatic mutations arising from the intrinsic process due to DNA replication.

The tissues of several cancers, such as head and neck, stomach, and skin, are constantly

Eq. (1) by adding an additional term $N_{ext} = \mu_{ext} T$. As long as $\mu_{ext} \gg \Delta\mu_1$, a linear relation between TMB and patient age holds (see Fig. 3A-D and Fig. S3). In these cases, just as in Fig. 1, more mutations appear at the trunk of tumor phylogenetic trees (see Table II and Fig. 3E). The strong linear correlation, with high Pearson correlation coefficient in

In contrast, the majority of the mutations are expected to be present in low frequencies and appear at the branches and leaves of the tumor phylogenetic tree (see Fig. 3E) for the cancers shown in Fig. 2. This results in immune surveillance escape and drug resistance [12]. In support of this finding, we found that the most deadly cancers (ones with the lowest 5-year survival rate), such as pancreatic cancer (8.5%), liver hepatocellular carcinoma (17.7%), and lung cancer (18.6%) [43] (see Fig. S4), appear in Fig. 2.

Relation of TMB and PAD from whole-genome sequencing: Because a strong linear relation is observed (see Fig. S5) between the number of SNVs for the whole genome and that for the whole exome of cancer patients, we expect a similar correlation to be borne out by analyzing the WGS data. Using the recently available whole-genome sequencing data provided by the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [27], we performed a similar analysis for 24 cancer types (see Fig. 4). Due to the limited number of samples for each cancer type (see Table VII), we plotted all the patient data without binning into age groups. We found a strong correlation between TMB and PAD for cancer types with low mutation burden (see

due to substantial variations in the mutation burden. Therefore, the findings extracted from the WES data are robust, and are independent of the type of database we used.

Two universal patterns in cancers with low and high TMB: Our analyses above

Surprisingly, the data from these cancers also fall on a straight line but with a nearly zero slope (see the Pearson correlation coefficient ρ, P-value and the green line in Fig. 5B), which supports the absence of any correlation between TMB and PAD for these cancer types.

Next, we investigated the TMB distribution for each cancer type. Surprisingly, we found two universal TMB distribution patterns. A Gaussian distribution is found for all cancers analyzed in Figs. 1 and 3 (see Fig. S6). Remarkably, after rescaling, the TMB distributions from the nine cancer types collapse onto a single master curve described by the normal distribution (see Fig. 5C). The universal TMB distribution found in Fig. 5C for the nine cancers is related to the Gaussian distribution of PAD (see Figs. S17 and S18), and the linear relation between TMB and PAD (see Figs. 1 and 3). By simulating a population of patients with such properties, we obtain a Gaussian distribution for TMB (see Fig. S8 and Materials and Methods). We find it remarkable that for tumor evolution, which is a complex process, essentially a single easily calculable parameter (TMB) from the database explains a vast amount of data for low TMB cancers.
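The reasoning that a Gaussian PAD combined with a linear TMB-PAD relation yields a Gaussian TMB can be sketched with a small simulation. This is not the simulation described in Materials and Methods: the population size, age distribution, slope, intercept, noise level, and the z-score rescaling are all hypothetical choices for illustration, and the check used here (the fraction of rescaled values within one and two standard deviations) is only a coarse test of normality.

```python
import random
from statistics import mean, pstdev


def simulate_tmb(n_patients, mean_age, sd_age, slope, intercept, noise_sd, seed=0):
    """Draw Gaussian ages and map them to TMB through a noisy linear relation."""
    rng = random.Random(seed)
    tmbs = []
    for _ in range(n_patients):
        age = rng.gauss(mean_age, sd_age)
        tmbs.append(intercept + slope * age + rng.gauss(0.0, noise_sd))
    return tmbs


def rescale(values):
    """Shift to zero mean and scale to unit variance (assumed rescaling)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def fraction_within(z_scores, k):
    """Fraction of rescaled values lying within k standard deviations."""
    return sum(abs(z) <= k for z in z_scores) / len(z_scores)


# All parameter values below are hypothetical.
z = rescale(simulate_tmb(20000, mean_age=60, sd_age=12,
                         slope=0.02, intercept=0.1, noise_sd=0.1))
print(f"within 1 sd: {fraction_within(z, 1):.3f}")
print(f"within 2 sd: {fraction_within(z, 2):.3f}")
```

For a normal distribution these fractions are close to 0.683 and 0.954, so values near those indicate that the simulated TMB is approximately Gaussian.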

In contrast, the TMB distribution of the cancers in Fig. 2 with large TMB is governed by a power law (see Fig. S7 and Fig. 5D

Figs. S14 and S15) to capture more nuanced differences between these two cancers. Although the mutation frequency profiles are uniform in chromosomes 1 and 3 in LAML, occasionally, deviation from the mean value is detected (see the peak labeled with the name of a mutated gene in Fig. 6C and Fig. S14) as a consequence of strong selection of driver mutations [45]. For the high TMB cancer, LIHC, a much more complex landscape is observed in the mutation profile. In addition to great variations of mutation frequency from one region to another, more highly mutated genes are detected and are often found to appear in a very close region along the genome (see Fig. 6D and Fig. S15). This has been found in other high TMB cancers such as PAAD and LUAD [28].

We then created the rainfall plot, which is frequently used to capture mutation clusters in chromosomes [28], to describe the different mutation characteristics for low and high TMB

In contrast, many hypermutation regions (with intermutation distance $< 10^4$ bp) are present in LIHC (see the red arrowheads in Fig. 6F and Fig. S17 for the detailed signatures in chromosome 1). Such a non-trivial mutation profile for high TMB cancers provides a hint for the power-law TMB distribution found in Fig. 5D. Similar profiles are also found for the kataegis patterns based on the WGS data (see Fig. S18).

5E). We now assess the correlation between the cancer risk and the mutation burden in cancer patients directly by taking data for the 16 cancers considered above from the TCGA and SEER databases (see Tables IV, V and the Materials and Methods). The cancer risk for males and females varies greatly (see Table V

considered here as shown in Figs. 1-3), we might be able to exclude the influences of these factors on cancer incidence, which would allow us to focus only on TMB.

As an example, we investigated the influence of TMB and patient age on patient response to immunotherapy (see Fig. S26 and Materials and Methods). Surprisingly, we found that a much higher fraction of older patients, compared to younger ones, showed a favorable response given a similar mutation burden level, which cannot be explained if only TMB is used as a predictor of the efficacy of treatment. We explained the data using a theoretical model based on the dynamics of mutation accumulation. We propose that older patients would have accumulated more clonal mutations under the same TMB level, which could be the underlying mechanism for the improved response.

To learn further how the clonal characteristics of mutations influence patient survival, we utilized multi-region sequencing data, which distinguish clonal from subclonal mutations better than single-region sequencing data [60]. We found that patients (lung adenocarcinoma) with a high clonal tumor mutation burden (cTMB, number of clonal mutations) show a better survival probability compared to those with a low cTMB (see Fig. 8A

Binning patients into age groups does not change our results, as also observed in previous studies [30]. In addition, we found similar results by using all patient data from WGS (see Fig. 4 and the Pearson/Spearman correlations shown in Table VII in the Supplementary Information).
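One way to compare binned and unbinned analyses is to compute the Pearson correlation on the raw patient data and again after grouping patients by age. The sketch below is an assumption-laden illustration, not the procedure used here: the 5-year bin width, the use of the median TMB per bin, the helper names, and the synthetic data are all hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def bin_by_age(ages, tmbs, width=5):
    """Group patients into age bins; return mean age and median TMB per bin."""
    groups = defaultdict(list)
    for age, tmb in zip(ages, tmbs):
        groups[int(age // width)].append((age, tmb))
    binned_ages, binned_tmbs = [], []
    for key in sorted(groups):
        members = groups[key]
        binned_ages.append(mean(a for a, _ in members))
        binned_tmbs.append(median(t for _, t in members))
    return binned_ages, binned_tmbs


# Synthetic data: a linear trend plus a deterministic alternating perturbation.
ages = list(range(30, 80))
tmbs = [0.05 * a + (0.2 if a % 2 else -0.2) for a in ages]

rho_raw = pearson(ages, tmbs)
rho_binned = pearson(*bin_by_age(ages, tmbs))
print(f"unbinned rho = {rho_raw:.3f}, binned rho = {rho_binned:.3f}")
```

In this toy data set both coefficients are high, and binning mainly suppresses patient-to-patient scatter; whether the same holds for a real cohort depends on the sample size per bin.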

Besides, we found that a rather accurate mutation rate could be reached by binning the

The joint probability distribution for TMB and PAD

We also calculated the joint probability distribution P(X,Y) for the rescaled TMB

can obtain $F_{ini}$ from $N_{af}$ and the total number of accumulated mutations ($N_t$) for patients at the age of diagnosis (median value), which leads to

One example for THCA is discussed in the main text, with $\mu \approx 30$ mutations/year, $\tau_L = 5$ to $20$ years and $N_t \approx 2100$ mutations, which leads to $F_{ini} \approx 71\%$ to $93\%$ as listed in Table II. For cancers with high TMB (shown in Fig. 2

Take LIHC as an example: $N_I \approx 50 \times 61 \approx 3050$ (61 being the median age of the LIHC patients) and $N_t \approx 4 \times 3000 \approx 12000$ (4 being the TMB for LIHC patients at 61 years of age), which leads to $F_{ini} \approx 25\%$ as listed in Table III. For a few cancers (Fig. 2)

There is an age disparity for patients at diagnosis among different types of cancer between the two sexes (see Table VI). In principle, we should not compare the cancer risks between

influence. Therefore, the assumption that mutations accumulate in both sexes in a roughly similar manner, as made in the main text, holds.
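Returning to the $F_{ini}$ estimate for THCA quoted earlier in this section, the range of 71% to 93% can be checked numerically. The sketch assumes, since the formula is not reproduced in the surviving text, that the mutations accumulated after initiation are $N_{af} = \mu \tau_L$ and that $F_{ini} = 1 - N_{af}/N_t$; the function name is hypothetical, while the inputs (30 mutations/year, latency of 5 to 20 years, 2100 mutations) are the quoted values.

```python
def fraction_before_initiation(n_total, rate_per_year, latency_years):
    """F_ini under the assumption N_af = rate * latency, F_ini = 1 - N_af / N_t."""
    n_after = rate_per_year * latency_years
    return 1.0 - n_after / n_total


# THCA values quoted in the text; the formula itself is an assumption.
f_short_latency = fraction_before_initiation(n_total=2100, rate_per_year=30, latency_years=5)
f_long_latency = fraction_before_initiation(n_total=2100, rate_per_year=30, latency_years=20)

print(f"latency  5 years: F_ini = {f_short_latency:.0%}")
print(f"latency 20 years: F_ini = {f_long_latency:.0%}")
```

Under these assumptions the two latencies give about 93% and 71%, matching the quoted range, which supports (but does not prove) the assumed form of the relation.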

Cancer risk and TMB without age-adjustment

We also examined the correlation between cancer risk and TMB by considering all the mutation data obtained from the TCGA database (see Table IV). As cancer risk for the two sexes varies greatly, we included the data for both of them (see Table V)

reaches high values (see Fig. S25b). However, we neglected the known age disparity (see Figs. S19 and S20) between female and male patients at diagnosis to obtain the results.

After we adjust the TMB by patient ages, as discussed in the main text, we find that the mutation burden is the critical factor in determining the different risks between female and male patients across many types of cancer.

TMB, patient age and immunotherapy for metastatic melanoma patients

We investigated the influence of TMB and patient age on response to immunotherapy.

As an example, we first examined the age distribution for all the patients and the ones who

benefit to immunotherapy [79]. We observed that the TMB is correlated with the patient age for melanoma patients (see Fig. 3c). Thus, the patient age could also influence the response to immunotherapy. However, we find that patients aged ≥ 65 show a response rate (27%) similar to that of younger patients (22%).

In order to remove the influence of TMB and focus solely on age, we compared the

where $\alpha_{A/B}$ is the rate of mutation accumulation in patient A or B. If $T_A > T_B$, then the mutation rates obey $\alpha_A < \alpha_B$ because $N_T$ is the same for both patients. Therefore,

Table I shows the number of patients for each type of cancer used in our analyses for WES data. Tables II and III give the fraction ($F_{ini}$) of accumulated mutations before the initiation of tumors for the two categories of cancer shown in Figs. 1, 3 and Fig. 2, respectively. Table IV gives the median value of (both synonymous and non-synonymous) mutations per megabase for both sexes across 16 types of cancer.

for female is larger than that for male in this group of cancers. The median ages for female and male patients at diagnosis are listed in Table VI.