Extreme value theory as a general framework for understanding mutation frequency distribution in cancer genomes

Currently, there is no recognized population genetics framework describing the population dynamics of cancer cells that is applicable to real cancer genome data. By focusing on cancer as a Darwinian evolutionary system, we formulated and analyzed the observed mutation frequency among tumors (MFaT) as a proxy for the hypothesized sequence read frequency and beneficial fitness effect of a cancer driver mutation. Analogous to intestinal crypts, we assumed that sample donor patients are separate culture tanks where proliferating cells follow certain population dynamics described by extreme value theory (EVT). To validate this, we analyzed three large-scale cancer genome datasets, each harboring > 10 000 tumor samples and in total involving > 177 898 observed mutation sites. We clarified the necessary premises for the application of EVT in the strong selection and weak mutation (SSWM) regime in relation to cancer genome sequences at scale. We also confirmed that the stochastic distribution of MFaT is likely of the Fréchet type, which challenges the well-known Gumbel hypothesis of beneficial fitness effects. Based on statistical data analysis, we demonstrated the potential of EVT as a population genetics framework to understand and explain the stochastic behavior of driver-mutation frequency in cancer genomes as well as its applicability in real cancer genome sequence data.

Variant Allele Frequency in Cancer Genomics and Population

VAF = #{mutated reads} / #{total reads}    (1)

Recent studies have shown that the neutral evolution of the cancer genome results in a power law distribution of tumor bulk-sample VAFs reported by a next-generation sequencer [1,2]. However, to date, no sophisticated statistical framework exists for understanding the mutation frequency distribution in the cancer genome that is consistent with population genetics and can ultimately be used to describe the population dynamics of cancer cells.

The Big Bang Model and Preclonal Evolution of Cancer

According to the "Big Bang" model of human colorectal tumor growth, multiple tumor subclones are formed with a single expansion of cell population at an early stage of tumor growth [3]. In this model, these cells not only acquire clonal lesions that are shared among many cells within the tumor but also acquire subclonal lesions that are observed only in a fraction of those cells at almost the same time. These subclonal lesions are not subjected to stringent selection in cancer evolution (i.e., neutral evolution), thus leading to a state of intra-tumor heterogeneity. However, it is known that cancer as a whole is a Darwinian evolutionary system, in which driver mutations undergo selective evolution and passenger mutations undergo neutral evolution [4]. Preclonal evolution before the expansion of the cell population is thought to be selective, with cancer cells acquiring driver mutations that greatly increase cellular fitness [5].

Cancer Driver Mutations in the Strong-Selection-Weak-Mutation Regime

All cancer driver mutations are beneficial for the survival of cancer cells, and acquisition of these mutations brings fitness gain to cancer cells.
Among population genetics theories that are able to describe the behavior of the fitness effects of beneficial mutations, the strong-selection-weak-mutation (SSWM) regime produces a first-approximation fitness landscape of beneficial mutations [6]. This framework assumes that all beneficial mutations that occur in the population will increase fitness monotonically: i.e., they are on "selectively accessible" paths. Thus, their adaptive walk is typically short.

SSWM is characterized by two key assumptions: strong selection and weak mutation [7]. Here, "strong selection" is equivalent to excluding neutral mutations from consideration. Because all cancer driver mutations are beneficial, the strong selection assumption holds for these mutations. In contrast, "weak mutation" means excluding cases in which multiple mutations exist simultaneously within the population. In this assumption, we consider only a genotype that is made by introducing a single mutation to a single wild-type genome. We model population adaptation as a repetition of this [...]

[...] wherein mutations with greater fitness effects are certainly fixed in the population, resulting in increased repeatability of the fitness trajectory of cells in such a situation, has been confirmed by an Escherichia coli experiment [9]. The fact that a limited number of driver mutations has been observed repeatedly in multiple cancer samples suggests not only the repeatability of the oncogenic process, but also the underlying evolutionary structure itself.

An increase in population size not only strengthens the deterministic traits of the population by clonal interference but also enables its fitness valley crossing [6,10]. This gives rise to an escape genotype that is not on the selectively accessible paths in SSWM, thereby counteracting the deterministic traits of the population.
Stochastic tunneling, a critical element in the valley-crossing mechanism, enables fixation of deleterious mutations as well as neutral evolution [11]. These theoretical aspects of cell population behavior have been validated by E. coli experiments [12]. In cancer, such population dynamics corresponds to the subclonal evolution of passenger mutations, leading to intra-tumor heterogeneity.

Among the evolutionary models mentioned above, this manuscript focuses on SSWM to explain the acquisition of driver mutations by cancer cells. According to the mutational landscape by Gillespie, the fitness of a wild type allele at a given gene locus tends to lie on the right side of a distribution because this fitness is usually high, and the fitness of beneficial mutations lies further to the right [7,13]. In this setting, the fitness values are extreme, and thus their statistical behavior is described by extreme value theory (EVT) [14].

The first application of EVT to cancer preclonal evolution examined the proliferation of stem cells in intestinal crypts [15]. Within the internal space of an individual crypt, the adaptation and evolution of the cell population will be completely independent of the states of different crypts.

In our analysis, we employ an analogy that compares an intestinal crypt and a [...]

[...] mutations are annotated as being in either the "missense mutation," "nonsense mutation," "translation start site," or "nonstop mutation" class. We defined mutations with VAFs in the range of [0.25, 1.00] as clonal, following a pioneering study in the field [1].

We used the following cancer driver-gene definitions in our analysis: the Mutational Driver definition in the IntOgen Cancer Drivers Database (2014.12) dataset [19], the COSMIC Cancer Gene Census [17], and the Tokheim Oncogenes and Tumor Suppressor Genes [20].
Also, we used the following data as cancer driver mutation site definitions: SNVs whose driver activity is "known" in the IntOgen 2016.5 Driver Mutations Catalog Mutation Analysis dataset [21], SNVs in the DoCM database [22], and amino acid substitution information per gene in a recent study (Table S4 in [23]).

In the case of the ICGC dataset, we used the Release 27 Summary Simple Somatic Mutation VCF file, for which the reference genome build is GRCh37 and mutation annotations are based on Ensembl version 75. First, we selected protein-altering mutations (annotated as either "missense variant," "stop gained," "stop lost," or "initiator codon variant") in the VCF file using our custom script (see S1 Script: get IcgcProteinAltering.py). If a mutation had multiple annotated transcripts and had been annotated as a protein-altering mutation in any of those transcripts, we included it in the analysis because it will satisfy the strong selection assumption in cancer driver genes. Next, we selected SNV records that were doubletons or more frequent (affected donors > 1) and then calculated and maximized their MFaTs (donor-based MFaTs).
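The selection steps above can be illustrated with a minimal Python sketch. It is not the S1 Script itself: the record layout, the field names (`consequences`, `affected_donors`), the underscore spelling of the consequence terms, and the toy records are all assumptions made for illustration. Only the four consequence classes, the "more than one affected donor" rule, and the total donor count of 15 285 come from the text.

```python
# Consequence terms treated as protein-altering (spelling with underscores is assumed).
PROTEIN_ALTERING = {
    "missense_variant",
    "stop_gained",
    "stop_lost",
    "initiator_codon_variant",
}


def select_icgc_records(records, total_donors):
    """Keep protein-altering SNVs seen in more than one donor and attach a donor-based MFaT."""
    selected = []
    for rec in records:
        # A record qualifies if ANY annotated transcript has a protein-altering consequence.
        if not PROTEIN_ALTERING.intersection(rec["consequences"]):
            continue
        # Doubletons or more frequent: affected donors > 1.
        if rec["affected_donors"] <= 1:
            continue
        selected.append({**rec, "mfat": rec["affected_donors"] / total_donors})
    return selected


# Hypothetical toy records; site labels and counts are invented for illustration.
toy_records = [
    {"site": "site_A", "consequences": ["missense_variant", "intron_variant"], "affected_donors": 600},
    {"site": "site_B", "consequences": ["synonymous_variant"], "affected_donors": 50},
    {"site": "site_C", "consequences": ["stop_gained"], "affected_donors": 1},
]
kept = select_icgc_records(toy_records, total_donors=15285)
```

In this toy input only `site_A` survives: `site_B` is not protein-altering in any transcript, and `site_C` is a singleton.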

For ease of processing, we added gene symbols, which had been part of the VCF annotations (based on Ensembl 75), as an independent column (see S1 File: database ICGC temp PostMax.tsv.gz). The fields of the affected donor count (affected donors) and the total donor count (total donors, 15 285) were used as the mutated tumor count and the total tumor count in an MFaT calculation, respectively.

In the case of the COSMIC dataset, we used the GRCh37 Version 85 Mutant Export TSV file that includes mutations from cancer cell lines. First, we selected SNVs that were doubletons or more frequent (CNT > 1) from the coding mutation VCF (with redundancy due to annotations) based on mutation IDs. Among those mutant export TSV records that had mutation IDs, we analyzed only those obtained from genome sequencing data ("Genome-wide screen" == "n") mapped to the GRCh37 genome build (GRCh == "37"). We summed mutation IDs based on tumor IDs and calculated the MFaTs for the respective mutations (tumor-based MFaTs). For these mutation records, we added genomic coordinates using the VCF file, formatted the records appropriately, and then removed redundancies due to annotated gene symbols. Finally, we selected [...]

[...] (affected tumors) and the total tumor count (total tumors, 24 355) were used as the mutated tumor count and the total tumor count in an MFaT calculation, respectively.

In the case of the CHANG dataset, we used the PanCancer Unfiltered MAF file.
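The tumor-based counting described above for the COSMIC dataset can be sketched as follows. This is an illustrative reading only: the (mutation ID, tumor ID) row layout, the function name, and the deduplication of repeated rows with a set are assumptions. The doubleton rule and the total tumor count of 24 355 are taken from the text.

```python
from collections import defaultdict


def tumor_based_mfat(rows, total_tumors):
    """Count distinct tumors per mutation ID and return MFaTs for doubletons or more."""
    tumors_by_mutation = defaultdict(set)
    for mutation_id, tumor_id in rows:
        # A set ignores redundant rows (e.g. the same tumor listed once per annotation).
        tumors_by_mutation[mutation_id].add(tumor_id)
    return {
        mutation_id: len(tumors) / total_tumors
        for mutation_id, tumors in tumors_by_mutation.items()
        if len(tumors) > 1
    }


# Hypothetical toy rows: (mutation ID, tumor ID).
toy_rows = [
    ("mut_1", "tumor_1"),
    ("mut_1", "tumor_2"),
    ("mut_1", "tumor_2"),  # redundant row, counted once
    ("mut_2", "tumor_3"),  # singleton, dropped
]
cosmic_mfats = tumor_based_mfat(toy_rows, total_tumors=24355)
```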

Out of all mutations recorded in the file, we selected only records with mutations that were protein-altering (annotated as either "missense variant," "initiator codon variant," "stop gained," or "stop lost") SNVs and potential doubletons according to summed [...]

[...] The fields of the affected sample count (affected samples) and the total sample count (total samples, 11 089) were used as the mutated tumor count and the total tumor count in the MFaT calculation, respectively.

We selected driver mutations according to gene symbols in the case of the driver-gene definitions and to genomic coordinates in the case of driver-site definitions. We generated driver-gene definition files for the IntOgen, CGC, and Tokheim [...]

[...] In the case of the RTCGA mutations dataset, we used a reformatted file by selecting the necessary columns after processing the header row (i.e., we renamed a column from "Start position" to "Start Position") (see database RTCGA temp Format.tsv.gz).

Further, we selected protein-altering SNV records whose reference genome build is [...]

[...] In the case of RTCGA total-tumor analysis, we summed sample IDs over all tumor types (affected samples) across the RTCGA dataset, calculated MFaTs, and subsequently maximized them (see S11 File: database RTCGA temp PostMaxTotal.tsv). The method of extracting the intersection set of mutations for the driver-gene and driver-site definitions is similar to that used for the ICGC and other datasets.

In the case of RTCGA type-specific analysis, we summed sample IDs, calculated MFaTs, and maximized them in a tumor-type-specific manner (affected samples) (see S12 File: database RTCGA temp PostMaxType.tsv). We used only driver-gene [...] infinite.

In addition, we assume that the mutational selection coefficient of a given site is maximized among possible alternate values after evolutionary selection (the maximization assumption).
where S denotes a mutational selection coefficient, i is the index for genomic sites, [...]

To satisfy the maximization assumption for MFaT observations, we selected the maximum MFaT value at each genomic site and excluded the rest from our analysis. As a result, the counts of MFaT values, mutations, and sites are all equal. Here, we define this process as "maximization" of MFaTs, which will enable more exact and reliable data processing in future analyses. In the case of a tumor-type-specific analysis, we performed this maximization process for the respective tumor types.
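The "maximization" step described above amounts to keeping one record per genomic site. A minimal Python sketch, assuming each record is a dictionary with hypothetical `site`, `alt`, and `mfat` fields:

```python
def maximize_mfat(records):
    """Keep, for each genomic site, only the record with the largest MFaT."""
    best_by_site = {}
    for rec in records:
        site = rec["site"]
        # Replace the stored record only when a strictly larger MFaT is found.
        if site not in best_by_site or rec["mfat"] > best_by_site[site]["mfat"]:
            best_by_site[site] = rec
    return list(best_by_site.values())


# Hypothetical toy records: two alternate alleles at site_A, one at site_B.
toy_site_records = [
    {"site": "site_A", "alt": "T", "mfat": 0.020},
    {"site": "site_A", "alt": "G", "mfat": 0.005},
    {"site": "site_B", "alt": "C", "mfat": 0.001},
]
maximized = maximize_mfat(toy_site_records)
```

After this step the number of records equals the number of distinct sites, which matches the statement that the counts of MFaT values, mutations, and sites coincide. For a tumor-type-specific analysis, the same function would be applied separately to the records of each tumor type.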

Parameter Estimation

We estimated the generalized extreme value distribution (GEV) and generalized Pareto distribution (GPD) parameters by the maximum likelihood (Nelder-Mead optimization) method using the "evd" R package (i.e., the evd::fgev and evd::fpot functions). Initial values for the optimization are set in the GEV parameter estimation in the total-tumor analysis [...]

For parameter estimation of the Pareto distribution, we defined its probability density function (PDF) and maximum likelihood estimators as follows, referring to implementations in the "VGAM" R package. [...] Here, n is the length of the observed data vector.
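The Pareto formulas referred to above are not reproduced in this text. As an illustration only, the following Python sketch uses the standard Pareto type I density, f(y) = k a^k / y^(k+1) for y >= a, and its textbook maximum likelihood estimators (a-hat = min(y), k-hat = n / sum(log(y_i / a-hat))). That these match the authors' definitions is an assumption; the function names and test data are hypothetical.

```python
import math


def pareto_pdf(y, scale, shape):
    """Pareto type I density: shape * scale**shape / y**(shape + 1) for y >= scale."""
    if y < scale:
        return 0.0
    return shape * scale ** shape / y ** (shape + 1)


def pareto_mle(data):
    """Closed-form maximum likelihood estimates (scale, shape) for Pareto type I."""
    n = len(data)  # n is the length of the observed data vector
    scale_hat = min(data)
    log_sum = sum(math.log(y / scale_hat) for y in data)
    if log_sum == 0.0:
        raise ValueError("shape is not identifiable when all observations are equal")
    return scale_hat, n / log_sum


# Deterministic toy data: quantiles of a Pareto(scale=0.01, shape=2) distribution.
n_obs = 1000
toy_data = [0.01 * (1.0 - (i + 0.5) / n_obs) ** (-1.0 / 2.0) for i in range(n_obs)]
scale_hat, shape_hat = pareto_mle(toy_data)
```

Because the toy data are exact quantiles, the estimates land close to the generating values (scale 0.01, shape 2), which gives a quick sanity check of the estimator.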
Goodness-of-Fit Assessment by χ² Goodness-of-Fit Test

To assess goodness-of-fit of the GEV and Pareto distributions, we performed a conventional χ² goodness-of-fit test over the total-tumor and tumor-type-specific cancer driver MFaTs using the "stats" R package (stats::chisq.test function; rescale.p and [...]

[...] In Bayesian extreme value analysis, we estimated tumor-type-specific GEV parameters using the Markov Chain Monte Carlo (MCMC) approach. We used the "evdbayes" R implementation for this purpose. Prior distributions of the respective GEV parameters (i.e., shape, scale, and location) were assumed to be independent and normal. This means that the variance-covariance matrix used to calculate the prior distribution is diagonal. Equivalently, this distribution is a trivariate (i.e., with three variables) normal distribution whose variables are mutually independent. In addition, we determined the prior parameters with reference to the results of the bootstrap simulation, which depend on the observed data. This means that the prior distribution used in this step is "informative." Specifically, the normal parameters in the informative priors were set as follows [...]

After the MCMC simulation, we estimated tumor-type-specific GEV parameters using an expected a posteriori (EAP) estimator.
With this definition, we formulate the selection coefficient whereby the cell's genome is mutated from a pre-mutation (or reference) sequence R (with length 1 bp) to a [...]

Then, we consider cancer cell environments that determine selective pressure throughout cancer evolution. We consider the set of all possible environmental states Θ and its elements θ that determine the value of a selection coefficient of a cancer cell with a given genotype. Next, we define the selection coefficient whereby the cell mutates from a pre-mutational sequence R to a post-mutational sequence A at a genomic site i within a given environmental state θ: [...]

In addition, we assume that the fitness effect of a cancer driver mutation after [...], whose values are unique to cancer driver mutation sites, is defined by: [...]

Here, D is the set of post-mutational DNA sequences that are possible at the site i, and d is its single element. D is dependent on the genomic site i and the scope of [...]

[...] demonstrating that S_i converges to the generalized extreme value distribution (GEV) as n → ∞. Finally, as this limit holds due to the infinite micro-environments assumption, the probability distribution of S_i will be GEV over the set of genomic sites J (i ∈ J).

Here, we assume that the mutational selection coefficient (MSC) at a site i is proportional to the MFaT of the site i (the proportionality assumption).
We generalize this by considering normalizing constants of MFaTs, σ_0 and µ_0.
In general, fitness W is unobservable, and so is the relative fitness [...]

The prior distribution of S_i is expressed using the PDF of GEV f_GEV(s) with three parameters (shape ξ, scale σ, and location µ) as follows: [...]

Here, A(s) is a function of s, the argument of the PDF. The likelihood at each value of tumor sample counts, k = 0, 1, 2, ..., m, is expressed using the probability mass function (PMF) of the binomial distribution f_Binom(k) as follows: [...]

From Bayes' theorem and appropriate assumptions over the PMF and PDF, we assume the following equation over the posterior distribution of S_i: [...]

Here, the necessary assumptions are [...]

For the stepwise calculation of P(X = k), we need to consider tumor sample counts at all genomic sites. However, this is easily achieved by normalization of the numerator P(X = k | S_i = s) P(S_i = s). Finally, we estimate S_i by the expected a posteriori (EAP) estimation and obtain its 95% confidence interval (EAP ± 1.96 APSD) by calculating the a posteriori standard deviation (APSD).
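The GEV-binomial calculation outlined above can be sketched numerically on a grid. The sketch below is a reading under stated assumptions, not the authors' implementation: it takes the binomial success probability to be s itself (i.e., normalizing constants σ_0 = 1 and µ_0 = 0), restricts s to (0, 1), uses the standard GEV density with nonzero shape, and normalizes the product of prior and likelihood over the grid. The function names, the grid size, and the example parameter values are hypothetical.

```python
import math


def gev_log_pdf(s, shape, scale, loc):
    """Log density of the GEV distribution for shape != 0; -inf outside the support."""
    z = 1.0 + shape * (s - loc) / scale
    if z <= 0.0:
        return float("-inf")
    log_t = -math.log(z) / shape  # t = z ** (-1 / shape)
    if log_t > 700.0:
        return float("-inf")  # density is numerically zero here
    return -math.log(scale) + (shape + 1.0) * log_t - math.exp(log_t)


def posterior_msc(k, m, shape, scale, loc, grid_size=2000):
    """Grid posterior of S_i given k mutated tumors out of m; returns (EAP, APSD)."""
    grid = [(j + 0.5) / grid_size for j in range(grid_size)]  # s in (0, 1)
    log_w = []
    for s in grid:
        # Binomial log-likelihood; the binomial coefficient cancels on normalization.
        log_lik = k * math.log(s) + (m - k) * math.log(1.0 - s)
        log_w.append(gev_log_pdf(s, shape, scale, loc) + log_lik)
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    post = [w / total for w in weights]  # normalization replaces P(X = k)
    eap = sum(s * p for s, p in zip(grid, post))
    apsd = math.sqrt(sum((s - eap) ** 2 * p for s, p in zip(grid, post)))
    return eap, apsd


# Hypothetical example: Frechet-type prior (shape > 0), 500 tumors of one type.
eap_rare, apsd_rare = posterior_msc(k=5, m=500, shape=0.5, scale=0.005, loc=0.01)
eap_common, apsd_common = posterior_msc(k=50, m=500, shape=0.5, scale=0.005, loc=0.01)
interval_common = (eap_common - 1.96 * apsd_common, eap_common + 1.96 * apsd_common)
```

Working in log space avoids overflow of the binomial coefficient and underflow of the likelihood for large m, which is why the coefficient is dropped rather than computed.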

The Definition of MFaT

The mutant allele frequency among tumors (MFaT) is the frequency of samples within a considered sequencing dataset that have a given mutation at a given genomic site. The MFaT value is defined at any genomic site corresponding to an individual mutation recorded in the dataset. In this formulation, the count of samples having a mutation is normalized by the number of total samples within the dataset, permitting a comparison of mutated sample frequencies between different datasets.
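As a small illustration of this definition, the following Python sketch computes an MFaT and shows how normalization by the total sample count makes values comparable across datasets. The function name and the mutated-sample count of 30 are hypothetical; the three dataset totals are the ones reported in Materials and Methods.

```python
def mfat(mutated_tumors, total_tumors):
    """MFaT = #{mutated tumors} / #{total tumors}."""
    if total_tumors <= 0:
        raise ValueError("total_tumors must be positive")
    if not 0 <= mutated_tumors <= total_tumors:
        raise ValueError("mutated_tumors must lie between 0 and total_tumors")
    return mutated_tumors / total_tumors


# Total sample counts per dataset, as reported in the text.
dataset_totals = {"ICGC": 15285, "COSMIC": 24355, "CHANG": 11089}

# The same hypothetical raw count (30 mutated samples) gives different MFaTs per dataset.
mfat_by_dataset = {name: mfat(30, total) for name, total in dataset_totals.items()}
```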

MFaT = #{mutated tumors} / #{total tumors}    (20)

In this study, we calculated the MFaTs of only protein-altering SNVs that had [...]

[...] (Tables 1 and 2). In this statistical test, the null hypothesis H_0 that may be rejected is "the observed distribution is identical with the theoretical," and the alternative [...]

We then asked whether driver MFaT distributions were also of the Fréchet type in each tumor type. We used the RTCGA dataset for this analysis because of data availability. The results of the "total"-tumor analysis using the RTCGA dataset (Figs 3A and 3B) confirmed that estimated values of GEV parameters in the case of the RTCGA dataset are reproducible and independent of differences in driver-gene definitions, as shown with the ICGC, COSMIC and CHANG datasets (Figs 1C and 2C). Tumor-type-specific analyses of eight tumor types (Figs 3C and 3D) confirmed that the results were similar to the results of the total-tumor analysis. Parameter estimation by Bayesian MCMC (Fig 3C) showed that the MFaT distributions belonged to the Fréchet type, although some degree of variability in GEV parameters according to the differences in tumor types was observed. In addition, the comparison of histograms of observations to estimated densities (Fig 3D) confirmed that the probability distribution of GEV approximately describes the actual distributions of type-specific driver MFaTs. The arrangements of data points in Fréchet plots are approximately linear (R² > 0.90 in each of the eight tumor types), suggesting that driver MFaTs follow the Fréchet distribution (Fig 3E). Finally, the results of the χ² goodness-of-fit test did not reject the null hypothesis that the distribution is GEV, except for the case of THCA (Table 3).
Collectively, the parameter estimation by the Bayesian MCMC approach confirmed that the probability distributions of tumor-type-specific driver MFaTs are also of the Fréchet type, as shown for the total tumors (Figs 1, 2, 3A, and 3B).
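The linearity check behind a Fréchet plot can be sketched as follows. The exact construction used for Fig 3E is not given in this text, so the sketch assumes the usual probability-plot convention: for a Fréchet distribution with CDF F(x) = exp(-(x/σ)^(-α)), the quantity -log(-log F(x)) is linear in log x with slope α. The plotting positions i/(n+1), the function name, and the toy data are assumptions.

```python
import math


def frechet_plot_fit(values):
    """Least-squares line through (log x, -log(-log p)) points; returns (slope, intercept, R^2)."""
    xs = sorted(values)
    n = len(xs)
    px = [math.log(x) for x in xs]
    # Empirical plotting positions p_i = i / (n + 1) for the i-th smallest value.
    py = [-math.log(-math.log((i + 1) / (n + 1))) for i in range(n)]
    mean_x = sum(px) / n
    mean_y = sum(py) / n
    sxx = sum((x - mean_x) ** 2 for x in px)
    syy = sum((y - mean_y) ** 2 for y in py)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(px, py))
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("at least two distinct positive values are required")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Deterministic toy data: exact quantiles of a Frechet(alpha=2, sigma=0.01) distribution.
n_points = 200
toy_values = [
    0.01 * (-math.log((i + 1) / (n_points + 1))) ** (-1.0 / 2.0) for i in range(n_points)
]
slope, intercept, r_squared = frechet_plot_fit(toy_values)
```

For exact Fréchet quantiles the points fall on a straight line, so the fitted slope recovers the tail index and R² is essentially 1; real MFaT data would scatter around the line, which is what the R² > 0.90 criterion summarizes.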

In this series of analyses, we were able to compare these mutational fitness effects estimated by mutation frequencies between tumor types, because the estimators utilize MFaTs normalized by respective sample counts. This normalization thus far is independent of the classes of genes and other features of interest. Thus, EVT as a field of population genetics is consistent with the quantitative comparison of mutational fitness effects among tumor types involving both oncogenes and tumor suppressor genes.

For example, we estimated the MSCs of BRAF mutations among various tumor types (Fig 4A). The estimated MSC was the highest for BRAF V600E in thyroid cancer (THCA), followed by BRAF V600E in skin cutaneous melanoma (SKCM). The BRAF [...]

[Figure caption] In these analyses, only mutations extracted with driver-gene definitions are analyzed. In the total-tumor analysis, we used the IntOgen, CGC, and Tokheim driver-gene definitions, and in the case of tumor-type-specific analyses, we used only the IntOgen definition.
(A) Density plot of MFaTs in the cases of the RTCGA total-tumor driver-gene definitions. The colored solid line is the probability density of observations, the black solid line is the probability density function (PDF) of GEV, and the dotted lines are the PDFs of the GPD and Pareto distributions, respectively. Here, "b" denotes the number of genomic sites of beneficial mutations considered, and "th" denotes the effective threshold against MFaTs when selecting mutations according to ranks (for details, see Materials and Methods).

In the literature, the impact on the fitness effect of the BRAF V600E mutation is likely different among tumor types, including thyroid cancer, skin cutaneous melanoma, and lung adenocarcinoma. The mutation has been associated with poor prognosis and mortality in patients with papillary thyroid cancer [25][26][27]. This is likely the strongest association among these three tumor types. The prevalence of this mutation is also reported in melanoma [28], and it has been shown to induce metastasis of melanoma in mice [29]. However, this is conditional on PTEN loss, suggesting a weaker association compared to the case of thyroid cancer. The contribution of this mutation is even weaker in lung adenocarcinoma [30]. The estimated MSCs of BRAF mutations among tumor types are consistent with these known facts.

Although not much is known about the impact of each mutation on the fitness effect among the tumor types for other genes in Fig 4A, oncogenes (i.e., HRAS and PIK3CA) tend to have a few "mutation-tumor type" combinations that show relatively high MSCs. In contrast, in the case of tumor suppressor genes (i.e., CDKN2A, CTNNB1, and SPOP), MSCs were small and the differences between "mutation-tumor type" combinations were also small. This tendency is evident in the case of TP53 (Fig 4B). [...]

If these two important aspects of cancer driver mutation VAFs are considered, the value of aggregated tumor VAF (i.e., VAF that is aggregated across tumor samples in a given dataset; for details, see Supplementary Materials in S1 Appendix) will be equal to the value of mutant allele frequency among tumors (MFaT), which is given by the ratio of mutated samples to total samples. The use of MFaT will thus be a powerful approach in normalizing, investigating, and deciphering the records of preclonal evolution in large-scale cancer genome data.

The Nine Assumptions in Extreme Value Cancer Genomics

With the framework of SSWM (strong selection and weak mutation) in population genetics [6,7], we were able to mechanistically and stochastically describe the preclonal evolution of cancer. To achieve this and perform valid extreme value analysis over cancer driver MFaTs, we propose the following nine assumptions. These assumptions specify the scope of the application of the theory and enable precise interpretation of the results.

[...] progressive processes are repeatable (the macro-repeatability assumption). This is also a prerequisite for two other assumptions (i.e., the proportionality assumption and the maximization assumption) stated below.

2. The Infinite Micro-Environments Assumption: From a microscopic point of view that focuses on patients' genetic background, physical condition, tissue type, and tissue micro-structure, as well as the genetic diversity of cancer itself, the uncertainty of evolutionary processes, and many other critical aspects of cancer evolution, there are infinitely many possible cancer micro-environments. This is a prerequisite of the independent and identical distribution assumption and the maximization assumption stated below.

[...] beneficial or deleterious, and no neutral mutations are considered [7]. Because we presume driver mutations are all beneficial in cancer evolution [4], we can safely accept this assumption over driver mutations. In the analysis, the validity of this assumption is ensured by removing mutations with lower observed fitness gains.

For the case of total-tumor analysis with driver-gene definitions, we focused on [...]

So far, the fitness effect that a mutation yields in cancer evolution is only indirectly observable. Thus, neither the existence nor the precise form of the probability distribution of that variable is known. However, the repeatability of cancer evolution implies that such fitness effects of a mutation have a certain probability distribution, and the complexity suggests that the variable is approximately continuous. From the above, we assume that the mutational fitness gains in cancer evolution have a certain continuous probability distribution.

[...] then the phenotypic effects of the two mutations also vary. This is because different genomic sites encode different structures and functions of the organism. For example, mutations in the first and third letters of triplets in the codon table will yield different amino acid substitutions (i.e., the first- and third-letter substitutions are independent). In contrast, the fitness effect that a phenotypic effect of a given mutation confers on the organism is dependent on the environment in which the organism adapts and evolves. From the above-mentioned infinite micro-environments assumption, we have numerous cases of such environments in cancer evolution. Under these possible environments, we assume that the fitness effect of a given, single mutation has a certain probability distribution that is independent of the genomic site of the mutation (i.e., any given mutation has an identical probability distribution). Then, the value of the fitness effect of a mutation is independent and identically distributed (i.i.d.) across genomic sites. This is equivalent to excluding cases of interaction of mutation effects (i.e., epistasis) from the scope.
Here, from the single macro-environment assumption, we consider adaptation of cancer cells to the single cancer macro-environment in the preclonal evolution step. We consider that, under such a macro-environment, cancer cells are selected based on combinations of different micro-environments and different fitness effects of cancer driver mutations. Thus, we assume that mutational fitness gain at a given cancer driver site is maximized across possible values (the block maxima model) after selection in the preclonal evolution. This is consistent with the idea of "survival of the fittest" in the theory of natural selection.

S_i = \max(S_{i,1}, S_{i,2}, \ldots, S_{i,n}) \quad (n \to \infty)

Here, S is a selection coefficient, i is an index for genomic sites, and n is the number of possible micro-environments.
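The block maxima model above can be illustrated with a short simulation. The sketch below draws hypothetical per-environment selection coefficients S_{i,1}, ..., S_{i,n} and tracks their running maximum as n grows. The parent distribution (a shifted Pareto from the standard library), the tail index, and the seed are arbitrary choices for illustration and are not taken from the text.

```python
import random


def running_block_maxima(n_envs, seed=0, tail_index=2.0):
    """Running maximum of n_envs toy selection coefficients, one per micro-environment."""
    rng = random.Random(seed)
    current = float("-inf")
    maxima = []
    for _ in range(n_envs):
        # Toy heavy-tailed draw, shifted so that coefficients start at zero.
        s = rng.paretovariate(tail_index) - 1.0
        current = max(current, s)  # S_i = max(S_i1, ..., S_in) for the draws so far
        maxima.append(current)
    return maxima


maxima = running_block_maxima(n_envs=1000)
s_i = maxima[-1]  # block maximum over all 1000 simulated micro-environments
```

The running maximum never decreases as more micro-environments are added, which is the property the maximization assumption relies on; with a heavy-tailed parent such as this one, the suitably normalized maximum falls in the Fréchet domain of attraction.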

Although some of the above assumptions may not fit with our current knowledge of cancer biology, the results of our analysis suggest that they may hold at least as a first approximation.

[...] from that distribution will be one of these three types: Gumbel, Fréchet, or Weibull [31]. Many "ordinary" probability distributions, such as normal, exponential, and gamma, belong to the Gumbel maximum domain of attraction. Based on this fact and on the argument that Fréchet-type and Weibull-type distributions are not "biological," Gumbel-type distributions have been justified as distributions of fitness effects of beneficial mutations [32]. In addition, a historical background in which such a fitness effect distribution has been considered to be exponential also supported this preconception (Fisher's geometric model; [24,33]). However, recent theoretical advances clarified that distributions that belong to the Fréchet and Weibull domains are also possible [14]. [...]

Also, an Escherichia coli experiment designed as an application of EVT empirically confirmed that the fitness effects of fixed beneficial mutations follow a distribution with a positive mode [39]. Although the experimental settings, including the method to quantify mutant fitness, differ greatly from those of this study, the Fréchet distribution as a statistical distribution that describes the behavior of fitness effects of fixed driver mutations in tumor samples also has a positive mode mathematically.

Our study suggested that the distribution of fitness effects of driver mutations calculated from a sample frequency in a large-scale sequence dataset is of the Fréchet type (Figs 1-3), while it also allows distributions of the fitness effects of the individual mutations to remain unknown.
The results of goodness-of-fit tests (Tables 1-3) [...] present a problem to the previously held Gumbel hypothesis [14] from a practical point of view but also suggest the applicability of the Fréchet distribution in cancer genomics (Fig 4). In the case of THCA (Table 3), the null hypothesis was rejected in the goodness-of-fit test, and this result was not reproduced. The graph indicates that this irreproducibility is due to the small number of mutations used in the analysis (Fig 3E).

The Applicability of Extreme Value Theory in Cancer Genomics

The posterior distributions of tumor-type-specific mutational selection coefficients (MSCs) of driver mutations calculated by the GEV-binomial model (Fig 4) contain information of distribution tails described by EVT. In the violin plots, because the EAP estimates drawn as white dots contain information of the tails that cannot be handled by a simple binomial model, those estimates have shifted to the right from the central point, as suggested by the shapes of the posterior distributions. Such shifts are significant in posterior distributions of mutations, such as the S33P mutation in the CTNNB1 gene in the LIHC tumor type and the Q61R mutation in the HRAS gene in the THCA tumor type (Fig 4A). Similarly, while EAP estimates of driver mutation MSCs in the TP53 gene strongly reflect MFaTs that are mutant sample frequencies, these values also