The theory of massively repeated evolution and full identifications of Cancer Driving Nucleotides (CDNs)

Tumorigenesis, like most complex genetic traits, is driven by the joint actions of many mutations. At the nucleotide level, such mutations are Cancer Driving Nucleotides (CDNs). The full sets of CDNs are necessary, and perhaps even sufficient, for the understanding and treatment of each cancer patient. Currently, only a small fraction of CDNs is known as most mutations accrued in tumors are not drivers. We now develop the theory of CDNs on the basis that cancer evolution is massively repeated in millions of individuals. Hence, any advantageous mutation should recur frequently and, conversely, any mutation that does not is either a passenger or deleterious mutation. In the TCGA cancer database (sample size n = 300 - 1000), point mutations may recur in i out of n patients. This study explores a wide range of mutation characteristics to determine the limit of recurrences (i*) driven solely by neutral evolution. Since no neutral mutation can reach i* = 3, all mutations recurring at i ≥ 3 are CDNs. The theory shows the feasibility of identifying almost all CDNs if n increases to 100,000 for each cancer type. At present, only < 10% of CDNs have been identified. When the full sets of CDNs are identified, the evolutionary mechanism of tumorigenesis in each case can be known and, importantly, gene targeted therapy will be far more effective in treatment and robust against drug resistance.


INTRODUCTION
Cancers are complex genetic traits with multiple mutations that interact to yield the ensemble of tumor phenotypes.The ensemble has been characterized as "cancer hallmarks" that include sustaining growth signaling, evading growth suppression, resisting apoptosis, achieving immortality, executing metastasis and so on (Hanahan and Weinberg 2000;Hanahan and Weinberg 2011;Hanahan 2022).It seems likely that each of the 6-10 cancer hallmarks is governed by a set of mutations.Most, if not all, of these mutations are jointly needed to drive the tumorigenesis.
In the genetic sense, cancers do not differ fundamentally from other complex traits whereby multiple mutations are simultaneously needed to execute the program.A well-known example is the genetics of speciation whereby interspecific hybrids are either sterile or infertile even though they do not have deleterious genes (Wu and Ting 2004;Wang et al. 2022;Wu 2022).A recent example is SARS-CoV-2.
The early onset of COVID-19 requires all four mutations of the D614G group and the later Delta strain has 31 mutations accrued in three batches (Ruan, Wen, et al. 2022;Ruan, Hou, et al. 2022;Cao et al. 2023;Ruan et al. 2023) While cancer research has often proceeded one mutation at a time, each of the mutations has been shown to be insufficient for tumorigenesis until many (Ortmann et al. 2015;Takeda 2021;Hodis et al. 2022) are co-introduced.
We now aim for the identification of all (or at least most) of the driver mutations in each patient.Both functional tests and treatments demand such identifications.The number of key drivers has been variously estimated to be 6 -10 (Martincorena et al. 2017;Anandakrishnan et al. 2019;Campbell et al. 2020).Although cancer driving "point mutations", referred to as Cancer Driving Nucleotides (or CDNs), are not the only drivers, they are indeed abundant (Campbell et al. 2020).Furthermore, CDNs, being easily quantifiable, may be the only type of drivers that can be fully identified (see below).Here, we will focus on the clonal mutations present in all cells of the tumor without considering within-tumor heterogeneity for now (Ling et al. 2015;Turajlic et al. 2019;Black and McGranahan 2021;B. Chen et al. 2022;Zhai et al. 2022;Bian et al. 2023;Zhu et al. 2023).
Since somatic evolution proceeds in parallel in millions of humans, point mutations can recur multiple times as shown in Fig. 1.The recurrences should permit the detection of advantageous mutations with unprecedented power.The converse should also be true that mutations that do not recur frequently are unlikely to be advantageous.Fig. 1 depicts organismal evolution and cancer evolution.While both panels A and B show 7 mutations, there can be two patterns for cancer evolution -the pattern in (C) where all mutations are at different sites is similar to the results of organismal evolution whereas the pattern in (D) is unique in cancer evolution.The hotspot of recurrences shown in (D) holds the key to finding all CDNs in cancers.
While many studies conclude that sites of mutation recurrence are largely mutational hotspots (Buisson et al. 2019;Hess et al. 2019;Stobbe et al. 2019;Nesta et al. 2021;Bergstrom et al. 2022), others deem them functional hotspots, driven by positive selection (Gartner et al. 2013;Chang et al. 2016;Bailey et al. 2018;Cannataro et al. 2018;Juul et al. 2021;Zhao et al. 2021;Zeng and Bromberg 2022).In the attempt to distinguish between these two hypotheses, studies make assumptions, often implicitly, about the relative importance of the two mechanisms in their estimation procedures.The conclusions naturally manifest the assumptions made when extracting information on mutation and selection from the same data (Elliott and Larsson 2021).
This study consists of three parts.First, the mutational characteristics of sequences surrounding CDNs are analyzed.Second, a rigorous probability model is developed to compute the recurrence level at any sample size.Above a threshold of recurrence, all mutations are CDNs.Third, we determine the necessary sample sizes that will yield most, if not all, CDNs.In the companion study, the current cancer genomic data are analyzed for the characteristics of CDNs that have already been discovered (Zhang et al. 2024).Together, these two studies show how full functional tests and precise target therapy can be done on each cancer patient.

RESULTS
In PART I, we search for the general mutation characteristics at and near high recurrence sites by both machine learning and extensive sequence comparisons.In PART II, we develop the mathematical theory for the maximal level of recurrences of neutral mutations (designated i * ).CDNs are thus defined as mutations with ≥ i * recurrences in n cancer samples.We then expand in PART III the theory to very large sample sizes (n ≥ 10 5 ), thus making it possible to identify all CDNs.
To carry out the analyses, we first compile from the TCGA database the statistics of multiple hit sites (i hits in n samples) in 12 cancer types.This study focuses on the mutational characteristics pertaining to recurrence sites.We often present the three cancer types with the largest sample sizes (lung, breast and CNS cancers), while many analyses are based on pan-cancer data.Analyses of every of the 12 cancer types individually will be done in the companion paper.The TCGA database is used as it is well established and covers the entire coding region (Weinstein et al. 2013).Other larger databases (Cerami et al. 2012;Tate et al. 2019;de Bruijn et al. 2023) are employed when the whole exon analyses are not crucial.
The compilation of multi-hit sites across all genes in the genome Throughout the study, Si denotes the number of synonymous sites where the same nucleotide mutation occurs in i samples among n patients.A i is the equivalent of S i for non-synonymous (amino acid altering) sites.Table 1 presents the numbers from lung cancer for demonstration.It also shows the A i and S i numbers with CpG sites filtered out.There are 22.5 million nonsynonymous sites, among which ~ 0.2 million sites have one hit (A 1 = 195,958) in 1035 patients.The number then decreases sharply as i increases.Thus, A2 = 2946 (number of 2-hit sites), A3 = 99, and A4 + A5 + ... = 79.We also note that the A i /S i ratio increases from 2.89, 2.82, 3.04 to 4.71 and so on.

Table 1
Fig. 2 Fig. 2 shows the average of A i and S i among the 12 cancer types (see Methods).The salient features are shown by differences between the solid and dotted lines.As will be detailed in PART II, the dotted lines, extending linearly from i = 0 to i = 1 in logarithmic scale, should be the expected values of A i and Si, if mutation rate is the sole driving force.In the actual data, A1 and S1 decrease to ~0.002 of A0 and S 0 , the step being least affected by selection (see PART II later).For A 2 and S 2 , the decrease is only ~0.01 of A1 and S1.The decrease from i to i+1 becomes smaller and smaller as i increases, suggesting that the process in not entirely neutral.Furthermore, the lower panel of Fig. 2 shows that Ai/Si continues to rise as i increases.These patterns again suggest a stronger positive selection at higher i values.The extrapolation lines shown in Fig. 2 roughly define i = 3 as a cutoff where the expected A 3 falls below 1 (see PART II for details).The precise model of PART II will define high recurrence sites (i ≥ 3) as CDNs.
PART I -The mutational characteristic of high recurrence sites In this part, the analyses are done in two different ways.The sequence-feature approach is to examine the mutation characteristics of sequence features (say, 3 mers, 5 mers, etc.) across patients.The patient-feature approach is to examine patients for their mutation signatures and mutation loads.
1.The sequence-feature approach The simplest and best-known sequence feature associated with high mutation rate is CpG sites.In mammals, methylation and de-amination would enhance the mutation rate from CpG to TpG or CpA by 5~10-fold (Hodgkinson and Eyre-Walker 2011;Ségurel et al. 2014).As the CpG site mutagenesis has been extensively reported, we only present the confirmation in the Supplement (Fig. S1).Indeed, CpG sites account for ~6.5% of the coding sequences but contributing ~22% among the mutated sites in Fig. 2. Hence, the 7-fold increase in the CpG mutation rate should contribute more to A i and S i as i increases.Table 1 has shown the effects of filtering out CpG sites in the counts of recurrences.Clearly, CpG sites do contribute disproportionately to the recurrences but, even when they are separately analyzed, the conclusion is unchanged.As shown later in PART II, every increment of i should decrease the site number by ~0.002 in the TCGA database.Thus, even with a 10-fold increase in mutation rate, the decrease rate would still be 0.02.In the theory sections, CpG site mutations are incorporated into the model.
In this section, we aim to find out how extreme the mutation mechanisms must be to yield the observed recurrences.If these mechanisms seem implausible, we may reject the mutational-hotspot hypothesis and proceed to test the functional hotspot hypothesis.
The analyses of mutability variation by Artificial Intelligence (AI) The variation of mutation rate at site level could be shaped by multiple mutational characteristics.
Epigenomic features, such as chromatin structure and accessibility, could affect regional mutation rate at kilobase or even megabase scale (Stamatoyannopoulos et al. 2009;Makova and Hardison 2015), while nucleotide biases by mutational processes typically span only a few base pairs around the mutated site (Roberts et al. 2013;Haradhvala et al. 2018;Herzog et al. 2021).AI-powered multi-modal integration offers a new tool to quantify the joint effect of various factors on mutation rate variability (Luo et al. 2019;Sherman et al. 2022;Song et al. 2023).Here we explore the association between the mutation recurrence (i) and site-level mutation rate predicted by AI.
Fig. 3 Fig.3A shows the mutation rate landscape across all recurrence sites in breast, CNS and lung cancer using the deep learning framework Dig (Sherman et al. 2022).In this approach, the mutability of a focus site is calculated based on both local stretch of DNA and broader scale of epigenetic features.The X-axis shows all mutated sites with i > 0 scanned by Dig.While the mutation rate fluctuates around the average level, we detect no significant difference in mutation rate as a function of i (Methods).In CNS, two sites exhibited exceptional mutability at i = 6, surpassing the average by 10-fold.Unsurprisingly, these two are CpG sites and correspond to amino acid change of V774M and R222C in EGFR (Supplement File S2), which are canonical actionable driver mutations in glioma target therapy.In other words, the two sites called by AI for possible high mutability appear to be selection driven.
In Fig. 3B, we take a closer look at how CDN is situated against the background of mutation rate variation, using the example of PAX3 (Paired Box Homeotic Gene 3) (Wang et al. 2008;Li et al. 2019).
In this typical example, Dig predicts site mutability to vary from site to site.In the lower panel is an expanded look at a stretch of 200 bps.In this stretch, about 8% of sites are 5 to 10-fold more mutable than the average.But none of them are mutated in the data (i = 0).There are indeed three sites with i > 0 in this DNA segment including a CDN site C1271T (marked by the red star).This CDN site is estimated to have a 2-fold elevation in mutability, which is less than 1/50 of the necessary mutability to reach i ≥ 3. The other two mutated sites, marked by the green star, are also indicated.
Other AI methods have also been used in the mutability analysis (Fang et al. 2022), reaching nearly the same conclusion.Overall, while AI often suggests sequence context to influence the local variation in mutation rate, the reported variation does not correspond to the distribution of CDNs.In the next subsection, we further explore the local contexts for potential biases in mutability.
The conventional analyses of local contexts -from 3-mers to 101-mers Since the AI analyses suggest the dominant role of local sequence context in mutability, we carry out such conventional analyses in depth.Other than the CpG sites, local features such as the TCW (W = A or T) motif recognized by APOBEC family of cytidine deaminases (Burns et al. 2013;Roberts et al. 2013), would have impacts as well.We first calculate the mutation rate for motifs of 3-mer, 5-mer and 7-mer, respectively, with 64, 1024, 16384 in number, (see Methods).The pan-cancer analyses across the 12 cancer types are shown in Fig. 4A.We use α to designate the fold change between the most mutable motif and the average.Since the number of motifs increases 16-fold between each length class, the α value increases from 4.7 to 8.8 and then to 11.5.Nevertheless, even the most mutable 7-mer¸ TAACGCG, which has a CpG site at the center, is only 11.52-fold higher than the average.This spread is insufficient to account for the high recurrences, which decrease to ~0.002 for each increment of i (see PART II below).

Fig. 4
A more direct approach is given in Fig. 4B explained in the legends.The absence of a trend shows that the high recurrence sites are not associated with the mutability of the motif.In Fig. 4C, the analysis extends to longer motifs of 21 bp, 41 bp, 61 bp, 81 bp and 101 bp surrounding the high-recurrence sites.For example, the motif of 101 bp may be (10, 90), (20, 80) and so on either side of a recurrence site.We then compute the pairwise differences in sequences of the motifs among recurrence sites.The logic is that, if certain motifs dictate high mutation rates, we may observe unusually high sequence similarity in the pairwise comparisons.As can be seen in all 5 panels of Fig. 4C, there are no outliers in the tail of the distribution.In other words, the sequences surrounding the high-recurrence sites appear rather random.Detailed motif analysis of CDNs within individual cancer types using deep learning models (ResNet, LSTM and GRU) further supports this conclusion.
In conclusion, the analyses by the sequence approach do not find any association between high recurrence sites (i ≥ 3) and the mutability of the local sequences.

The patient-feature approach
In this second approach, we examine the mutation characteristics among patients across sequence features.The first question is whether high recurrence sites tend to happen in patients with higher mutation loads.Fig. 5A depicts the distribution of mutation loads among patients harboring a CDN of recurrence i.Hence, a patient's load may appear several times in the plot, each appearance corresponding to one CDN in the patient's data.For the comparison across i values, the mutation load is normalized by a z-score within each cancer population to equalize the 3 cancer types.The overall trend shows consistently that patients with recurrence sites do not bias toward high mutation loads.
The presence of recurrence sites in patients with low mutation loads suggests that overall mutation burden is not a determining factor of recurrence.

Fig. 5
With the results of Fig. 5A, we then ask a related question: whether these high recurrence mutations are driven by factors that affect mutation characteristics.Such influences have been captured by the analyses of "mutational signatures" (Alexandrov et al. 2013;Alexandrov et al. 2020).Each signature represents a distinct mutation pattern (e.g., high rate of TCT -> TAT and other tri-nucleotide changes) associated with a known factor, such as smoking or an aberrant mutator (aristolochic acid, for example).Each patient's mutation profile can then be summarized by the composite of multiple mutational signatures.
The issue is thus whether a patient's CDNs can be explained by the patient's composite mutational signatures.Fig. 5B reveals that in lung cancer, the signature compositions among patients with different recurrence cutoffs are statistically indistinguishable (Methods).Smoking (signature SBS4) consistently emerges as the predominant mutational process across all levels of recurrences.In breast cancer, while SBS2 and SBS13 exhibit some differences in the bins of i * = 2 and i * = 3, the profiles remain rather constant for all bins of i * ≥ 3. The two lowest bins, not unexpectedly, are also different from the rest in the total mutation load (see Fig. 5A).In table S1, we provide a comprehensive review of supporting literature on genes with recurrence sites of i ≥ 3 for breast cancer.In CNS cancer, SBS11 appears significantly different across bins, in particular, i ≥ 20.This is a signature associated with Temozolomide treatment and should be considered a secondary effect.In short, while there are occasional differences in mutational signatures across i * bins, none of such differences can account for the recurrences (see PART II).
To conclude PART I, the high-recurrence sites do not stand out for their mutation characteristics.
Therefore, the variation in mutation rate across the whole genome can reasonably be approximated mathematically by a continuous distribution, as will be done below.
PART II -The theory of Cancer Driving Nucleotides (CDNs) We now develop the theory for S i and A i under neutrality where i is the recurrence of the mutation at each site.We investigate the maximal level of neutral mutation recurrences (i * ), above which the expected values of Si and Ai are both < 0. Since no neutral mutations are expected to reach i * , every mutation with the recurrence of i * or larger should be non-neutral.Importantly, given that the expected S i and A i is a function of U i where U = nE(u) is in the order of 10 -2 and 10 -3 , i * is insensitive to a wide range of mutation scenarios.For that reason, the conclusion is robust.
The mutation rate of each nucleotide (u) follows a Gamma distribution with a scale parameter θ and a shape parameter k.Gamma distribution is often used for its flexibility and, in this context, models the waiting time required to accumulate k mutations.Its mean (=kθ ) and variance (=kθ

2
) are determined by both parameters but the shape (skewness and kurtosis) is determined only by k.In particular, the Gamma distribution has a long tail suited to modeling a small number of sites with very high mutation rate.
We now use synonymous (Si) mutations as the proxy for neutrality.Hence, in n samples, where L S is the total number of synonymous sites and u(l) is the mutation rate of site l.In the equation above, the term ሾ1 െ ‫ݑ‬ሺ݈ሻሿ ି is dropped.We note that ሾ1 െ ‫ݑ‬ሺ݈ሻሿ ି ~݁ሾି௨ሺሻሺିሻሿ ~1 as u is in the order of 10 -6 and n is in the order of 10 2 from the TCGA data.With the gamma distribution of u The number of 2.3 is roughly the ratio of the total number of nonsynonymous over that of synonymous sites (Hartl and Clark 1989;Li 1997;Chen et al. 2019).This number would vary moderately among cancers depending on their nucleotide substitution patterns.
E(u) of Eqs.2-3 is generally (1~5) × 10 -6 per site in cancer genomic data and n is generally between 300 and 1000.Hence, nE(u) is the total mutation rate summed over all n patients and is generally between 0.001 and 0.005 in the TCGA data.Given nE(u) is in the order of 10 -3 , Si and Ai would both decrease by 2~3 orders of magnitude with each increment of i by 1.We note that the total number of synonymous sites, LS, is ~0.9 × 10 7 and LA is ~2.3 times larger.Therefore, S3 < 1 and A3 < 1.When i reaches 4, S i≥4 and A i≥4 would both be S 1 when averaged over cancer types.
For each cancer type, the conclusion of S3 < 1 and A3 < 1 is valid with the actual value of S3 and A3 ranging between 0.01 and 1.In other words, with n < 1000, neutral mutations are unlikely to recur 3 times or more in the TCGA data (i * = 3).While S i≥3 and A i≥3 sites are high-confidence CDNs, the value of i * is a function of n.At n ≤ 1000 for the TCGA data, i * should be 3 but, when n reaches 10,000, i * will be 6.The benefits of large n's will be explored in PART III.
Possible outliers to the distribution of mutation rate Although we have explored extensively the sequence contexts, other features beyond DNA sequences could still lead to outliers to mathematically distributions.These features may include DNA stem loops (Buisson et al. 2019) or unusual epigenetic features (Zheng et al. 2014;Makova and Hardison 2015;Supek and Lehner 2015).We therefore expand the model by assuming a small fraction of sites (p) to be hyper-mutable that is α fold more mutable than the genomic mean.Most likely, α and p are the inverse of each other.For the bulk of sites (1-p) of the genome, we assume that their mutations follow the Gamma distribution.(Nevertheless, the bulk can be assumed to have a fixed mutation rate of E(u) without affecting the conclusion qualitatively.) We let p range from 10 -2 to 10 -5 and α up to 1000.As no stretches of DNA show such unusually high mutation rate (1000-fold higher than the average), such sites are assumed to be scattered across the genome and are rare.With the parameter space of defined above, we choose the (p, α) pairs that agree with the observed values of S1 to S3 which are sufficiently large for estimations.Table 2 presents the value range and standard deviation for p and α across the 6 cancer types that have >500 patient samples.Among the 6 cancer types, the lung cancer data do not conform to the constraints and we set p = 0.With observed values for S 1~3 as constraints, S 4 rarely exceeds 1.Hence, even with the purely conjectured existence of outliers in mutation rate, i * = 4 is already too high a cutoff.
Table 2 Table 2 suggests that p has to be smaller than 10 -5 and α > 1000 to yield S 4 > 1.Since the coding region has 3×10 7 sites, p < 10 -5 would mean that the outliers are at most in the low hundreds.In other words, the number of high recurrence sites projected by the theory is close to the observed numbers.
Therefore, there are really no unknown outlier sites of high mutation rate.Positive selection would be a more straightforward explanation, explained below.

The influence of selection on mutation occurrences
We now show that, although the mutational bias alone cannot account for the high occurrences, selection can easily do so.We assume a fraction, f, of Ai's to be under positive selection.The fraction should be small, probably S 0.01, and will be labeled where G * is also a constant but its value depends on f.The crucial parameter, w (= 2Ns), is the selective advantage (s) scaled by the population size of progenitor cancer cell (N).Since w can easily be >10, even at i = 3, ‫כ‬ / would be >100 as large as ‫כ‬ / .In other words, observed mutation recurrences at i ≥ 3 for advantageous mutation should not be uncommon.Eq. 4 also shows that w and E(u) jointly affect the recurrence; therefore, CpG sites (many of which fall in functional sites) are expected to be strongly represented among high recurrence sites.
PART III.The theory of large samples (n > 10 5 ) and identification of all CDN's Using the theory of CDN developed above, the companion paper shows that each sequenced cancer genome in the current databases, on average, harbors only 1~2 CDNs.The number varies in this range depending on the cancer type (Zhang et al. 2024).For comparison, tumorigenesis may require at least 5~10 driver mutations as estimated by various criteria (Armitage and Doll 1954;Hanahan and Weinberg 2011;Belikov 2017;Martincorena et al. 2017;Anandakrishnan et al. 2019;Campbell et al. 2020).The results show that there are many more CDNs that have not been discovered.This is not unexpected since most CDNs are found in <1% of patients.If each CDN is observed in 1% of patients and each patient has 5 CDNs, then there should be at least 500 CDNs for each cancer type.
In the companion study that uses the A/S ratios of Fig. 1, the estimated number of CDNs ranges from 500 to 2000, whereas the current estimates based on A i≥3 sites is only 50~100 (Zhang et al. 2024).
Where, then, are the undiscovered CDNs and how to find them?Since all A i≥3 sites are concluded to be CDNs, the bulk of CDNs must be among A 1 and A 2 sites.The best way to identify the CDNs hidden in A 1 and A2 is to increase the sample size, n, dramatically.
We hence extend Eq. 1 for Si and Ai to large n's.Note that Eq. 1 drops the term of ሺ1 െ ‫ݑ‬ሻ ି ~݁ି௨ሺିሻ as it is ~ 1, when nE(u) S 1.With a large n when ݁ ି௨ሺିሻ is not near 1, the recurrence of mutations would follow a Poisson distribution with the expected value of nE(u).Assuming that u follows a gamma distribution with a shape parameter of k, the probability of observing i mutations would follow the negative binomial distribution as shown below: ݂൫݅|݇, ݊, ‫ܧ‬ሺ‫ݑ‬ሻ൯ ൌ ߁ሺ݅ ݇ሻ ߁ሺ݅ 1ሻ߁ሺ݇ሻ ݇ ሾ݊‫ܧ‬ሺ‫ݑ‬ሻሿ ሾ݇ ݊‫ܧ‬ሺ‫ݑ‬ሻሿ ିି Eq.
The cumulation density function for Eq. 6 is then: Then, by definition, A i≥i* should be S 1 so that mutations with recurrences i > i * could be defined as CDN.Thus: Where ε = A i≥i* denotes the number of sites with mutation recurrence ≥i * under the sole influence of mutational force.ఌ ಲ could then be regarded as significance of i * since it controls the overall false positive rate of CDNs.Specifically, with k = 1, the probability function of mutation recurrence of a given site would transform to a geometric distribution with p = 1 / (1 + nE(u)), the cumulative density function (CDF) is then: Combined with Eq. 5, the mutation recurrence cutoff i * of being a CDN could be expressed as: For very large n, 1 / nE(u) is small and i * /n can be approximated as ݅ ‫כ‬ / ൌ ݈‫݃‬ሺ ሻ ‫ڄ‬ ሺሻ~5 ൈ 10 ିହ Eq.
Eq. 11 shows that i * /n would approach ݈‫݃‬ሺ ሻ ‫ڄ‬ ሺሻ asymptotically as n increases.This asymptotic value is attained when n reaches ~10 6 .Fig. 6 shows the range of i * for n up to 10 6 .As expected, i * increases by small increments while n increases in 10-fold jumps.For example, when n increases by 3 orders of magnitude, from 100 to 100,000, i * only doubles from 3 to 12.The disproportional increment between i * and n explains why we use the actual number of i * for the cutoff, instead of the ratio i * /n.As shown in the inset of Fig. 2, the ratio of i * /n would approach the asymptote at n~10 6 , where an advantageous mutation only needs to rise to 0.00006 to be detected.With n reaching this level, we shall be able to separate most CDNs apart from the mutation background.
When n approaches 10 5 , the number of CDNs will likely increase more than 10-fold as conjectured in Fig. 7A.In that case, every patient would have, say, 5 CDNs that can be subjected to gene targeting.
(The companion study shows that, at present, an average cancer patient would have fewer than one targetable CDN.)Before the project is realized, it is nevertheless possible to test some aspects of it using the GENIE data of targeted sequencing.Such screening for mutations in the roughly 700 canonical genes serves the purpose of diagnosis with n ranging between 10~17 thousands for the breast, lung and CNS cancers.Clearly, GENIE efforts did not engage in discovering new mutations although they would discover additional CDNs in the canonical genes.In the companion study, we demonstrate that the analysis of CDNs identifies a potential set of 1.6 times more driver genes than those detected by whole gene selection signal calls (Zhang et al. 2024).
Fig. 7A assumes that the prevalence, i/n, should not be much affected by n, but the cutoff for CDNs, i * /n, would decrease rapidly as n increases.Fig. 7B shows that the i/n ratios indeed correspond well between TCGA and GENIE, which differ by 10 to 20-fold in sample size.Importantly, as predicted in Fig. 6, the number of CDNs increases by 3 to 5-fold (Fig. 7C -E).Many of these newly discovered CDNs from GENIE are found in the A1 and A2 classes of TCGA while many more are found in the A0 class in TCGA.In conclusion, increasing n by one to two orders of magnitude would be the simplest means of finding all CDNs.

DISCUSSION
The nature of high-recurrence mutations has been controversial.Many authors have argued for mutational hotspots (Hess et al. 2019;Stobbe et al. 2019;Nesta et al. 2021;Bergstrom et al. 2022;Wong et al. 2022) but just as many have contended that they are CDNs driven by selection (Gartner et al. 2013;Chang et al. 2016;Bailey et al. 2018;Cannataro et al. 2018;Juul et al. 2021;Zhao et al. 2021;Zeng and Bromberg 2022).While the two views co-exist, they are in fact incompatible.If the mutational hotspot hypothesis is correct, the selection hypothesis, and the determination of CDNs, would not be needed.
We believe that this study is the first to comprehensively test of the null hypothesis of mutational hotspots.In PART I, the mutational characteristics near all putative CDNs are examined and PART II presents the probability theory based on the analyses.The conclusion is that it is possible to reject the null hypothesis for recurrences as low as 3 in the TCGA data.The main reason for the high sensitivity is shown in Eqs. 2 and 3 where Ai and Si is proportional to ሾ‫۳ܖ‬ሺ‫ܝ‬ሻሿ or, roughly 0.002 i .We recognized that the conclusion is based on what we currently know about mutation mechanisms.In a sense, the theory developed here can help the search for such unknown mutation mechanisms, if they do exist.
Finally, the theory developed would permit the explorations in several new fronts when the sample size, n, expands to 10 5 .
The first front is to identify (nearly) all CDNs.When n reaches 10 5 , any point mutation with a prevalence higher than 12/100,000 would be a CDN, which is 25-fold more sensitive than in the TCGA data (3/1000).The companion analysis suggests that CDNs with lower prevalence, say 12/100,000 may still be highly tumorigenic in patients with the said mutation.If prevalence and potence are indeed poorly correlated, the search for lower prevalence CDNs by increasing n to 10 5 is equivalent to searching for less common but still potent cancer driving mutations.
The second one is functional tests in patient-derived cell lines.When we have all CDNs identified, a patient can be expected to have multiple (≥5; Zhang et al. 2024) CDNs.These mutations will be the basis of in vitro test, as well as in animal model experiments, by gene editing, as shown recently (Hodis et al. 2022).Targeting multiple mutations simultaneously is necessary and may even be sufficient.
The third front, and arguably the most important one, is cancer treatment by targeted therapy (Dang et al. 2017;Danesi et al. 2021;Waarts et al. 2022;Lin et al. 2023;Zhou et al. 2023).When multiple CDN mutations in the same patient can be simultaneously targeted, the efficacy should be high.No less crucial, resistance to treatment should be diminished since it would be harder on cancer cells to evolve multiple escape routes to evade multiple drugs.Moreover, CDN analysis is crucial for stratifying patients for targeted therapy, as only targeting genes that are positively selected during cancer evolution can truly achieve therapeutic effects.
There will be other fronts to explore with the full set of CDNs.A large database will facilitate the detection of negative selection which has eluded detection (Q.Chen et al. 2022).Chen et al. have analyzed a curious phenomenon in somatic evolution, which they term "quasi-neutral evolution" (Chen et al. 2019).It will also be possible to study the evolution of mutation mechanisms in cancer cells based on such large samples ( Jackson and Loeb 1998;Ruan et al. 2020).This last topic is addressed in the Supplement.Finally, at the center of evolutionary genetics is the multi-genic interactions that control complex phenotypes such as human diseases (e.g.diabetes) (Vujkovic et al. 2020;Lagou et al. 2023;Xue et al. 2023;Suzuki et al. 2024), genetics of speciation (Q.Chen et al. 2022;Pan, Zhang, et al. 2022) and the emergence of viral strains (Deng et al. 2022;Pan, Liu, et al. 2022).
Cancers may be the first such complex genetic systems that can be unraveled thanks to the massively repeated evolution.As cancer genomics is increasingly adopted in cancer treatment, these benefits should become apparent when n reaches 10 5 for most cancer types.
frequencies exceeding 0.0005 in any subpopulation annotated by gnomAD.To exclude patients exhibiting hypermutator phenotypes, mutation loads were scaled to the whole exon level with ݊ ෦ ൌ ݊ ‫ڄ‬ , where ݊ and l represent the mutation load and genomic length of target sequencing region, respectively.L denotes the genome-wide whole coding region length.Patients with ݊ ෦ 3000 were subsequently excluded, consistent with the threshold employed for the TCGA dataset.
Calculation of missense and synonymous site number (LA and LS) The idea of missense or synonymous sites originates from the question that how many missense or synonymous mutations would be expected if each site of the genome were mutated once given background mutation patterns.Here, the background mutation patterns refer to intrinsic biases in the mutational process, such as the over-representation of C>T (or G>A) mutations at CpG sites due to spontaneous deamination of 5-methylcytosine.In coding regions, four-fold degenerate sites are generally considered neutral, as any mutation path would not alter the encoded amino acid sequence.
This analysis follows established methods at the single-base level to infer the expected number of missense and synonymous sites across the genome (Gojobori et al. 1982;Wu and Maeda 1987;Hartl and Clark 1989;Martincorena et al. 2017).
To illustrate the calculation process, we provide an example of synonymous site estimation.At four-fold degenerate sites throughout the genome, we tally the number of mutations from base m to base v as ݊ வ௩ (݉, ‫ݒ‬ ‫א‬ ሼ‫,ܣ‬ ‫,ܥ‬ ‫,ܩ‬ ܶሽ, ݉ ് ‫.)ݒ‬The likelihood of observing a mutation from reference base m to variant base v will be ‫ݎ‬ வ௩ ൌ ಭೡ ே , where ܰ represents the number of four-fold degenerate site with reference base m.We then normalize all likelihoods by ܴ வ௩ ൌ ಭೡ ∑ , where ∑ ‫ݎ‬ represents the sum of likelihoods across 12 possible mutation paths, ܴ வ௩ thus describes the relative probability of an occurred mutation to be ݉ ‫ݒ‬ at any site.For a given genomic coding region of length L, the synonymous site number will be: Eq. S1 with ߜ வ௩ ௦௬ being a Kronecker delta function where: Similarly, the expected number of missense sites LA, is calculated as follows: ‫ܮ‬ ൌ ߜ வ௩ ௦ ܴ வ௩ Eq. S2

Calculation of Ai and Si
For each mutated site, we track its number of recurrent mutations i (i > 0) across two mutation categories: missense and synonymous.Subsequently, we aggregate across entire coding region to count the number of sites harboring i missense mutations (Ai) and synonymous mutations (Si).For i = 0, we define A 0 and S 0 as the estimated number of potential missense and synonymous sites that remain unmutated within the current sample size.‫ܣ‬ ൌ ‫ܮ‬ െ ∑ ‫ܣ‬

AI-based mutation rate analysis
To capture the complex interplay of genomic and epigenomic factors influencing mutation susceptibility, we employed pre-trained artificial intelligence (AI) models from Dig, an aggregated tool combining deep learning and probabilistic models (Sherman et al. 2022).Downloaded from the Dig data portal (http://cb.csail.mit.edu/cb/DIG/downloads/),these models leverage a rich set of features encompassing both kilobase-scale epigenomic context (replicating timing, chromatin accessibility, etc.) and fine-grained base-pair level information (such as sequence context biases) to predict the site level mutation rate.For each cancer type, we re-fitted the pre-trained models with mutations analyzed in our study.The mutation rate for each site, scaled by population size, was obtained via the elementDriver function within Dig, and was represented by EXP_SNV from the final results.For a closer look at mutation rate landscape of PAX3, we re-fitted the AI-model with point mutations from large intestine cancer.The mutation rates were generated site-by-site for the coding regions of PAX3.
Given the scarcity of mutated sites with recurrence i ≥ 3 (comprising only 0.15% of all mutated sites), a rigorous statistical approach was adopted to assess the significance of mutability differences between these high-hit groups and low-hit groups.We implemented the procedure as follows: 1).Raw significance level: For each recurrence group i (containing A i sites), a one-sided Kolmogorov-Smirnov (K-S) test was employed to calculate a raw significance level (denoted as ‫‬ ) against the low-hit group.
2).Resampling for Significance Pool: we resample Ai sites from the entire pool of mutated sites with missense mutations.The significance ‫‬ from one-sided K-S test is calculated against the low hit group.The resampling process was repeated 100,000 times, generating a distribution of resampled significance levels, denoted as ൛‫‬ , ݆ ൌ 1,2, . .,100,000ൟ.
3).Adjusted Significance Level: the raw significance ‫‬ was then compared to the resampled significance pool ൛‫‬ , ݆ ൌ 1,2, . .,100,000ൟ.The proportion of ‫‬ ‫‬ was then calculated as the adjusted significance level that accounted for potential sampling effects.

Motif based mutability
For a given nucleotide base, we extended the sequence to each side by 1, 2, and 3 base pairs, producing sets of 3-mer, 5-mer, and 7-mer motifs, respectively.We then pooled point mutations from 12 cancer types to create a comprehensive dataset.For 3-mer and 5-mer motifs, we utilized synonymous mutations as the reference for mutation rate calculations.For 7-mer motifs, the vast number of possible sequence combinations (4 7 = 16,384) posed a challenge, as synonymous mutations alone might not adequately cover all potential contexts.To address this, we employed all singleton mutations (1-hit mutations) from both missense and synonymous categories for 7-mer motif analysis.
This decision was based on the assumption that singleton mutations are less affected by selective pressures, supported by the genome-wide observation that the ratio of missense to synonymous singletons (A 1 /S 1 ) approximates the ratio of unmutated missense to synonymous sites (A 0 /S 0 ).
The site number for 3-mer and 5-mer motifs of a given context c was calculated as follows: Significance for motif enrichment (Fig. 4B) mirrored the AI analysis.For each i ≥ 3 site, we calculated raw K-S p-values against motif mutabilities (denoted as ‫‬ ).These were then compared to a resampled significance pool ൛‫‬ , ݆ ൌ 1,2, . .,100,000ൟ, with the proportion of ‫‬ ‫‬ employed as the final p-value, depicting enrichment significance for highly mutable motifs in recurrence group i against low hits.

Consensus length comparison
To explore potential sequence motifs associated with recurrent mutations (i ≥ 3), we employ a sliding window of 10 bp stride to extract the local context from reference genome (Fig. S5).We examined diverse window sizes (21, 41, 61, 81, and 101 bp) to capture potential motifs of varying lengths and distances to the mutated site.Consensus length of local contexts was measured by Hamming distance in pairwise comparisons of aligned windows (with same stride) between mutated sites.
To prioritize sequence similarities likely driven by mutational mechanisms rather than functional constraints or gene structure, we restricted consensus comparisons to non-homologous genes.This approach effectively mitigated potential biases arising from homologous genes (e.g., KRAS and NRAS) or repeated domains within a single gene (e.g., FBXW7).The statistical significance of observed consensus lengths was assessed using the K-S test, which compared the empirical distribution of consensus lengths against a Poisson distribution, with mean of λ set to one-quarter of the window size, which reflects the expected distribution under random scenarios.an increased exposure to SBS11 (Blough et al. 2011;Lin et al. 2021;Noeuveglise et al. 2023).Again patients with higher mutation recurrences do not differ in their mutation signatures.] when n reaches 10 6 .In short, more CDNs (those with lower prevalence) will be discovered as n increases.Beyond n = 10 6 , there will be no gain.

Figure 1 :Figure 2 :Figure 3 :Figure 4 :Figure 5 :
Figure 1: Two modes of DNA sequence evolution.(A) A hypothetical example of DNA sequences in organismal evolution.(B) Cancer evolution that experiences the same number of mutations as in (A but with many short branches.(C) A common pattern of sequence variation in cancer evolution.(D cancer evolution, the same mutation at the same site may occasionally be seen in multiple sequencThe recurrent sites could be either mutational or functional hotspots, their distinction being the ma objective of this study. Figure 6: i * values (Y-axis, log scale) against sample sizes (n, X-axis) across different shape parame k's.The Y axis presents the i * values under different sample sizes (n of the X-axis) in log scale.Five shape parameters (k) of the gamma-Poisson model are used.In the literatures on the evolution of mutation rate, k is usually greater than 1.The inset figure illustrates how i * / n (prevalence) would decrease with increasing sample sizes.The prevalence would approach the asymptotic line of[ eter

Figure 7 :
Figure 7: Analysis of CDNs with expanded sample set in GENIE.(A) Schematic illustrating the impa of sample size expansion on the number of discovered CDNs.The two vertical lines show the cutoff i * /n at (3/1000) vs. (12/100,000).The Y axis shows that the potential number of site would decrea with i * /n, which is a function of selective advantage.The area between the two cutoffs below the lin represent the new CDNs to be discovered when n reaches 100,000.The power of n = 100,000 is eve larger if the distribution follows the blue dashed line.(B) The prevalence (i/n) of sites is well correlated between datasets of different n (TCGA with n < 1000 and GENIE with n generally 10-fold higher), as it should be.Sites are displayed by color."1-hit": CDNs identified in GENIE but remain in singleton in TCGA, "2-hit": CDNs identified in GENIE but present in doubleton in TCGA."CDN both": CDNs identified in both databases.(C-E) CDNs discovered in GENIE (n > 9000) but absent in TCGA < 1000).The newly discovered CDNs may fall in TCGA as 0 -2 hit sites.The numbers in the middle column show the percentage of lower recurrence (non-CDN) sites in TCGA that are detected as CDN in the GENIE database, which has much larger n's.

Table 1 :
An example of A i and S i (from lung cancer, n = 1035).Note -The ratio of Ai/ Si is provided as a measure of selection strength.

Table 2 :
Summary for modeling outlier sites in 6 cancer types For each cancer type, p stands for the proportion of highly mutable sites, with mutation rate being α-fold of the average.S 3 gives the expected number without mutable outliers (p = 0).S 4 and S 5 denote the expected number with the best (p, α) pairs with the standard deviation in parentheses.For lung cancer, S 2 and S 3 do not fit the outlier model (Supplement File S1); therefore, we set p = 0.