Abstract
Cancer genomes exhibit surprisingly weak signatures of negative selection1,2. This may be because tumors evolve under weak selective pressures (‘weak selection’) or because genome-wide linkage in cancer prevents most deleterious mutations from being removed due to Hill-Robertson interference3 (‘inefficient selection’). The weak selection model argues that most genes are only important for multicellular function and that selection acts only on a subset of essential genes. In contrast, the inefficient selection model predicts that only cancers with low mutational burdens, where linkage effects are minimal, will exhibit strong signals of negative selection against deleterious passengers and positive selection for beneficial drivers. We leverage the 10,000-fold variation in mutational burden across cancer subtypes to stratify tumors by their genome-wide mutational burden and use a normalized ratio of nonsynonymous to synonymous substitutions (dN/dS) to quantify the extent to which selection varies with mutation rate. We find that appreciable negative selection (dN/dS ~ 0.4) is present in tumors with a low mutational burden, while the remaining cancers (96%) exhibit dN/dS ratios approaching 1, suggesting that the majority of tumors do not remove deleterious passengers. A parallel pattern is seen in drivers, where positive selection attenuates as the mutational burden of cancers increases. Both trends persist across tumor types, are not exclusive to essential or housekeeping genes, are present in clonal and subclonal mutations, and persist in Copy Number Alterations. A consequence of this inability to remove deleterious passengers is that tumors with elevated mutational burdens, which are expected to harbor substantial protein folding stress, upregulate heat shock pathways.
Finally, using evolutionary modeling, we find that Hill-Robertson interference alone can reproduce the patterns of attenuated selection observed in both drivers and passengers if the average fitness cost of passengers is 1.0% and the average fitness benefit of drivers is 19%. As a result, despite the weak individual fitness effects of passengers, most cancers harbor a large mutational load (median ~40% total fitness cost). Collectively, our findings suggest that the lack of observed negative selection in most tumors is not due to relaxed selective pressures, but rather the inability of selection to remove individual deleterious mutations in the presence of genome-wide linkage.
Introduction
Tumor progression is an evolutionary process acting on somatic cells within the body. These cells acquire mutations over time that can alter cellular fitness by either increasing or decreasing the rates of cell division and/or cell death. Mutations which increase cellular fitness (drivers) are observed in cancer genomes more frequently because natural selection enriches their prevalence within the tumor population1,2. This increased prevalence of mutations across patients within specific genes is used to identify driver genes. Conversely, mutations that decrease cellular fitness (deleterious passengers) are expected to be observed less frequently. This enrichment or depletion is often measured by comparing the expected number of nonsynonymous mutations (dN) within a region of the genome to the expected number of synonymous mutations (dS), which are presumed to be neutral. This ratio, dN/dS, is expected to be below 1 when the majority of nonsynonymous mutations are deleterious and removed by natural selection, be approximately 1 when all nonsynonymous mutations are neutral, and can be greater than 1 when a substantial proportion of nonsynonymous mutations are advantageous.
Two recent analyses of dN/dS patterns in cancer genomes found that for most non-driver genes dN/dS is ~1 and that only 0.1–0.4% of genes exhibit detectable negative selection (dN/dS < 1)1,2. This differs substantially from patterns in human germline evolution, where most genes show signatures of negative selection (dN/dS ~ 0.4)1. Two explanations for this difference have been posited. First, the vast majority of nonsynonymous mutations may not be deleterious in somatic cellular evolution despite their deleterious effects on the organism. While most genes may be critical for proper organismal development and multicellular functioning, they may not be essential for clonal tumor growth. Under this hypothesis, negative selection (dN/dS < 1) should be observed only within essential genes and be absent elsewhere (dN/dS ~ 1). While this hypothesis is appealing in principle, most germline selection against nonsynonymous variants appears to be driven by protein misfolding toxicity4,5, in addition to gene essentiality. These damaging folding effects ought to persist in somatic evolution.
A second hypothesis is that even though many nonsynonymous mutations are deleterious in somatic cells, natural selection fails to remove them. One possible reason for this inefficiency is the unique challenge of evolving without recombination. Unlike sexually recombining germline evolution, tumors must evolve under genome-wide linkage that creates interference between mutations, known as Hill-Robertson interference, which reduces the efficiency of natural selection3. Without recombination to link and unlink combinations of mutations, natural selection must act on entire genomes, not individual mutations, and select for clones whose combinations of mutations have better aggregate fitness. Thus, advantageous drivers may not fix in the population if they arise on an unfit background and, conversely, deleterious passengers can fix if they arise on particularly fit backgrounds.
The inability of asexuals to eliminate deleterious passengers is driven by two Hill-Robertson interference processes: hitchhiking and Muller’s ratchet (Fig. 1A). Hitchhiking occurs when a strong driver arises within a clone already harboring several passengers. Because these passengers cannot be unlinked from the driver under selection, they are carried with the driver to a greater frequency in the population. Muller’s ratchet is a process where deleterious mutations continually accrue within different clones in the population until natural selection is overwhelmed. Whenever the fittest clone in an asexual population is lost through genetic drift, the maximum fitness of the population declines to the next most fit clone (Fig. 1A). The rate of hitchhiking and Muller’s ratchet both increase with the genome-wide mutation rate6,7. Therefore, the second hypothesis predicts that selection against deleterious passengers should be more efficient (dN/dS < 1) in tumors with lower mutational burdens.
Here, we leverage the 10,000-fold variation in tumor mutational burden across 50 cancer types to quantify the extent that selection attenuates, and thus becomes more inefficient, as the mutational burden increases. Using dN/dS, we find that selection against deleterious passengers and in favor of advantageous drivers is most efficient in low mutational burden cancers. Furthermore, low mutational burden cancers exhibit efficient selection across cancer subtypes, as well as within subclonal mutations, homozygous mutations, somatic copy-number alterations, and essential genes. Additionally, high-mutational burden tumors appear to mitigate this deleterious load by upregulating protein folding and degradation machinery. Finally, using evolutionary modeling, we find that Hill-Robertson interference alone can explain these observed patterns of selection. Modeling predicts that most cancers carry a substantial deleterious burden (~40%) that necessitates the acquisition of multiple strong drivers (~5) in malignancies that together provide a benefit of ~130%. Collectively, these results explain why signatures of selection are largely absent in cancers with elevated mutational burdens and indicate that the vast majority of tumors harbor a large mutational load.
Results
A nonparametric null model of mutagenesis in cancer
Mutational processes in cancers are heterogeneous, which can bias dN/dS estimates of selective pressures. To overcome this issue, it is essential to design a bias-corrected version of dN/dS in which observed counts are compared to what is expected under neutral evolution. It is also important to consider that mutational biases are often specific to cancer type and genomic region. Such corrections are generally accomplished using parametric mutation models, which can become very complex in cancer (exceeding 5,000 parameters in some cases1,8).
To circumvent these issues, we use a permutation-based, nonparametric (parameter-free) estimation of dN/dS. In this approach, every observed mutation is permuted while preserving the gene, patient sample, specific base change (e.g. A>T) and its tri-nucleotide context. Note that permutations do not preserve the codon position of a mutation and thus can change its protein coding effect (nonsynonymous vs synonymous). The permutations are then tallied for both nonsynonymous dN(permuted) and synonymous dS(permuted) substitutions (Fig. S1) and used as expected proportional values for the observed number of nonsynonymous dN(observed) (or simply dN) and synonymous dS(observed) (dS) mutations in the absence of selection. The bias-corrected effect of selection on a gene, dN/dS, is then:
dN/dS = [dN(observed) / dN(permuted)] / [dS(observed) / dS(permuted)]
For all cancer types and patient samples, P-values and confidence intervals are determined by bootstrapping patient samples. Note that this permutation procedure will account for gene and tumor-level mutational biases (e.g. neighboring bases9, transcription-coupled repair, S phase timing10, mutator phenotypes) and their covariation. We confirmed that this approach accurately measures selection in the presence of simulated mutational biases (Methods, Fig. S2) and variation in gene length (Fig. S3), and demonstrate that this approach identifies similar patterns of selection as parametric models (Fig. S3)1.
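The permutation procedure described above can be illustrated with a minimal Python sketch for a single toy gene. This is not the authors' pipeline: the function names, the toy coding sequence, and the simplifications (one strand only, context taken from the coding sequence itself, a single gene and sample) are assumptions made for illustration. Each observed substitution is moved to every site in the gene that shares its tri-nucleotide context, the same base change is applied there, and the resulting nonsynonymous and synonymous tallies serve as the neutral expectation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def coding_effect(cds, pos, alt):
    """Return 'S' if substituting `alt` at `pos` keeps the amino acid, else 'N'."""
    offset = pos % 3
    codon = cds[pos - offset:pos - offset + 3]
    mutant = codon[:offset] + alt + codon[offset + 1:]
    return "S" if CODON_TABLE[codon] == CODON_TABLE[mutant] else "N"

def same_context_sites(cds, pos):
    """Interior positions of the gene sharing the tri-nucleotide context of `pos`."""
    context = cds[pos - 1:pos + 2]
    return [p for p in range(1, len(cds) - 1) if cds[p - 1:p + 2] == context]

def permutation_dnds(cds, mutations):
    """dN/dS = [dN(observed)/dN(permuted)] / [dS(observed)/dS(permuted)].

    `mutations` is a list of (position, alternate base) pairs at interior sites.
    """
    observed = {"N": 0, "S": 0}
    permuted = {"N": 0, "S": 0}
    for pos, alt in mutations:
        observed[coding_effect(cds, pos, alt)] += 1
        # Exhaustively place the same base change at every matching context.
        for site in same_context_sites(cds, pos):
            permuted[coding_effect(cds, site, alt)] += 1
    return (observed["N"] / permuted["N"]) / (observed["S"] / permuted["S"])

# Hypothetical toy gene: context CCG occurs at position 2 (C>T is synonymous)
# and at position 7 (C>T is nonsynonymous).
TOY_CDS = "ACCGGACCG"
neutral = permutation_dnds(TOY_CDS, [(2, "T"), (7, "T")])
depleted = permutation_dnds(TOY_CDS, [(2, "T"), (2, "T"), (2, "T"), (7, "T")])
print(neutral, depleted)
```

In the toy gene, one mutation of each kind matches the neutral expectation, whereas three synonymous hits against one nonsynonymous hit yield a ratio below 1, as expected when nonsynonymous changes are depleted.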
Attenuation of selection in drivers and passengers for elevated mutational burden tumors
We estimated dN/dS patterns in both driver and passenger gene sets across 11,855 tumors from TCGA (whole-exome) and ICGC (whole-genome) aggregated over 50 cancer types (Methods). We used the following four mutational tallies as a proxy for the genome-wide mutation rate: (1) the total number of mutations, or tumor mutational burden (TMB); (2) the total number of observed substitutions in both synonymous and nonsynonymous sites (dN + dS) (Fig. 1); (3) the total number of mutations in intergenic regions; and (4) the total number of mutations in intronic regions. All estimates are strongly correlated (R2 > 0.97, Fig. S4).
In principle, only the last two tallies — the number of substitutions in intergenic or intronic regions — are orthogonal to dN/dS, and least likely to be biased by selection. However, these measures can only be applied to whole-genome datasets, which constitute only 15% of sequenced samples. Therefore, for most of the analyses, we used the second measure (dN + dS) to define mutational burden, while being cognizant that the analysis could be complicated by the fact that the same mutation tallies are used for both the x-axis (dN + dS) and y-axis (dN/dS). We note that this interdependence leads to a slight underestimation of the degree of purifying selection, rendering our analysis conservative (Fig. S5, Methods).
Consistent with the inefficient selection model, whereby selection fails to eliminate deleterious mutations in high mutational burden tumors, we observe pervasive selection against passengers exclusively in cancers with low mutational burdens (dN/dS ~ 0.4 in tumors with mutational burden ≤ 3, while dN/dS ~ 0.9 in tumors with mutational burden > 10, Fig. 2A). We observed little negative selection in passengers when aggregating tumors across all mutational burdens (dN/dS = 0.88), which is broadly similar to previous estimates1,2,8,11. Also consistent with the inefficient selection model, drivers exhibit a similar but opposing trend of attenuated selection at elevated mutational burdens (dN/dS ~ 3.5 when mutational burden ≤ 3 and gradually declines to ~1.38 when mutational burden > 100). This pattern is not specific to oncogenes or tumor suppressors (Fig. S6). While the attenuation of selection against passengers in higher mutational burden tumors is a novel discovery, this pattern among drivers has been reported previously1. We confirmed that these patterns are robust to the choices that we made in our analysis pipeline, including the: (1) somatic mutation calling algorithm (Mutect2 and MC3 SNP calls12, Fig. S3B), (2) dataset (TCGA13, ICGC14, COSMIC15 and an additional independent validation cohort; Fig. S3B and Fig. S3D), (3) effects of germline SNP contamination (Fig. S7), (4) choice of driver gene set (Bailey et al16, IntOGen17, and COSMIC15, Fig. S3B and Fig. S8), (5) mutational burden metric (Fig. S3A), (6) differences in tumor purity and thresholding (Fig. S9), and (7) null model of mutagenesis (dNdScv, Fig. S3C & S10)1 (Methods).
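The stratified analysis can be sketched as follows, assuming each tumor has already been reduced to observed and permuted tallies. The dictionary keys, bin edges, and toy numbers are hypothetical and do not come from the study's data. Tumors are binned by dN + dS, dN/dS is pooled within each bin, and a confidence interval is obtained by resampling patients with replacement.

```python
import random

def pooled_dnds(tumors):
    """Pooled dN/dS over a group of tumors, using summed observed and permuted tallies."""
    dn = sum(t["dN"] for t in tumors)
    ds = sum(t["dS"] for t in tumors)
    exp_n = sum(t["dN_perm"] for t in tumors)
    exp_s = sum(t["dS_perm"] for t in tumors)
    return (dn / exp_n) / (ds / exp_s)

def stratify(tumors, edges):
    """Bin tumors by burden (dN + dS); bin k holds edges[k-1] < burden <= edges[k]."""
    bins = [[] for _ in range(len(edges) + 1)]
    for t in tumors:
        burden = t["dN"] + t["dS"]
        bins[sum(burden > e for e in edges)].append(t)
    return bins

def bootstrap_ci(tumors, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval obtained by resampling patients with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        pooled_dnds(rng.choices(tumors, k=len(tumors))) for _ in range(n_boot)
    )
    return stats[int(n_boot * alpha / 2)], stats[int(n_boot * (1 - alpha / 2)) - 1]

# Hypothetical toy cohort: low-burden tumors are depleted of nonsynonymous
# mutations relative to the permuted expectation; high-burden tumors are not.
low = [{"dN": 1, "dS": 1, "dN_perm": 3, "dS_perm": 1} for _ in range(10)]
low += [{"dN": 2, "dS": 1, "dN_perm": 9, "dS_perm": 3} for _ in range(10)]
high = [{"dN": 30, "dS": 10, "dN_perm": 30, "dS_perm": 10} for _ in range(5)]

bins = stratify(low + high, edges=[3, 10])
low_estimate = pooled_dnds(bins[0])
high_estimate = pooled_dnds(bins[2])
ci_low, ci_high = bootstrap_ci(bins[0])
print(low_estimate, (ci_low, ci_high), high_estimate)
```

Tumors with zero synonymous or nonsynonymous mutations are assumed to have been excluded beforehand, so the pooled denominators are never zero.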
If negative selection is more pronounced in low mutational burden tumors, then the nonsynonymous mutations observed should also be less functionally consequential. By annotating the functional effect of all missense mutations using PolyPhen218 (Fig 2B), we indeed find that observed nonsynonymous passengers are less damaging in low mutational burden cancers. Similarly, driver mutations become less functionally consequential as mutational burden increases, as expected for mutations experiencing inefficient positive selection (Fig 2B). Together these two trends provide additional and orthogonal evidence that selective forces on nonsynonymous mutations are more efficient in low mutational burden cancers.
Since all mutational types experience Hill-Robertson interference, attenuated selection should also persist in Copy Number Alterations (CNAs). Since CNAs cannot be partitioned into synonymous and nonsynonymous events, but can still disrupt protein function and dosage, we quantified selection in CNAs using two alternative measures: Breakpoint Frequency19 and Fractional Overlap20. For both measures, we compare the number of CNAs that either terminate (Breakpoint Frequency) within or partially overlap (Fractional Overlap) Exonic regions of the genome relative to non-coding (Intergenic and Intronic) regions (dE/dI, See Methods). Like dN/dS, dE/dI is expected to be <1 in genomic regions experiencing negative selection, >1 in regions experiencing positive selection (e.g. driver genes), and approximately 1 when selection is absent or inefficient (Fig. S23). Using dE/dI, we observe attenuating selection in both driver and passenger CNAs as the total number of CNAs increases for both Breakpoint Frequency (Fig. 2C) and Fractional Overlap (Fig. S11). While CNAs of all lengths experience attenuated selection, CNAs longer than the average gene length (>100 KB) experience greater selective pressures in drivers (p < 10−4).
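A Breakpoint Frequency version of dE/dI can be sketched as below. The normalization by region length is an assumption made here for illustration, since the exact definition is given in the Methods; the function names and coordinates are hypothetical. The sketch counts CNA breakpoints that fall inside exons and compares their density with the density in the remaining non-coding sequence.

```python
from bisect import bisect_right

def in_intervals(pos, starts, ends):
    """True if `pos` lies in one of the sorted, non-overlapping [start, end) intervals."""
    k = bisect_right(starts, pos) - 1
    return k >= 0 and pos < ends[k]

def de_di(breakpoints, exons, region_length):
    """Ratio of exonic to non-coding breakpoint density (length-normalized, an assumption)."""
    starts = [s for s, _ in exons]
    ends = [e for _, e in exons]
    exonic_bp = sum(e - s for s, e in exons)
    noncoding_bp = region_length - exonic_bp
    n_exonic = sum(in_intervals(b, starts, ends) for b in breakpoints)
    n_noncoding = len(breakpoints) - n_exonic
    return (n_exonic / exonic_bp) / (n_noncoding / noncoding_bp)

# Hypothetical 1 kb region with two 100 bp exons (20% exonic).
exons = [(100, 200), (500, 600)]
depleted = de_di([150, 50, 250, 300, 450, 650, 700, 800, 900], exons, 1000)
neutral = de_di([150, 550, 50, 250, 300, 450, 650, 700, 800, 900], exons, 1000)
print(depleted, neutral)
```

With one of nine breakpoints in exons the ratio is 0.5, and with two of ten it is 1, matching the interpretation of dE/dI given in the text.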
Collectively, these results suggest that tumors with elevated mutational burdens carry a substantial deleterious load. Since nonsynonymous mutations are thought to be primarily deleterious by inducing protein misfolding4,5, we tested whether an increase in the number of passenger mutations in tumors would lead to elevated protein folding stress, and, in turn, drive the upregulation of heat shock and protein degradation21 pathways in cancer22. Indeed, gene expression of HSP90, Chaperonins, and the Proteasome does increase across the whole range of SNV (weighted R2 of 0.83, 0.77, and 0.75 respectively) and CNA burdens (weighted R2 of 0.78, 0.87 and 0.84, respectively) (Fig. 2G and S22). This trend persists across cancer types for SNVs and CNAs (Fig. S22). Importantly, expression of these gene sets increases across the whole range of mutational burdens, even after dN/dS approach 1. This result presents additional evidence that passengers continue to impart a substantial cost to cancer cells, even in high mutational burden tumors, which must be overcome for tumors to progress.
Strong selection in low mutational burden tumors cannot be explained by mutational timing, gene function, nor tumor type
We next tested alternative hypotheses to the inefficient selection model. We considered the possibility that selection is strong only during normal tissue development, but absent after cells have transformed to malignancy. This would disproportionately affect low mutational burden tumors, as a greater proportion of their mutations arise prior to tumor transformation. If true, then attenuated selection should be absent in sub-clonal mutations, which must arise during tumor growth. However, selection clearly attenuates for the subset of likely subclonal mutations with Variant Allele Frequency (VAF) below 20% (Fig. 2D & S12). Although selection attenuates in drivers and passengers in both sub-clonal and clonal mutations, selection is weaker in both drivers and passengers with lower VAFs. Weaker efficiency of selection among less frequent polymorphisms is expected under a range of population genetic models23 and especially so in rapidly-expanding, spatially-constrained cancers24. In addition, heterozygous mutations, which are only partially-dominant25, are also expected to exhibit lower VAFs.
Next, we considered and rejected the possibility that attenuated selection is limited to particular types of genes. We first annotated our observed mutations by different functional categories and Gene Ontology (GO) terms26 and find that negative selection is not specific to any particular gene functional category, and specifically not limited to essential or housekeeping genes — a key prediction of the ‘weak selection’ model1 (Fig. S13, p < 0.05, Wilcoxon signed-rank test).
Finally, we found that these patterns of attenuated selection persist across cancer subtypes for both SNVs and CNAs. We calculated dN/dS in tumors grouped by nine broad anatomical sub-categories (e.g. neuronal) and 50 subtype classifications27 (Fig. 2E-F). We find that patterns of attenuated selection in SNVs persist in the broad and specific (drivers p = 1.4 × 10−5, passengers p = 1.3 × 10−2, Wilcoxon signed-rank test; Fig. S14) classification schemes. Furthermore, dE/dI measurements of CNAs exhibit these same patterns of selection in broad (Fig. S15) and specific subtypes (Fig. 2F; drivers p < 10−6 and passengers p = 7.3 × 10−4). Collectively, these results strongly support the inefficient selection model and argue that the observed patterns must be due to a universal force in tumor evolution.
Evolutionary modeling estimates the fitness effects of drivers and passengers, and rate of Hill-Robertson interference processes
Our findings indicate that selection consistently attenuates in both drivers and passengers across all cancers as mutational burden increases. To determine whether Hill-Robertson interference alone can explain these findings, we modeled tumor progression as a simple evolutionary process with advantageous drivers and deleterious passengers. We then used Approximate Bayesian Computation (ABC) to compare these simulations to observed data and infer the mean fitness effects of drivers and passengers.
Our evolutionary simulations model a well-mixed population of tumor cells that can randomly acquire advantageous drivers and deleterious passengers during cell division28. The product of the individual fitness effects of these mutations determines the relative birth and death rate of each cell, which in turn dictates the population size N of the tumor. If the population size of a tumor progresses to malignancy (N > 1,000,000) within a human lifetime (≤100 years), the accrued mutations and patient age are recorded. The mutation rate of each simulated tumor is randomly sampled from a broad range (10^−12 to 10^−7 mutations · nucleotide^−1 · generation^−1, Methods).
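The following is a deliberately simplified sketch of this kind of simulation, not the published simulation code. It assumes discrete generations, a multiplicative fitness of (1 + s_d)^n_d / (1 + s_p)^n_p, a death rate that rises linearly with crowding, and per-division mutation probabilities `mu_d` and `mu_p` in place of per-nucleotide rates; all names and default values are illustrative.

```python
import random

def fitness(n_drivers, n_passengers, s_d, s_p):
    """Multiplicative fitness: drivers raise and passengers lower the birth rate."""
    return (1 + s_d) ** n_drivers / (1 + s_p) ** n_passengers

def step(cells, rng, s_d, s_p, mu_d, mu_p, capacity):
    """One generation: each cell either divides into two daughters or dies."""
    n = len(cells)
    next_cells = []
    for n_d, n_p in cells:
        birth = fitness(n_d, n_p, s_d, s_p)
        death = n / capacity  # crowding increases the death rate
        if rng.random() < birth / (birth + death):
            for _ in range(2):
                # Each daughter may gain one new driver and/or passenger.
                next_cells.append(
                    (n_d + (rng.random() < mu_d), n_p + (rng.random() < mu_p))
                )
    return next_cells

def simulate(seed, s_d=0.188, s_p=0.0096, mu_d=1e-3, mu_p=5e-2,
             capacity=100, threshold=300, max_generations=200):
    """Run until the clone exceeds `threshold` cells, dies out, or time runs out."""
    rng = random.Random(seed)
    cells = [(0, 0)] * capacity
    for generation in range(max_generations):
        if not cells:
            return "extinct", generation, cells
        if len(cells) > threshold:
            return "malignant", generation, cells
        cells = step(cells, rng, s_d, s_p, mu_d, mu_p, capacity)
    return "unresolved", max_generations, cells

status, generations, cells = simulate(seed=1, max_generations=50)
print(status, generations, len(cells))
```

Under these assumptions the expected number of daughters per cell equals one when the population size is `capacity` times the cell's fitness, so accumulating drivers enlarges the population and accumulating passengers shrinks it.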
Figure 3A illustrates the ABC procedure. To compare our model to observed data, we simulated an exponential distribution of fitness effects with mean fitness values that spanned a broad range (10^−2 to 10^0 for drivers and 10^−4 to 10^−2 for passengers, Methods). We summarized observed and simulated data using statistics that capture three relationships: (i) the dependence of driver and passenger dN/dS rates on mutational burden, (ii) the rate of cancer age-incidence (SEER database29), and (iii) the distribution of mutational burdens (summary statistics of (ii) and (iii) were based on theoretical parametric models30, Methods, Fig. S16 & S17). We then inferred the posterior probability distribution of mean driver fitness benefit and mean passenger fitness cost using a rejection algorithm that we validated using leave-one-out Cross Validation (Methods, Fig. S18).
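The rejection step of ABC can be illustrated with a toy problem: recovering the mean of an exponential distribution of fitness effects from a single summary statistic. This is a generic sketch and not the inference pipeline used here; the simulator, the log-uniform prior bounds, the distance function, and the number of retained draws are all assumptions chosen for illustration.

```python
import math
import random
import statistics

def simulate_summary(mean_effect, rng, n=200):
    """Toy simulator: sample mean of n exponentially distributed fitness effects."""
    return sum(rng.expovariate(1.0 / mean_effect) for _ in range(n)) / n

def abc_rejection(observed, rng, n_sims=2000, keep=50, lo=1e-2, hi=1.0):
    """Draw parameters from a log-uniform prior and keep the `keep` closest simulations."""
    draws = []
    for _ in range(n_sims):
        theta = 10 ** rng.uniform(math.log10(lo), math.log10(hi))
        distance = abs(math.log(simulate_summary(theta, rng)) - math.log(observed))
        draws.append((distance, theta))
    draws.sort()
    return [theta for _, theta in draws[:keep]]

rng = random.Random(0)
observed = simulate_summary(0.188, rng)  # stand-in for an observed statistic
posterior = abc_rejection(observed, rng)
estimate = statistics.median(posterior)
print(estimate)
```

The retained parameter draws approximate the posterior distribution. In a real application the scalar summary would be replaced by the vector of dN/dS, age-incidence, and mutational burden statistics described above.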
Using this approach, the Maximum Likelihood Estimate (MLE) of mean driver fitness benefit is 18.8% (Fig. 3B), while the MLE of passenger mean fitness cost is 0.96% (Fig. 3C). Simulations with these MLE values agree well with all observed data (Fig. 3D-F, Pearson’s R = 0.95, 0.80, 0.99, 0.97 for driver dN/dS, passenger dN/dS, Age-Incidence, and Mutational Burden respectively).
While Hill-Robertson interference alone explains dN/dS rates in the passengers well, the simulations most consistent with observed data still exhibited consistently higher dN/dS rates in drivers (Fig. 3D). We tested whether positive selection on synonymous mutations within driver genes could explain this discrepancy. Indeed, we find that a model incorporating synonymous drivers agrees modestly better with observed statistics (p = 0.043, ABC posterior probability). The best-fitting model predicts that ~10% of synonymous mutations within driver genes experience positive selection, which is consistent with previous estimates for human oncogenes31 (Methods, Fig. 3D, S19). Furthermore, we observe additional evidence of selection and codon bias in synonymous drivers exclusive to low mutational burdens (TCGA samples, Methods, Fig. S19). Lastly, we considered and rejected the possibility that the attenuation of selection in drivers could be due to a diminishing benefit of additional drivers (akin to a 5-hit multistage model30, Methods, p > 0.5, ABC posterior probability).
Our results indicate that rapid adaptation through natural selection – acting on entire genomes, rather than individual mutations – is pervasive in all tumors, including those with elevated mutational burdens. Given the quantity of drivers and passengers observed in a typical cancer (TCGA), we estimate that cancer cells are in total ~90% fitter than normal tissues (130% total benefit of drivers, 40% total cost of passengers). These values are larger than estimates from evolutionary models that assume that passengers are neutral (~0.001%)32, but of the same order of magnitude as estimates from models that assumed passengers were deleterious (~10%)33. Furthermore, direct experimental measurements in Cre-inducible mouse models of tumorigenesis also find similarly strong driver benefits at 1-27% 34–36. A median of five drivers accumulate per tumor in these simulations – also consistent with estimates from age-incidence curves and known hallmarks of cancer37. Lastly, the mutation rates of tumors that could progress to cancer in our model also recapitulate observed mutation rates in human cancer38 (median 3.7 × 10−9, 95% Interval 1.1 × 10−10 - 8.2 × 10−8, Fig. S20).
Most notably, aggregate passenger load confers a fitness cost of ~40%. While this collective burden is large, the individual fitness effects of accumulated passengers in these simulations (mean 0.8%) are similar to observed fitness costs in cancer cell lines (1 - 3%)39 and the human germline (0.5%)40. These passengers accumulated primarily via Muller’s Ratchet, while only ~14% accumulated via hitchhiking (inferred using population genetics theory28 and MLE fitness effects, Methods, Fig. S21).
Discussion
Here we argue that signals of selection are largely absent in cancer because of the inefficiency of selection and not because of weakened selective pressures. In low mutational burden tumors (≤ 10 total substitutions per tumor), increased selection for drivers and against passengers is observed and ubiquitous: in SNVs and CNAs; in heterozygous, homozygous, clonal, and subclonal mutations; and in mutations predicted to be functionally consequential. These trends are not specific to essential or housekeeping genes. Importantly, these patterns persist across broad and specific tumor subtypes. Collectively, these results suggest that inefficient selection is generic to tumor evolution and that deleterious load is a nearly-universal hallmark of cancer.
Importantly, these patterns of selection are missed when dN/dS rates are not stratified by mutational burden. Since only 0.1% of mutations in TCGA and ICGC reside within low mutational burden tumors (4% of all tumors, N=563), the dN/dS of passengers at low mutational burdens (~0.4 - 0.8) do not appreciably alter the pan-cancer dN/dS of passengers (0.88 in our study, 0.82 — 0.98 in 1,2,8,11). Thus, these patterns can only be detected now given the vast amounts of available cancer sequencing data. While only 4% of tumors exhibit substantial negative selection, selection in drivers, selection on CNAs, and expression patterns of chaperones and proteasome components all show a continuous response to deleterious passenger load across a broad range of mutational burdens. Collectively, this suggests that passengers continue to be deleterious even in high mutational burden tumors. Nevertheless, we believe that low mutational burden tumors are uniquely valuable for identifying genes and pathways under positive and negative selection.
Using a simple evolutionary model, we show that Hill-Robertson Interference alone can explain this ubiquitous trend of attenuated selection in both drivers and passengers. dN/dS rates attenuate in drivers because the background fitness of a clone becomes more important than the fitness effects of an additional driver at elevated mutation rates. Furthermore, these simulations indicate that, despite dN/dS patterns approaching 1 in tumors with elevated mutational burdens, passengers are not effectively neutral (Ns > 1). Instead, passengers confer an individually-weak, but collectively-substantial fitness cost of ~40% that measurably impacts tumor progression. While this simple evolutionary model does not explicitly incorporate many known aspects of tumor biology (e.g. haploinsufficiency, see Table S2), we note that selection’s efficiency in cancer is further reduced when spatial constraints are considered24.
The functional explanation for why passengers in cancer are deleterious is unknown. In germline evolution, mutations are believed to be primarily deleterious because of protein misfolding4,5. Deleterious passengers in somatic cells should confer similar effects. Indeed, we find that elevated mutational burden tumors may buffer the cost of deleterious mutations by upregulating multiple heat-shock pathways. However, deleterious passengers may carry additional costs to cancers (e.g. immunoediting41) or be buffered by additional mechanisms. Understanding and identifying how tumors manage this deleterious burden should identify new cancer vulnerabilities that enable new therapies and better target existing ones41–43.
Methods & Supplementary Materials
Data Availability
Exonic, open-access SNV calls (WES) of 10,486 cancer patients in The Cancer Genome Atlas (TCGA) were downloaded from the Multi-Center Mutation Calling in Multiple Cancers (MC3) project12. This repository uses a consensus of seven mutation-calling algorithms. Whole-Genome Sequencing SNV calls (WGS) of 1,830 patients were downloaded from the ICGC data portal in November 201844. For supplemental analyses of the effect of variant callers, SNVs from exome and whole-genome screens were downloaded in October 2016 from the Catalog of Somatic Mutations in Cancer’s (COSMIC) Mutant Export Census15. Expression data of SNVs were downloaded from the Genotype-Tissue Expression (GTEx) project (v7 release)45. All CNAs were downloaded from the COSMIC database in June 201515. Gene expression data compared to CNAs were downloaded from the COSMIC database in September 2019. To validate our findings, additional WES and WGS SNV calls were downloaded from cBioPortal in February 2019 from 1,786 treatment-naive, tumor-normal sample pairs across 17 studies of varying cancer types46–62. Formalin-Fixed Paraffin Embedded (FFPE) samples were removed.
Code Availability
All code for the simulations, associated theoretical analysis, and generation of summary statistics will be made publicly available under the open-source MIT License upon publication. Code for simulations of tumor growth with advantageous drivers and deleterious passengers is currently available at https://github.com/mirnylab/pdSim.
Mutation calling and quality controls
Mutations were downloaded from online repositories that have already invested heavily in quality control. Multiple data repositories were used to ensure reproducibility. Post-processing was minimal to avoid engendering a particular result, and only excluded sequencing samples obtained from cell lines, or studies that did not report synonymous variants, or (on occasion) mutations within pseudogenes. These exclusions are described in greater detail below.
Somatic Nucleotide Variants (SNVs)
Only consensus mutation calls from the PCAWG Consensus SNV-MNV caller were considered. Both missense and nonsense mutations are defined as nonsynonymous mutations. Frameshift, indel, and splice-site variants were not included in analyses. Samples without any synonymous or nonsynonymous mutations and unexpressed genes in either dataset were excluded. We found no evidence of germline contamination by common SNPs (MAF > 5%) from the 1,000 Genomes Project63 (v 2015 Aug) when annotating mutations in either dataset with ANNOVAR64 (Fig. S7). A final total of 1,703 whole-genome and 10,152 whole-exome sequencing samples were used for the analyses in this paper. In SNV data collected from COSMIC, studies before 2010 that did not report silent mutations, as well as cell lines, were removed from analysis. Whole-exome SNVs in TCGA were also called using Mutect265 (Fig. S3B).
Defining tumor burden
We tested four different mutation burden metrics as a proxy for the genome-wide mutation rate: (1) the total number of observed mutations, (2) the total number of substitutions in both synonymous and nonsynonymous sites (dN(observed) + dS(observed)), (3) the total number of mutations in intergenic regions, and (4) the total number of mutations in intronic regions. Although only the last two definitions of mutational burden are completely independent of dN/dS, the vast majority of samples (10,152 vs 1,703) are derived from whole-exome data. We note that all four metrics are strongly correlated with each other (R2 > 0.97). Because only dN + dS could be applied to WES data (the majority of samples) and all metrics worked equally well, we primarily used dN + dS to measure mutational burden. Lastly, because dN/dS is undefined for tumors with no synonymous mutations, we necessarily excluded these samples. We also excluded samples with no nonsynonymous mutations so as to apply a symmetric filter on the data and because data quality may be compromised in these samples. Inclusion of samples with zero synonymous mutations or zero nonsynonymous mutations did not appreciably alter observed trends in the TCGA and ICGC datasets (Fig. S5D).
A Nonparametric Null Model of Mutagenesis to calculate dN/dS
We assume that for any particular tumor, mutation rates are constant across a gene for a particular tri-nucleotide context and base change (e.g. C > G). Our procedure is inspired by Constrained Marginal Models (or ‘edge switching’ in network analysis), whereby the marginal distributions of observations aggregated over known confounding variables are preserved under permutation to create a null distribution. In our application of this strategy, the marginal distributions of mutations (across tri-nucleotide context, base change, gene, and tumor) remain preserved – as they would be in a Constrained Marginal Model; however, we exhaustively consider every acceptable permutation of the data. Because our approach is highly-constrained, these permutations are exhaustively computable (median 36 alternatives per mutation). Thus, resampling is unnecessary.
Our null model presumes that all mutations of type i, defined by a tri-nucleotide context and base change, arise with probability Migt within each gene g and tumor t. For each gene, we tally the total quantity of nonsynonymous mutations Nig and synonymous mutations Sig. Suppose selection enriches or depletes nonsynonymous mutations within a gene and tumor by a rate ωgt. The expected numbers of nonsynonymous and synonymous mutations within a particular tumor and gene are E[dN] = ωgt ∑i MigtNig and E[dS] = ∑i MigtSig in the absence of selective pressures on synonymous mutations. As with the main text, dN and dN(observed) are used interchangeably. Although Migt is unknown, dN/dS statistics attempt to infer selection nonetheless by noting that:
$$\frac{E[dN]}{E[dS]} = \omega_{gt}\,\frac{\sum_i M_{igt} N_{ig}}{\sum_i M_{igt} S_{ig}} = \omega_{gt}\,\frac{\langle M, N \rangle}{\langle M, S \rangle} = \omega_{gt}\,\frac{\rho_{MN}\,\lVert N \rVert}{\rho_{MS}\,\lVert S \rVert}$$
Note that ρAB = < A, B >/(‖A‖‖B‖), where ρAB is the Pearson product-moment correlation coefficient. When ρMN ≈ ρMS,
$$\frac{dN / \lVert N \rVert}{dS / \lVert S \rVert} \approx \omega_{gt}$$
I.e. dN/dS is approximately equal to the selective pressures on nonsynonymous mutations when the accessible nonsynonymous and synonymous loci are properly accounted for and when the correlation between mutational processes and nonsynonymous loci is roughly equivalent to the correlation between mutational processes and synonymous loci. Traditionally, this assumption was used to calculate dN/dS. To improve resolution of dN/dS, researchers have attempted to account for these correlations using sophisticated parametric models of Migt. An alternative statistical approach, however, is to treat these correlations as nuisance parameters.
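The decomposition of E[dN]/E[dS] into ωgt and the two correlation terms can be checked numerically. The sketch below is illustrative only: the 96 mutation classes, the random site tallies, and the value of omega are assumptions chosen for the example, not values from this study.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def rho(a, b):
    # <A, B> / (|A| |B|), the correlation term as written in the text.
    return dot(a, b) / (norm(a) * norm(b))

random.seed(0)
n_types = 96                      # hypothetical number of context/base-change classes
M = [random.uniform(1e-9, 1e-7) for _ in range(n_types)]  # mutation probabilities
N = [random.randint(1, 50) for _ in range(n_types)]       # nonsynonymous tallies
S = [random.randint(1, 20) for _ in range(n_types)]       # synonymous tallies
omega = 0.6                       # assumed strength of selection

expected_dN = omega * dot(M, N)
expected_dS = dot(M, S)

ratio = expected_dN / expected_dS
decomposed = omega * (rho(M, N) * norm(N)) / (rho(M, S) * norm(S))
# Normalizing by the available loci leaves omega times the ratio of correlations.
normalized = (expected_dN / norm(N)) / (expected_dS / norm(S))
```

Here `normalized` equals `omega` only to the extent that `rho(M, N)` and `rho(M, S)` agree, which is the assumption the permutation approach is designed to avoid.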
Constrained Marginal Models permute observed data in all possible manners that preserve the underlying covariance structure of the data (e.g. ρMN, ρMS). In our particular case of this method, we note that by definition, . Thus:
Hence, by dividing the observed mutations by all permutations, we eliminate the covariance of mutational processes with available loci and, thus, measure ωgt directly for any particular gene-tumor combination without mutational bias.
Unfortunately, because of the log-sum inequality, mutational bias can arise once cohorts of genes and cohorts of tumor samples are binned. This problem is common to all dN/dS measures and is a consequence of the correlation of mutational biases with selection (i.e. < Migt, ωgt >), not the correlation of mutational biases with one another, as these covariances are already accounted for in a Constrained Marginal Model. For example, if tri-nucleotide biases covary linearly with gene-level biases, and are independent of tumor-level biases, then a parametric estimate of Migt may deconstruct Migt into Migt = f(i, g, t, ρig), where ρig is the covariation of tri-nucleotide mutational biases with gene-level biases. Nonetheless, < Migt, ωgt > ∝ < ρig, ωgt > will still be ignored. Indeed, this covariation of mutational processes with selective forces is the focus of our current study: selection and genome-wide mutation rate are correlated (i.e. ∑t Migtωgt ≠ 0) because of Hill-Robertson Interference. Hence, the level at which observed dN and dS values are binned necessarily ignores covariation between mutational processes and selection (in addition to any variation of ωgt within the bin). Another example of this binning challenge arises when positive and negative selection act on different regions of the same gene, which gene-level dN/dS binning can misinterpret as neutral evolution.
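The binning problem can be illustrated with a toy calculation. In the sketch below, two hypothetical tumors differ in both mutation rate and selection strength. All numbers are assumptions chosen for the example, and `r` stands for an assumed ratio of nonsynonymous to synonymous loci.

```python
def pooled_dnds(tumors, r):
    """dN/dS after summing expected dN and dS tallies across tumors."""
    dn = sum(t["omega"] * t["rate"] * r for t in tumors)  # expected nonsynonymous
    ds = sum(t["rate"] for t in tumors)                   # expected synonymous
    return dn / (ds * r)

r = 2.8  # assumed ratio of nonsynonymous to synonymous loci
tumors = [
    {"rate": 1.0, "omega": 0.4},    # low mutation rate, strong negative selection
    {"rate": 100.0, "omega": 1.0},  # high mutation rate, no apparent selection
]

pooled = pooled_dnds(tumors, r)
mean_omega = sum(t["omega"] for t in tumors) / len(tumors)
```

The pooled value is a mutation-weighted average of ω, so it sits close to 1 even though the unweighted mean ω is 0.7. This is why tumors are stratified by mutational burden before dN/dS is computed.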
Validation of nonparametric null model
To confirm that our null model can accurately estimate dN/dS even in the presence of extreme tri-nucleotide mutational biases, we simulated artificial data where different COSMIC signatures15 (SBS Signatures 1-9, v3) contribute to all of the mutations. Permuted dN and dS tallies for each mutational context were simulated by randomly sampling 1,000 genes with the same mutational context. The fraction of permuted dN and dS tallies for each mutational context was used as weighted probabilities to derive observed dN and dS tallies. To simulate negative selection, dN counts were randomly removed from each context at a rate 1 - ωgt (e.g. a simulated ‘true’ dN/dS of 0.8 in a cohort of samples indicates a 20% chance of nonsynonymous mutations being removed in the samples). These simulated (true) rates were then compared to observed and permuted dN and dS tallies according to the dN/dS metric that we used throughout this study:
$$dN/dS = \frac{dN(\mathrm{observed}) / dN(\mathrm{permuted})}{dS(\mathrm{observed}) / dS(\mathrm{permuted})}$$
We confirmed that this approach accurately measures selection in the presence of simulated mutational biases (Fig. S2).
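A minimal sketch of this kind of validation is given below. It assumes a single mutational context in which a fixed fraction of permuted sites is nonsynonymous. The function names, the fraction 0.74, and the number of mutations are illustrative assumptions rather than the settings used for Fig. S2.

```python
import random

def dnds(dn_obs, ds_obs, dn_perm, ds_perm):
    """Observed tallies normalized by permuted tallies."""
    return (dn_obs / dn_perm) / (ds_obs / ds_perm)

def simulate(omega, n_mutations, frac_nonsyn, rng):
    """Draw mutations, then remove nonsynonymous ones with probability 1 - omega."""
    dn = ds = 0
    for _ in range(n_mutations):
        if rng.random() < frac_nonsyn:
            if rng.random() < omega:   # the mutation survives negative selection
                dn += 1
        else:
            ds += 1
    return dn, ds

rng = random.Random(1)
frac_nonsyn = 0.74   # assumed nonsynonymous share among permuted sites
true_omega = 0.8
dn, ds = simulate(true_omega, 200_000, frac_nonsyn, rng)
estimate = dnds(dn, ds, frac_nonsyn, 1.0 - frac_nonsyn)
```

Because only the ratio of permuted tallies matters, the permuted fractions can stand in for the permuted counts in this example.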
The number of permutations available for each gene/tri-nucleotide combination declines with gene length. Ultra-short genes may be too constrained for our permutation approach and underestimate selective pressures. While 12% of genes in our study harbored fewer than 10 permutations per mutation, these genes contained only ~ 3% of all mutations, as these genes are exceptionally short. Exclusion of these genes did not appreciably alter observed dN/dS patterns (Fig. S3E).
Mutations can be permuted across every identical tri-nucleotide context within a particular gene or every identical tri-nucleotide context within a particular transcript. For differentially spliced genes, transcript and gene annotations differ: each transcript comprises a subset of the exons that define the whole gene. Hence, WES data directly sequences transcripts, which can be overlaid along the genome to infer genes. Because transcript annotations directly match WES data, which comprises 85% of available samples, we chose to constrain permutations at the transcript level (ENST) rather than the gene level (ENSG or Hugo Symbols)66. This choice does not appreciably affect dN/dS patterns (Fig. S25); however, there is a slight universal shift towards a dN/dS of 1 (in both drivers and passengers) when permuting at the gene level. Presumably, this is because exons exclusive to rare splicing variants experience weaker selective pressures (and/or less transcription-coupled DNA repair). The subtle differences between gene-level and transcript-level null models may explain the subtle difference in genome-wide dN/dS levels between our approach and the dNdScv model1 (Fig. S3C).
Lastly, we note that binning nonsynonymous and synonymous mutations at the genome-wide level (e.g. drivers and passengers) provided the most robust estimates of dN/dS when bootstrapping observed tumor samples. Statistical power is insufficient when binning at the individual gene level. Bootstrapping also demonstrated that log transformation of dN/dS values increases statistical power, and thus was generally applied to dN/dS analyses in this study.
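The bootstrap over tumor samples can be sketched as follows. This is a generic percentile bootstrap of a pooled, log-transformed dN/dS, not the exact procedure used in this study. The per-tumor tallies are randomly generated placeholders, and normalizing by a fixed ratio `r` (rather than by permuted tallies) is a simplifying assumption.

```python
import math
import random

def log_dnds(tallies, r):
    """Log of the pooled dN/dS across a list of per-tumor (dN, dS) tallies."""
    dn = sum(t[0] for t in tallies)
    ds = sum(t[1] for t in tallies)
    return math.log(dn / (ds * r))

def bootstrap_interval(tallies, r, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval obtained by resampling tumors with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        log_dnds([rng.choice(tallies) for _ in tallies], r) for _ in range(n_boot)
    )
    lower = stats[int(n_boot * alpha / 2)]
    upper = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

rng = random.Random(7)
tallies = [(rng.randint(1, 30), rng.randint(1, 12)) for _ in range(300)]  # placeholders
r = 2.795  # assumed ratio of nonsynonymous to synonymous loci
point = log_dnds(tallies, r)
lower, upper = bootstrap_interval(tallies, r)
```

Resampling whole tumors, rather than individual mutations, keeps each tumor's dN and dS tallies together.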
A Parametric Null Model of Mutagenesis
For comparison, we also calculated dN/dS using dNdScv67, a previously published parametric null model of mutagenesis in cancer1. To compare both methods, dNdScv was run globally and separately on samples stratified by the total number of substitutions using the following parameters:
max_coding_muts_per_sample = Inf, max_muts_per_gene_per_sample = Inf
Global dN/dS values of all nonsynonymous mutations (wall, reported by dNdScv) were used. This model reproduced our nonparametric dN/dS trends (Fig. S3) and was used to infer patterns of selection in synonymous mutations (Fig. S19). We note that stratifying tumors in TCGA into 20 bins of equal sample size (as was done in ref. 1), rather than evenly spaced bins, averages out a significant proportion of the negative selection observed in passengers, since low mutation burden tumors reside within the tail end of the distribution (Fig. S10).
Orthogonality of dN/dS with Mutational Burden and effects of excluding samples with no synonymous mutations
Mutational burden is generally calculated as the total number of substitutions within a sample (i.e. dN + dS); however, these tallies are also used in our measurement of dN/dS. Hence, any interdependence of mutational burden with dN/dS could bias our understanding of the relationship between selection and genome-wide mutation rate. We consider the interdependence of these two measures by assuming that dN and dS are independently Poisson-distributed with rate parameters λN and λS. The joint probability mass function of any combination of these two quantities is then:
$$P(dN = n, dS = s) = \frac{\lambda_N^{\,n}\, e^{-\lambda_N}}{n!} \cdot \frac{\lambda_S^{\,s}\, e^{-\lambda_S}}{s!}$$
Here, r = λN / λS. The expectation value of dN/dS, for any degree of selection versus any combination of nonsynonymous and synonymous mutation tallies can then be calculated simply by exhaustively summing over all combinations that arise with probability above machine precision. In Figure S5, we compare the variation in dN/dS for a typical genome under neutral selection or equally-balanced positive and negative selection (r = 2.8) using the dN + dS and dS mutational burden metrics. We observe less deviation from expectation using dN + dS primarily because dS alone is a poor proxy for the mutation rate — i.e. there are far fewer synonymous mutations to use to estimate the mutation rate. dN + dS did exhibit slightly greater bias in observed dN/dS relative to expectation, however this bias was small compared to the variation in estimates (<5% for mutational burdens greater than 2) and biased observed estimates towards increased values of dN/dS, which will only understate the degree of negative selection. Lastly, we note that because the genome-wide dN/dS is approximately 1, deviations from these theoretical calculations should be minimal.
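The exhaustive summation can be sketched as below. This is a simplified version that assumes independent Poisson tallies and conditions only on both tallies being nonzero, not the exact calculation behind Figure S5. The function names and the truncation threshold `eps` are illustrative.

```python
import math

def poisson_pmf(k, lam):
    # Computed in log space to avoid overflow for large k.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def expected_dnds(lam_s, r, omega=1.0, eps=1e-16):
    """E[(dN/dS) / r | dN > 0, dS > 0] for independent Poisson dN and dS."""
    lam_n = omega * r * lam_s
    lam_total = lam_n + lam_s
    k_max = int(lam_total + 20 * math.sqrt(lam_total) + 50)
    total = weight = 0.0
    for n in range(1, k_max):
        p_n = poisson_pmf(n, lam_n)
        if p_n < eps:
            continue
        for s in range(1, k_max):
            p = p_n * poisson_pmf(s, lam_s)
            if p < eps:          # skip combinations with negligible probability
                continue
            total += p * (n / s) / r
            weight += p
    return total / weight

neutral = expected_dnds(lam_s=50.0, r=2.8)
```

Under neutrality the result is slightly above 1, because E[1/dS] exceeds 1/E[dS]. This matches the direction of bias noted above, which can only understate negative selection.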
We also tested the effects of this non-orthogonality of our approach in three additional ways. First, we investigated the correlation of mutational burden metrics with mutation rate in our simulated tumors (see below) and found that dN + dS correlated most strongly with mutation rate (Fig. S5C). Next, we randomly partitioned all protein-coding mutations into two necessarily orthogonal halves: a half that defined the mutational burden and a half that was used for calculating dN/dS. This partitioning found that selection patterns persisted (Fig. S5B). Finally, using WGS data, we compared dN/dS to measures of mutational burden that excluded data from protein-coding regions (all intergenic and all intronic mutations), which once again represents a completely orthogonal comparison of dN/dS with mutational burden (Fig. S3).
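The random partitioning step can be sketched as follows. Here each mutation is assigned to one half by a fair coin flip. Whether the original analysis used coin flips or an exact 50/50 split is not stated, so this is an assumption, and the mutation labels are placeholders.

```python
import random

def split_mutations(mutations, rng):
    """Assign each mutation to either the burden half or the dN/dS half."""
    burden_half, dnds_half = [], []
    for mutation in mutations:
        if rng.random() < 0.5:
            burden_half.append(mutation)
        else:
            dnds_half.append(mutation)
    return burden_half, dnds_half

rng = random.Random(3)
mutations = ["N"] * 280 + ["S"] * 100   # placeholder nonsynonymous/synonymous labels
burden_half, dnds_half = split_mutations(mutations, rng)

burden = len(burden_half)               # mutational burden from one half only
dn = dnds_half.count("N")               # dN/dS tallies from the other half only
ds = dnds_half.count("S")
```

Because no mutation contributes to both quantities, any remaining association between burden and dN/dS cannot come from shared counts.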
Identification of driver genes in cancer
For all analyses using SNVs, unless explicitly stated, a comprehensive list of 299 pan-cancer driver genes derived from 26 computational tools was used to catalog driver genes16. Other pan-cancer driver gene sets tested were derived from COSMIC’s Driver Gene Census15 (downloaded in October 2016) and IntOGen’s Cancer Drivers Database17 (v2014.12), which contained 602 and 459 driver genes, respectively.
Many driver genes are associated with only particular tumor subtypes. To compare patterns of selection across cancer subtypes without increasing or decreasing the size of the list for each subtype, we chose to use a single set of driver genes for most analyses. This may understate the degree of positive selection in driver genes, as mutations in these genes may be passengers in some tumor subtypes. In Fig. S8, we investigate patterns of selection using the top 100 driver genes identified for each tumor type and observe decreased signatures of positive selection overall in driver genes. Nevertheless, the patterns of attenuated selection in drivers and passengers remain. While tissue-type specific driver genes certainly exist, our results suggest that our statistical power to detect drivers still remains too limited to justify subdividing analyses by tumor type in many cases.
For all CNA analysis, GISTIC 2.068 was used to identify a set of genomic regions enriched for copy number gains and copy number losses using recommended settings with a confidence threshold of 0.9. CNAs used to identify these peaks were downloaded from the NIH Genomic Data Commons (GDC)27 in the TCGA cohort. For each amplification peak, the closest gene was annotated as a putative Oncogene, and similarly the closest gene to each deletion peak was annotated as a putative Tumor Suppressor. The top 100 amplification peaks (oncogenes) and deletion peaks (Tumor Suppressors) were classified as drivers for each of the 32 tumor types. 34% of identified driver genes appear in more than one tumor type, while 2.6% of identified driver genes appear in more than five tumor types.
For both SNV and CNA analysis, passengers were defined as mutations that did not reside within driver genes. The vast majority of mutations are passengers, and their relative totals for both SNVs and CNAs are depicted in Fig. S24.
Annotation of clonal and subclonal mutations
Since TCGA contains SNVs with high coverage and available purity estimates, only MC3 SNVs (exclusive to TCGA) were used in this analysis (WGS read-depth is generally lower than WES read-depth). Variant allele frequencies (VAFs) were calculated per site as the number of mutant read counts divided by the total number of read counts. VAFs were adjusted for purity using calls made by ABSOLUTE27,69, collected from GDC. A VAF threshold of 0.2 was used to define ‘subclonal’ (< 0.2) vs ‘clonal’ (> 0.2) SNVs. Different VAF thresholds were considered (Fig. S12) and the choice of ‘clonal’ thresholding did not impact the conclusions of this study.
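The VAF calculation and thresholding can be sketched as below. The purity adjustment shown (dividing the raw VAF by purity and capping at 1) is an assumed, simplified correction, since the exact adjustment is not specified here. Assigning a VAF of exactly 0.2 to the clonal class is likewise an arbitrary choice.

```python
def adjusted_vaf(alt_reads, total_reads, purity):
    """Raw VAF divided by tumor purity (ASSUMED adjustment), capped at 1."""
    raw = alt_reads / total_reads
    return min(raw / purity, 1.0)

def classify(vaf, threshold=0.2):
    """Label a mutation as subclonal (< threshold) or clonal (otherwise)."""
    return "subclonal" if vaf < threshold else "clonal"

# Hypothetical sites: (mutant reads, total reads, ABSOLUTE purity estimate).
sites = [(6, 100, 0.8), (30, 100, 0.6), (45, 90, 0.9)]
labels = [classify(adjusted_vaf(alt, total, purity)) for alt, total, purity in sites]
```

Varying `threshold` in `classify` corresponds to the sensitivity analysis over VAF cutoffs in Fig. S12.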
PolyPhen2 analysis
PolyPhen2 annotations in the MC3 SNP calls were used18. Only missense mutations that were categorized as either ‘benign’, ‘probably damaging’ or ‘possibly damaging’ were used. The fraction of pathogenic missense mutations was calculated as the number of mutations categorized as either ‘probably damaging’ or ‘possibly damaging’ divided by the total number of categorized mutations.
Classification of genes by functional category
To test for patterns of selection in functionally related genes, we annotated all mutations by different functional categories and Gene Ontology (GO) terms26. Oncogenes and tumor suppressors were annotated from a curated set of 99 high confidence cancer genes70. Essential genes were collected from a genome-wide CRISPR screen that identified genes required for proliferation and survival in a human cancer cell line71. Housekeeping genes were defined as genes with an exon that is expressed in all tissues at any nonzero level, and exhibits a uniform expression level across tissues72. Interacting proteins were downloaded from the mentha database in April 201973.
To identify highly expressed genes, median transcripts per million (TPM) in 54 tissue types (v7 release) were downloaded from the Genotype-Tissue Expression (GTEx) project45. Tissues that contained high expression in most genes, specifically testes, were removed. Only genes that had TPM counts above zero in any of the 53 remaining tissues were used. TPM counts were averaged across all tissues. Highly expressed genes were defined as the top 1000 genes expressed across all tissues.
To test for signals of negative selection in other functional groups, we annotated mutations by candidate GO terms according to Biological Processes: Transcription Regulation (GO Term ID: 0140110), Translation Regulation (GO Term ID: 0045182), and Chromosome Segregation (GO Term ID: 0007059).
Somatic Copy Number Alterations (CNAs)
All CNAs were downloaded from the COSMIC database in June 201515. Mitochondrial CNAs were discarded from analysis, as copy number changes are difficult to infer. Gene annotations and the locations of telomeres and centromeres were downloaded from the UCSC Genome Browser (hg19). Telomeric and centromeric regions were masked from all measurements of dE/dI. Because the selection patterns of non-focal CNAs (alterations with at least one terminus in a telomeric or centromeric region) were not noticeably different from long (>100kb) focal CNAs, these two alteration classes were aggregated for analysis. Notably, we observed positive selection for both amplifications and deletions within oncogenes, and for both deletions and amplifications within Tumor Suppressors. For this reason, we did not distinguish between gains and losses, nor between oncogenes and Tumor Suppressors, in published analyses: any CNA that overlapped an oncogene or tumor suppressor in any region (for any fraction of the CNA) was classified as a driver. Mutational burden was defined simply as the total number of CNAs within a sample. Pan-cancer CNAs from cBioPortal (August 2018) were also analyzed; however, consistent purity and ploidy estimates could not be obtained by using either ABSOLUTE69 or TITAN74, so these data were not used for published analyses of CNAs.
Measurements of selection on CNAs
dE/dI was calculated using a ‘Breakpoint Frequency’ metric and a ‘Fractional Overlap’ metric. For both metrics, the dE/dI of a particular gene set i (e.g. driver or passenger genes) is defined by a genomic track Ti,g, which is one for every annotated region g of the track and zero elsewhere. Only non-centromeric and non-telomeric regions are considered in the mappable human genome G. Each CNA Cg,m is defined by its position on the genome g and the mutational burden m of the tumor harboring the mutation. For ‘Breakpoint Frequency’, Cg,m is one at the position of both termini of the CNA and zero elsewhere. For ‘Fractional Overlap’, Cg,m is 1/L, where L is the length of the CNA, for every region of the genome spanned by the CNA and zero elsewhere. For a particular range of mutational burdens M, dE/dI was defined as:
We note that calculation is accelerated by >100x by commuting Ti,g with the outer summation (∑m∈M). Lastly, we randomly permuted the start and stop positions of each CNA, while preserving its length, to derive a set of neutral CNAs not experiencing selection. This permutation analysis finds that dE/dI for both breakpoint frequency and fractional overlap is ~1 in the absence of selection (Fig. S23).
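The length-preserving permutation can be illustrated with a toy genome. The sketch below is not the dE/dI estimator itself. It only shows that, for randomly placed CNAs, the mean fractional overlap with a track matches the track's share of the genome. The genome length, track layout, and CNA lengths are all hypothetical.

```python
import random

def overlap_fraction(start, length, track):
    """Fraction of the CNA [start, start + length) covered by track intervals."""
    end = start + length
    covered = sum(max(0, min(end, b) - max(start, a)) for a, b in track)
    return covered / length

def permuted_enrichment(lengths, track, genome_length, rng):
    """Mean fractional overlap of randomly placed CNAs, relative to track size."""
    track_fraction = sum(b - a for a, b in track) / genome_length
    total = 0.0
    for length in lengths:
        start = rng.randrange(genome_length - length + 1)  # preserves CNA length
        total += overlap_fraction(start, length, track)
    return (total / len(lengths)) / track_fraction

rng = random.Random(42)
genome_length = 1_000_000
# Hypothetical non-overlapping track covering 20% of the toy genome.
track = [(i * 10_000, i * 10_000 + 2_000) for i in range(100)]
lengths = [rng.randint(500, 50_000) for _ in range(5_000)]
enrichment = permuted_enrichment(lengths, track, genome_length, rng)
```

An enrichment near 1 for permuted CNAs is the neutral baseline against which observed CNAs are compared.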
Tumor purity analysis in TCGA samples
Tumor purity estimates from the ABSOLUTE algorithm69 were downloaded from the GDC in May 2020. For all tumors and for tumors with <= 10 substitutions, correlation coefficients between the total number of substitutions and tumor purity were calculated. To evaluate the effects of tumor purity on patterns of selection, tumors below increasing thresholds of tumor purity were removed from the analysis, and dN/dS was calculated on tumors stratified by mutational burden bins (as described above).
Expression analysis
Gene expression data were downloaded from the COSMIC database in September 2019. Genes used to identify different protein folding pathways were downloaded from ref. 75, and genes involved in protein degradation pathways were identified from ref. 76. The median gene expression of all genes in each protein folding pathway was used. Patients were binned by the total number of substitutions (using MC3 SNP calls from TCGA) and CNAs, and the average gene expression of each bin was calculated.
Cancer subtype analysis
All tumor subtypes in TCGA and ICGC were grouped into 9 sub-categories, based on broad, predominantly anatomical features. Anatomical features (i.e. organ and systems of organs), rather than histological features or inferred cell-of-origin, were used as groupings because we believe that the fitness effects of mutations should be predominantly defined by the environment of the tumor. Nevertheless, we observed attenuated selection in both drivers and passengers in many broad histologically defined classifications (e.g. adenocarcinomas & sarcomas). For all cancer grouping analysis (broad and subtype), tumors were stratified into bins by the total number of substitutions (dN + dS) on a log scale. Since tumor subtypes vary in their range of mutational burdens, (e.g. KIRC cancer subtypes only have tumors with <100 substitutions), dN/dS values in the lowest and highest mutational burden bin for each cancer-subtype are shown.
Specific cancer subtype categories were taken directly from the NCI Genomic Data Commons (GDC)27. Because CNAs were downloaded from COSMIC, CNA datasets were not classified with this same ontology. Table S1 details how CNA classifications were mapped onto GDC categories (and sometimes more broadly defined groups). All subtypes with >200 samples were used in our CNA subtype analyses (Fig. S15).
An evolutionary model with Hill-Robertson Interference
Somatic cells in our populations are modeled as individual cells that can stochastically divide and die in a first-order (memoryless) Gillespie Algorithm. This model was developed and described previously33. During division, cells can acquire advantageous drivers with rate μTdrivers and deleterious passengers with rate μTpassengers. These values specify the mean of Poisson-distributed pseudo-random number (PRN) generators that prescribe the number of drivers and passengers conferred during division (e.g. the number of drivers per division nd is distributed as P(nd = k) = Poisson[k; λ = μTdrivers] = λ^k e^−λ / k!). The Distribution of Fitness Effects (DFE) conferred by each driver and each passenger are Exponentially-distributed PRNs with probability densities P(si = x; sdrivers) = Exp[-x/sdrivers]/sdrivers and P(si = x; spassengers) = Exp[-x/spassengers]/spassengers respectively, with passenger effects acting as fitness costs. Simulations with other exponential-family DFEs do not qualitatively differ from these exponential distributions28. The aggregate absolute cellular fitness is in our Multiplicative Epistasis model and Δf = si/(1 + vf) with v = 1 in our Diminishing-Returns Epistasis Model where Δf is the change in cellular fitness with each mutation77. The rate of cell birth is inversely proportional to cellular fitness, while the rate of cell death increases with the population size of the tumor N.
With these birth and death processes, mean population size abides by a Gompertzian growth law in the absence of additional mutations, which is scaled by the mean cellular fitness E[N(< f >)] = Log[1 + < f > / N 0] (derived from Master Equation28). While, programmatically, mutations exclusively affect the birth rate and the constraints on growth exclusively affect the death rate, we previously demonstrated that birth and death rates are generally nearly-balanced such that dynamics are not affected by this design choice.
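A generic birth-death process of this kind can be simulated with the direct-method Gillespie algorithm, as sketched below. This is not the published model. The constant per-cell birth rate, the per-cell death rate log(1 + N/capacity), and all parameter values are assumptions for illustration, and mutations are omitted entirely.

```python
import math
import random

def simulate_birth_death(n0, birth_rate, capacity, t_max, seed=0):
    """Direct-method Gillespie simulation of a single-type birth-death process.

    ASSUMPTIONS: each cell divides at `birth_rate` and dies at a per-cell
    rate of log(1 + N / capacity), so death accelerates as the tumor grows.
    """
    rng = random.Random(seed)
    n, t = n0, 0.0
    while n > 0:
        total_birth = n * birth_rate
        total_death = n * math.log(1.0 + n / capacity)
        total = total_birth + total_death
        t += rng.expovariate(total)          # waiting time to the next event
        if t > t_max:
            break
        if rng.random() * total < total_birth:
            n += 1                           # cell division
        else:
            n -= 1                           # cell death
    return n

capacity = 1000.0
final_size = simulate_birth_death(n0=1000, birth_rate=1.0, capacity=capacity, t_max=20.0)
# Population size at which the assumed birth and death rates balance.
equilibrium = capacity * (math.exp(1.0) - 1.0)
```

In this sketch the population fluctuates around `equilibrium`. Raising the birth rate, as a driver would in a fitness-dependent variant, shifts that equilibrium upward.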
Because somatic cells do not recombine during cell division, dominance coefficients were not explicitly modeled. Thus in diploid cancers, our selection coefficients estimate the mean heterozygous effect of drivers and passengers (i.e. hs). Similarly, Loss of Heterozygosity (LOH) events (gene losses, gene conversions, mitotic recombination, etc.) are not explicitly modeled either; however, these events can be viewed as additional mutations that may be either adaptive drivers or deleterious passengers in the model. As sequencing data improves, we believe that it will be informative to explicitly model dominance coefficients, tumor ploidy, and LOH events.
Simulations progressed until tumor extinction (N = 0 cells), malignant transformation (N = 10^6 cells), or until approximately 100 years had passed (18,500 generations). Only fixed mutations (present in the Most Recent Common Ancestor) within clinically-detectable growths were analyzed in our ABC pipeline. The behavior of this model has been described previously28,33 and the most relevant assumptions of this model and their effects on the conclusions of this study are described in Table S2.
Cells in our populations are fully described by their accrued mutations, and birth and death times. Birth and death events were modeled using an implementation of the Next Reaction Method78, a Gillespie Algorithm that orders events using a Heap Queue. Generation time in our model was defined as the inverse of the mean birth rate of the population: 1/ <B(d, p)>. While all mutation events occurred during cell division, if mutations were to occur per unit of time (rather than per generation), rapidly growing tumors would acquire drivers at a slightly slower rate as generation times decline over time. This effect, however, is negligible compared to the variation in waiting times conferred by the variation in mutation rates (division times merely double, while mutation rates vary by 100,000-fold).
This simple evolutionary model is defined by five parameters: μTdrivers, μTpassengers, sdrivers, spassengers, and N0. The target size of drivers is defined as the approximate number of nonsynonymous mutations in the Bailey Driver Screen: Tdrivers = (# of driver genes)•(mean driver length)•(fraction of SNVs that are nonsynonymous) = 300 genes • 1298 loci/gene • 0.737 nonsynonymous loci / loci ≈ 286,886 nonsynonymous loci. The target size of passengers was simply the remaining loci in the protein coding genome, Tpassengers = 20,451,136 nonsynonymous loci. The mutation rate was constant throughout each tumor simulation and randomly sampled from a uniform distribution in log-space that ranged from 10^−12 to 10^−7 mutations•locus^−1•generation^−1. While tumors were initiated from this broad range, malignancies (N > 10^6 cells) were almost always restricted to mutation rates between 10^−10 and 10^−8 (Fig. S20), as tumors with mutation rates drawn below this range almost never progressed to cancer within 100 years and tumors with mutation rates drawn above this range went extinct through natural selection.
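The target-size arithmetic and the log-uniform sampling of mutation rates can be sketched as follows. The function name is illustrative. The small difference between the product computed here and the driver target size reported in the text presumably reflects rounding of the quoted factors.

```python
import math
import random

n_driver_genes = 300
mean_driver_length = 1298        # loci per gene
frac_nonsynonymous = 0.737       # nonsynonymous loci per locus

# Product of the rounded factors quoted in the text (about 286,988 loci).
t_drivers = n_driver_genes * mean_driver_length * frac_nonsynonymous
t_passengers = 20_451_136        # nonsynonymous passenger loci, from the text

def sample_mutation_rate(rng, low=1e-12, high=1e-7):
    """Draw a per-locus, per-generation mutation rate uniformly in log-space."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)
rates = [sample_mutation_rate(rng) for _ in range(10_000)]
# Expected driver and passenger mutations per division for one sampled tumor.
mu = rates[0]
drivers_per_division = mu * t_drivers
passengers_per_division = mu * t_passengers
```

Sampling in log-space gives each decade of mutation rate equal prior weight, so half of the draws fall below 10^−9.5.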
The likelihood that tumors progress to cancer in the presence of deleterious passengers depends heavily on the initial population size N0 of the tumor. This dependence was studied previously33, where it was demonstrated that reasonable evolutionary simulations (those that progress to cancer >10% of the time, but less than 90% of the time) are restricted to a four-dimensional manifold N* within the five-dimensional phase space of parameters. For this reason, N0 = N*(sdrivers, spassengers, μTdrivers, μTpassengers) was determined by the other four parameters. To first order, this manifold is Tpassengers spassengers / (Tdrivers sdrivers^2); however, we used a more precise estimate (Eq. S8 of ref. 33), which does not exist in closed form and incorporates more precise estimates of Muller’s Ratchet and the effects of hitchhiking on both driver and passenger accumulation rates. Additionally, at very low values of sdrivers, progression to cancer is limited by time, not by the accumulation of deleterious passengers. Hence, we assigned N0 such that:
Here, Pcancer and tcancer – the likelihood and waiting-time to cancer – are defined by equations S8 and S12 respectively in 33. N0 was determined from these equations using Brent’s Method. Supplementary Figure 17 depicts the values of N0, which ranged from 1 to 100 for all simulations.
In tumors that progress to malignancy (N = 10^6), only fixed nonsynonymous mutations (present in all simulated cells) were recorded. We also recorded (i) the fitness effect of these mutations, (ii) the mean population fitness, (iii) the number of generations until malignancy, and (iv) the mutation rate. The latter two values were used to generate the number of synonymous drivers and passengers, where P(dS = k) = Poisson[k; λ = μ Tdrivers/passengers tMRCA / r] defines the number of synonymous drivers/passengers conferred, tMRCA represents the number of divisions until the Most Recent Common Ancestor arose in the simulation, r = 2.795 represents the ratio of nonsynonymous to synonymous loci within the genome, weighted by the genome-wide trinucleotide somatic mutation rate, and the Poisson PRN generator was defined above. In simulations where synonymous drivers could arise, a fraction of the recorded nonsynonymous mutations (ranging from 0% to 20%) were simply re-labeled as synonymous drivers (as opposed to nonsynonymous drivers). This was done, again, by Poisson-sampling in proportion to the desired fraction for each cancer simulation.
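The generation of synonymous tallies can be sketched as below, reading the Poisson rate as λ = μ T tMRCA / r. The sampler uses Knuth's multiplication method, chosen here for brevity and suitable only for moderate λ. The example mutation rate and tMRCA are hypothetical.

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's multiplication method; adequate for small to moderate lam."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def synonymous_count(mu, target_size, t_mrca, r, rng):
    """Number of synonymous mutations accrued before the MRCA arose."""
    lam = mu * target_size * t_mrca / r
    return poisson_sample(lam, rng)

rng = random.Random(5)
mu = 1e-9                  # hypothetical mutation rate per locus per generation
t_passengers = 20_451_136  # nonsynonymous passenger loci, from the text
t_mrca = 1000              # hypothetical number of divisions until the MRCA
r = 2.795                  # nonsynonymous to synonymous ratio, from the text

lam = mu * t_passengers * t_mrca / r
draws = [synonymous_count(mu, t_passengers, t_mrca, r, rng) for _ in range(20_000)]
mean_draw = sum(draws) / len(draws)
```

Dividing the nonsynonymous target size by r converts it to the corresponding synonymous target size, so neutral simulations yield dN/(dS • r) near 1.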
20 × 20 combinations of sdrivers and spassengers parameters were simulated (Fig. S16 & S17). Simulations were repeated until 10,000 cancers at each parameter combination were obtained or until 10 million tumor populations were simulated. While we attempted to initiate tumors at a population size where the probability of progression to cancer was 50%, some parameter combinations still did not yield 10,000 cancers after 10 million attempts (i.e. Pcancer < 0.1%). These combinations were predominately at low values of sdrivers, which were far from the MLE estimate of sdrivers and represent unrealistic evolutionary scenarios: drivers cannot be weakly beneficial, relegated to only 300 genes, and still overcome deleterious passengers within 100 years. These simulations are annotated as “Progression Impossible.” Simulation parameter sweeps were performed for both the Multiplicative and Diminishing Returns Epistasis models. Twenty fractions of synonymous drivers were also generated (ranging from 0% to 20%). These fractions were generated by simply re-labeling the driver mutations which conferred fitness (generated during the simulation) as synonymous, instead of nonsynonymous.
Summary statistics of simulated and observed tumors
For both simulated and observed data, we summarized dN/dS rates versus mutational burden for drivers and for passengers by decade-sized bins: (0, 10], (10, 100], (100, 1,000]. Mutational burden for simulations was defined as the total number of substitutions (dN + dS) – exactly as it was defined for observed data. For simulated data, dN/dS = dN/(dS • r). Like observed data, dN/dS rates attenuated towards 1 for both drivers and passengers for all values of sdrivers and spassengers.
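The decade binning and the simulated dN/dS statistic can be sketched as follows. The per-tumor tallies are placeholders, and the function names are illustrative.

```python
def decade_bin(burden, edges=(10, 100, 1000)):
    """Index of the bin (0, 10], (10, 100] or (100, 1,000] containing `burden`."""
    if burden <= 0:
        return None
    for index, upper in enumerate(edges):
        if burden <= upper:
            return index
    return None   # burdens above 1,000 substitutions are not summarized

def binned_dnds(tumors, r=2.795):
    """Pooled dN / (dS * r) per mutational-burden bin, with burden = dN + dS."""
    tallies = {}
    for dn, ds in tumors:
        index = decade_bin(dn + ds)
        if index is None:
            continue
        pooled = tallies.setdefault(index, [0, 0])
        pooled[0] += dn
        pooled[1] += ds
    return {i: dn / (ds * r) for i, (dn, ds) in sorted(tallies.items()) if ds > 0}

# Placeholder (dN, dS) tallies for four simulated tumors.
tumors = [(3, 2), (4, 1), (60, 20), (500, 180)]
summary = binned_dnds(tumors)
```

Integer comparisons against the bin edges avoid floating-point issues that a log10-based binning could introduce at exact powers of ten.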
Mutational Burdens (MB) for simulated and observed data were summarized with the parameters of a Negative Binomial distribution, where . This distribution has been used previously to summarize the mutational burdens of human tumors 79 and exactly defines the expected number of mutations at transformation in a Multi-Stage Model of Tumorigenesis30 when n drivers are needed for transformation and the probability that any mutation be a driver is 1 – p 80. Both n and p were used to summarize MB. These quantities were determined by Maximum Likelihood optimization of the probability mass function above over the support of mutational burdens of [1, 1,000] substitutions. The Han-Powell quasi-Newton Least-squares method was used for optimization.
Age-dependent Cancer Incidence rates (CI) were summarized with the parameters of a Gamma distribution, where CI(t; k, θ) = γ(k, t/θ) / Γ(k). Here, γ is the lower incomplete gamma function and Γ(k) = γ(k, ∞) is the regular gamma function. Similar to our summarization of mutational burdens, this distribution is a generalization of the exact waiting time to transformation expected from a Multi-Stage Model of Tumorigenesis when tumors arise at a uniform rate over time, require k drivers for transformation, and wait an average time of θ between drivers 80. This Cumulative Distribution Function was fit to observed incidence rates for all patients above 20 years of age using the least squares numerical optimization defined above (All cancer sites combined, both sexes, all races, 2012–2016 81). Patients under 20 years of age were excluded because cancers in these patients generally arise from germline predispositions to cancer, which are (i) not directly modeled by our simulations, (ii) not detected as somatic mutations, and (iii) result in age-incidence curves that do not agree with a Gamma distribution30. Because all cancer simulations are initiated at t = 0 (instead of uniformly in time, as is presumed in the Multi-Stage Model), the simulated data was fit using the probability density function of this distribution (instantaneous derivative) using Maximum Likelihood and the optimization algorithm described above. The cumulative distribution, then, represents the expected age-dependent cancer incidence rate when simulations begin at uniformly-distributed moments in time and, thus, was used to generate Figure 3D. Only the shape parameter k was used in ABC (and θ was ignored), as θ only specifies the dimensionality of time (simulation time was measured in cellular generations, not years) and all values of θ in our simulations are equivalent under a Gauge transformation.
Additionally, we do not expect the exact times of incidence to be particularly informative as the time of transformation is generally somewhat earlier than the time of detection.
Use of Approximate Bayesian Computation (ABC) for model selection and parameter inference
Like many Bayesian analyses, the main steps of an ABC analysis scheme are to (1) formulate a model, (2) fit the model to data (parameter estimation), (3) improve the model by checking its fit (posterior-predictive checks), and (4) compare this model to other models 82,83.
The nine summary statistics described above were used to compare simulations to observed data. Agreement was summarized with a log-Euclidean distance, because all summary statistics reside on the domain [0, ∞) and log-transformation produced less heteroscedasticity in the simulated data than a square-root transformation or no transformation. The variance of the summary statistics was not normalized. ABC was performed using the `abc` R package82.
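The distance described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name is hypothetical, and the sketch assumes every summary statistic is strictly positive, since the logarithm is undefined at the zero boundary of the domain.

```python
import math

def log_euclidean_distance(simulated, observed):
    """Euclidean distance between log-transformed summary statistics.

    Assumes strictly positive statistics; no variance normalization is
    applied, matching the description in the text.
    """
    if len(simulated) != len(observed):
        raise ValueError("summary-statistic vectors must have equal length")
    if any(v <= 0 for v in list(simulated) + list(observed)):
        raise ValueError("log transform requires strictly positive values")
    return math.sqrt(sum((math.log(s) - math.log(o)) ** 2
                         for s, o in zip(simulated, observed)))
```

Because the distance is computed on logarithms, it depends only on the ratios between simulated and observed statistics. A two-fold discrepancy therefore contributes equally whether the statistic is large or small.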
The rejection method (Feedforward Neural Net) and tolerance (0.5) were chosen because they minimized the prediction error of the simulated data under Leave-one-out Cross Validation (CV, Fig. S18A). 10,000 instances of the neural network, which was restricted to a single layer, were initiated, and the median prediction of these networks was used. These settings were used for both model comparison and parameter inference. The posterior model probability (postpr) was used to compare the two epistatic models (Diminishing Returns versus Multiplicative). The posterior probability of the Diminishing Returns model (14%) was lower than that of the Multiplicative Epistasis model (86%). For parameter inference, the s_drivers and s_passengers prior values were log-transformed.
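To make the model-comparison step concrete, the following Python sketch shows plain rejection ABC. It is a simpler technique than the neural-network method described above, and it is not the `abc` R package. The function name and the toy distances are assumptions for illustration. The sketch also assumes that both models contribute equal numbers of simulations, which corresponds to equal prior weight.

```python
from collections import Counter

def rejection_model_posterior(distances, model_labels, tolerance):
    """Keep the closest `tolerance` fraction of simulations and report the
    share of accepted simulations that came from each model."""
    n_accept = max(1, int(tolerance * len(distances)))
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    accepted = ranked[:n_accept]
    counts = Counter(model_labels[i] for i in accepted)
    posterior = {m: counts[m] / n_accept for m in sorted(set(model_labels))}
    return posterior, accepted

# Toy distances between simulated and observed summary statistics
# (illustrative values only, not results from the study).
distances = [0.2, 0.9, 0.4, 1.5, 0.3, 1.1, 0.8, 2.0]
labels = ["multiplicative", "diminishing", "multiplicative", "diminishing",
          "multiplicative", "diminishing", "diminishing", "multiplicative"]

posterior, accepted = rejection_model_posterior(distances, labels, tolerance=0.5)
```

With a tolerance of 0.5, the four closest toy simulations are accepted, and the posterior model probabilities are the fractions of accepted simulations drawn from each model. The regression-adjusted estimate used in the text refines this basic rejection step.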
For the synonymous driver model, the base model (without synonymous drivers) was simply the lowest quantity of synonymous drivers (0%) in the parameter sweep of synonymous driver quantities (Fig. S18B). The posterior probability mass of this value (0.043) was used as the one-sided p-value for the null hypothesis that these two models are equally predictive. Although the synonymous driver model agreed slightly better with the observed data, the s_drivers and s_passengers parameters could not be inferred from the data because the potential for synonymous drivers destroys the utility of dN/dS statistics, which are predicated on the notion that synonymous mutations are neutral. Virtually any value of dN/dS is attainable when the right combination of selective pressures on nonsynonymous and synonymous mutations is paired (Fig. S18C).
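The loss of identifiability can be illustrated with a toy calculation in Python. The sketch assumes a simplified normalized dN/dS, defined as the observed nonsynonymous-to-synonymous count ratio divided by the ratio expected under neutrality. The function name and all counts are hypothetical, and this is not the estimator used in the paper.

```python
def normalized_dn_ds(n_nonsyn, n_syn, expected_nonsyn, expected_syn):
    """Observed N/S ratio divided by the neutral expectation of N/S."""
    return (n_nonsyn / n_syn) / (expected_nonsyn / expected_syn)

# Hypothetical neutral expectation: 300 nonsynonymous, 100 synonymous mutations.
EXP_N, EXP_S = 300, 100

# Nonsynonymous mutations depleted to 40%, synonymous mutations neutral.
purifying = normalized_dn_ds(120, 100, EXP_N, EXP_S)

# Both classes depleted to 80%: the ratio looks neutral despite selection.
masked = normalized_dn_ds(240, 80, EXP_N, EXP_S)

# Nonsynonymous neutral, synonymous enriched two-fold by synonymous drivers:
# the ratio mimics negative selection on nonsynonymous mutations.
mimic = normalized_dn_ds(300, 200, EXP_N, EXP_S)
```

In this toy setting the ratio reflects only the relative retention of the two mutation classes. Selection acting on synonymous mutations can therefore hide or imitate selection on nonsynonymous ones.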
Supplementary Figures
Supplementary Tables
Acknowledgements
We thank Judith Frydman for her contribution on the heat shock response analysis, Monte Winslow for his contribution on cancer subtype analysis, Donate Weghorn for her contribution on the interdependence of dN/dS and mutational burden, Leonid Mirny, Grant Kinsler, Gabor Boross, Chuan Li, Alison Feder, Eliot Cowan and other members of the Petrov and Curtis labs for helpful comments and discussions. This work is supported by NIH grants T32-HG000044-21, E25-CA180993; the Director’s Pioneer Award DP1-CA238296 to C.C.; R01-CA207133, R35-GM118165, and R01-CA231253 to D.A.P.; and K99-CA226506 to C.D.M.
Footnotes
* dpetrov{at}stanford.edu
The effect of deleterious load on protein misfolding was investigated and included as part of the thesis. The effects of tumor purity were investigated, as well. The manuscript was compared more explicitly and directly to Martincorena et al (2017).