Summary
Cancer develops through a process of somatic evolution. Here, we reconstruct the evolutionary history of 2,778 tumour samples from 2,658 donors spanning 39 cancer types. Characteristic copy number gains, such as trisomy 7 in glioblastoma or isochromosome 17q in medulloblastoma, are found amongst the earliest events in tumour evolution. The early phases of oncogenesis are driven by point mutations in a restricted set of cancer genes, often including biallelic inactivation of tumour suppressors. By contrast, increased genomic instability, a more than three-fold diversification of driver genes, and an acceleration of mutational processes are features of later stages. Clock-like mutations yield estimates for whole genome duplications and subclonal diversification in chronological time. Our results suggest that driver mutations often precede diagnosis by many years, and in some cases decades. Taken together, these data reveal common and divergent trajectories of cancer evolution, pivotal for understanding tumour biology and guiding early cancer detection.
Introduction
Cancer arises through natural selection: initiated by mutations in a single cell, the accumulation of subsequent aberrations and the effects of selection over time result in clonal expansions of cells, ultimately leading to the formation of a genomically aberrant tumour1. This model has been underpinned by genetic studies, starting with classical work on retinoblastoma2 and the sequence of APC, KRAS and TP53 mutations during colorectal adenoma to adenocarcinoma progression3. Systematically establishing the order of mutations during somatic evolution across cancer types, however, has proven complicated due to small sample sizes and the stochastic nature of evolution between individuals.
Deep sequencing of bulk tumour samples makes it possible to examine the evolutionary history of individual tumours, based on the catalogue of somatic mutations they have accumulated4. Many studies have reconstructed the phylogenetic relationships between tumour samples and metastases from individual patients5-8, corroborating the clonal evolution model. From single samples, the timing of chromosomal gains can be estimated using point mutations within duplicated regions9,10. In addition, the relative ordering of events within a tumour type can be determined by aggregating pairwise timing estimates of genomic changes (for example clonal vs. subclonal) across many samples using preference models11,12. While these approaches provide insights into tumour development, they have only been applied to a limited number of cancers.
Here, we use the Pan-Cancer Analysis of Whole Genomes (PCAWG)13 dataset, assembled as part of the International Cancer Genome Consortium (ICGC)14 and The Cancer Genome Atlas (TCGA)15, to characterise the evolutionary history of 2,778 cancers from 2,658 unique donors across 39 cancer types. We determine the order and timing of mutations in cancer development to delineate the patterns of chromosomal evolution across and within different cancer types. We then define broad periods of tumour evolution and examine how drivers and mutational signatures vary between these stages. Finally, using CpG>TpG mutations, we convert timing estimates into approximate real time, and create typical timelines of tumour evolution.
Results
Reconstructing the life history of single tumours
A cancer cell’s genome is the cumulative result of the somatic aberrations that have arisen during its evolutionary past, and part of this history can be reconstructed from deep whole genome sequencing data (Fig. 1a)4. Initially, each point mutation occurs on a single chromosome in a single cell. If that chromosomal locus is subsequently duplicated, the point mutation will be co-amplified with the gained allele, which can be detected in deep sequencing data. Likewise, mutations found in a subset of tumour cells have not swept through the population, and must have occurred after the most recent common ancestor (MRCA) of the tumour cells in the sequenced sample.
Mapping point mutations to the proportion of cells and chromosomes enables us to define three categories, which we term early clonal, late clonal and subclonal, each associated with broad epochs of tumour evolution (Fig. 1a). Clonal mutations occurred before the emergence of the MRCA and are common to all cancer cells. These can often be further subdivided as either early clonal if they occurred before copy number gains, or late clonal otherwise. Subclonal mutations, by contrast, are only observed in a fraction of cancer cells. Importantly, the number of early (and late) clonal mutations provides information about the timing of the gain affecting the underlying copy number segment. For example, there would be few, if any, co-amplified early clonal mutations if the gain had occurred right after fertilisation (Fig. 1a and Online Methods)9.
These analyses are illustrated in Fig. 1b. As expected, the allele frequencies of somatic point mutations cluster tightly around the values imposed by the purity of the sample, local copy number configuration and identified subclones. As the sample pictured has undergone whole genome duplication (WGD), the mutation time estimates of all copy number segments scatter narrowly around a single time-point, independently of the exact copy number state, confirming that WGD is a single catastrophic event.
Timing patterns of copy number gains
To systematically explore the timing of copy number gains pan-cancer, we applied mutational timing analysis to all 2,778 samples from 2,658 distinct donors across the PCAWG dataset (see Supplementary Methods). We find that chromosomal gains are typically acquired during the second half of clonal evolution (median value 0.76, IQR = 0.43–0.94), with systematic differences between tumour types (Fig. 2a, Supplementary Fig. 1). In glioblastoma, medulloblastoma and pancreatic neuroendocrine cancers, a substantial fraction of gains occurs early in mutational time. Conversely, in lung cancers and melanomas, gains arise towards the end of the mutational time scale. Most tumour types, including breast, ovarian and colorectal cancer, show relatively broad periods of chromosomal instability, rather than staggered events throughout clonal evolution.
There are, however, certain tumour types with consistently early gains of specific chromosomal regions. Most pronounced is glioblastoma, where single copy gains of chromosomes 7, 19 and/or 20 are present in 90% of tumours (Fig. 2a-b). Strikingly, these gains are consistently timed within the first 10% of clonal mutational time. Similarly, the duplications leading to isochromosome 17q in medulloblastoma are timed exceptionally early. Although less pronounced, gains of chromosome 18 in B-cell non-Hodgkin lymphoma, as well as gains of the q arm of chromosome 5 in clear cell renal cell carcinoma, often have a distinctively early timing within the first 50% of mutational time.
We observed that co-occurring gains in the same tumour often appear to occur at a similar time, pointing towards punctuated bursts of copy number gains involving the majority of gained segments (Fig. 2c). While this is expected in tumours with WGD (Fig. 1b), it may seem surprising to observe synchronous gains (defined as more than 80% of gained segments in a single event) in near-diploid tumours. Still, synchronous gains are frequent, occurring in a striking 58% (469/814) of informative near-diploid tumours, 61% more frequently than expected by chance (p < 0.01, permutation test; Fig. 2d). These data indicate that tumour evolution is often driven in short bursts involving multiple chromosomes, confirming earlier observations in breast cancer16.
Timing of mutations in driver genes
As outlined above, point mutations can be qualitatively assigned to different time categories, allowing the timing of driver mutations (Fig. 1a, 3a). Using a panel of 453 cancer driver genes17, we find that the timing distribution of pathogenic mutations in the 50 most common drivers is predominantly clonal, and often early clonal (Fig. 3a-b). For example, TP53 and KRAS are 5-9x more likely to be mutated in the early than in the late clonal stage. For TP53, this trend is independent of tumour type (Fig. 3c). Mutations in PIK3CA are 4x more frequently clonal than subclonal, while non-coding changes near the TERT gene are 8x more frequently early clonal than expected. In contrast, SETD2 mutations are frequently subclonal, in agreement with previous reports5. Mutations in the non-coding RNA RMRP appear to be frequently late and subclonal.
Overall, common driver mutations predominantly occur early during tumour evolution. To understand how the entire landscape of all 453 driver genes changes over time, we calculated how the number of driver mutations relates to the number of driver genes in each of the evolutionary stages. This reveals an increasing diversity of driver genes mutated at later stages of tumour development: 50% of all early clonal driver mutations are found in only 12 different genes, whereas the corresponding proportion of late and subclonal mutations occurs in approximately 39 and 36 different genes, respectively, a more than 3-fold increase (Fig. 3d). These results are consistent with previous findings in non-small-cell lung cancers18, and suggest that, across cancer types, the very early carcinogenic events occur in a constrained set of common drivers, while a more diverse array of drivers is involved in late tumour development.
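The diversity measure used above (the number of genes that together account for 50% of driver mutations in a given stage) can be sketched as follows. This is a minimal illustration, not the analysis code: the function name `genes_for_half` and the toy gene lists are hypothetical, and ties are resolved simply by taking the most frequently mutated genes first.

```python
from collections import Counter

def genes_for_half(driver_genes, fraction=0.5):
    """Smallest number of genes that together hold `fraction` of the driver mutations.

    driver_genes: one gene name per observed driver mutation in a given stage.
    """
    counts = sorted(Counter(driver_genes).values(), reverse=True)
    target = fraction * sum(counts)
    running = 0
    for n_genes, count in enumerate(counts, start=1):
        running += count
        if running >= target:
            return n_genes
    return 0  # no mutations supplied

# Hypothetical toy data: a concentrated stage versus a diverse stage.
concentrated = ["TP53"] * 5 + ["KRAS"] * 3 + ["APC"] * 2
diverse = ["A", "B", "C", "D"]
print(genes_for_half(concentrated), genes_for_half(diverse))
```

A low value indicates that a few genes dominate the stage; a higher value indicates a broader repertoire of drivers.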
Relative timing of driver gene mutations and recurrent copy number changes
Next, we integrated the timing of driver events and recurrent copy number changes across cancer samples to better understand the timing of their relative contributions to tumour evolution. We calculated an overall ranking of lesions, detailing whether each lesion occurs preferentially early or late during tumour evolution, by aggregating order relations between pairs of lesions from individual samples within each cancer type (Supplementary Methods, section 3.2, Supplementary Fig. 2).
In colorectal adenocarcinoma, for example, we find APC mutations to have the highest odds of occurring early, followed by KRAS, loss of 17p and TP53, and SMAD4 (Fig. 3e). Whole-genome duplications have an intermediate ranking, indicating a variable timing, while many chromosomal gains and losses are typically late. These results are in agreement with the classical progression of APC-KRAS-TP53 proposed by Vogelstein and Fearon3, but add considerable detail.
In other cancer types, the sequence of events in cancer progression has not previously been studied in as much detail as in colorectal cancer. For example, in pancreatic neuroendocrine cancers, we find that many chromosomal losses, including those of chromosomes 2, 6, 11 and 16, occur early, followed by driver mutations in DAXX and MEN1 (Fig. 3f). WGD events occur late, after many of these tumours have reached a pseudo-haploid state due to widespread chromosomal losses. In glioblastoma, we find that beyond gains of chromosomes 7, 19 and 20 (as described above), loss of chromosome 10 and driver mutations in EGFR are also early (Fig. 3g). TERT promoter mutations tend to occur at early to intermediate time points, while other driver mutations, particularly RB1 and PTEN mutations, tend to be later events.
Across cancer types, we typically find TP53 mutations early, as well as losses of chromosome 17 (Supplementary Fig. 1). WGD events usually have an intermediate ranking and the majority of copy number changes occur after WGD. We also find that losses typically precede gains, and consistent with the results above, we find that common drivers typically occur earlier than rare drivers.
Timing of mutational signatures
Mutagenic processes acting on the tumour genome often leave characteristic signatures of their activity19,20. In order to quantify how these processes change over time, we estimated the intensity of active signatures within each sample, across the qualitative epochs of tumour evolution (early clonal, late clonal and subclonal). The changes in proportion of mutations associated with a given signature in each of these epochs provide a measure of the dynamics of relative signature activity (Fig. 4, Supplementary Fig. 3).
Overall, we find that signature activities typically change during clonal evolution by less than 30% (median fold change 0.98, IQR [0.70-1.36]), indicating that mutational processes act at a rather constant rate during tumour progression. This is in contrast with signature activity across patients, which varies 10- to 100-fold. There are, however, particular signatures that show consistent trends over time, both pan-cancer and within certain tumour types (Fig. 4). For example, the relative activity of the mutational signature associated with DNA damage caused by tobacco smoking (signature 4) decreases at least 1.2-fold in 70% of cancers where it is active clonally, consistent with previous reports in lung adenocarcinoma21,22.
Other signatures, including UV light (signature 7) in melanoma (40% of samples with clonally active signature), and signature 12, of unknown aetiology, in liver cancer (83% of samples) show a similar ≥1.2-fold decrease in activity towards the later stages of clonal evolution (Fig. 4). We also observe that some signatures increase in late clonal evolution, most notably signatures 2 and 13, which are associated with the activity of APOBEC enzymes and increase by more than 1.2-fold in 58% of samples that have this signature. Similarly, the signature associated with BRCA mutations and defective double strand break repair (signature 3) increases in late clonal evolution in 35% of the samples where it is active. Similar trends also hold between clonal and subclonal phases of tumour evolution (Supplementary Fig. 3).
Chronological time estimates of whole genome duplications and subclonal diversification
Any changes in the mutation rate of cancers influence timing estimates made from mutational data. Due to increased proliferation and in some cases acquired hypermutation, one would generally expect an increase in the mutation rate (per year) in cancer, yet some mutational processes appear more variable than others.
The above analysis of signature changes revealed that the relative contribution of signature 1 usually decreases as other mutational processes become more active (Fig. 4). Mutational signature 1, characterised by CpG>TpG mutations, is a promising candidate for a clock-like process, as it is ubiquitously active in all tissues and has been described as correlating with age in normal tissues23,24 and multiple tumour types25. The latter implies not only that it is fairly constant in a given cell lineage, but also that it varies little across patients. For the purpose of timing mutations in chronological time, only the former property is required, as the age at diagnosis provides a reference by which relative timing estimates are scaled.
The acceleration of overall mutation rate and CpG>TpG rate can be directly estimated from sequencing data of matched primary and relapse samples from the same donor by comparing the rates of mutations that have accumulated between fertilisation and primary diagnosis to those accumulated between diagnosis and relapse. Suitable samples are publicly available for ovarian cancer26, breast cancer27 and acute myeloid leukaemia28. While for all point mutations, the median acceleration ranges between 3.3 for AML and 11.7 for ovarian cancer, CpG>TpG mutations display lower values and less variability (ranging from 2.8 to 6.7; Fig. 5a). To some extent this acceleration may be driven by treatment, but we may use it as a conservative reference for other tumour types.
Accounting for the acceleration above, we inferred the chronological time of whole-genome duplications based on CpG>TpG mutations (Supplementary Methods, section 5; Fig. 5b). While the typical timing of WGD is about one decade before diagnosis (assuming a 5x CpG>TpG mutation acceleration), we observe substantial variability among samples of a given tumour type, with many cases dating back more than two decades. Ovarian adenocarcinoma shows very early occurrences of WGD with approximately half of the samples having WGD more than two decades before diagnosis (Fig. 5b). A similar phenomenon is seen for breast adenocarcinoma. Without any acceleration, the estimated median occurrence of WGD would be 15–25 years before diagnosis for the majority of cancer types; this value decreases with greater values of CpG>TpG acceleration (Fig. 5c).
We used a similar approach to calculate the timing of the emergence of the MRCA, and therefore the onset of subclonal diversification. The typical timing is considerably closer to diagnosis although, interestingly, there are also cases dating back more than ten years before diagnosis (Fig. 5d). We note, however, that timing the occurrence of the MRCA is more difficult, as it is not always possible to calculate the phylogenetic relationship between subclones. The MRCA may date back further if subclones arise sequentially.
While the exact timing of individual samples remains challenging due to low mutation numbers and unknown mutation rates for individual tumours, on average, a picture emerges where across tumour types, the median MRCA ranges between six months and six years before diagnosis, while WGD typically occurs 2-11 years before diagnosis (Fig. 5e). These findings dovetail with epidemiological observations: cancer generally arises past the age of 50 (ref. 29), and the typical latency between carcinogen exposure and cancer detection, most notable in tobacco-associated cancers, is several years to multiple decades30. Furthermore, the progression of most known precancerous lesions to carcinomas usually occurs over multiple years, if not decades31-38. The data presented here corroborate that these time scales also hold in cases without detectable premalignant conditions, raising hopes that these tumours could also be detected in precancerous stages.
Discussion
Taken together, these analyses begin to build an overall picture of tumour development. Across cancer types, early tumour development is characterised by mutations in a handful of canonical driver genes, and biallelic inactivation of tumour suppressor genes, such as TP53. Copy number gains during this time are relatively infrequent in many tumour types, but can be distinctive in others. Throughout the later stages of tumour evolution, increased genetic instability, a greater diversity of drivers, and an acceleration of mutational processes shape the final subclonal diversification.
Our combined approaches allow us to draw timelines of tumour development over different cancer types (Fig. 6; Supplementary Fig. 1). We see that many years before a tumour is diagnosed, endogenous and exogenous mutational processes have resulted in key driver mutations and chromosomal instability. An intriguing finding is that large somatic events, such as WGD, can occur decades before the appearance and diagnosis of a tumour. Thus, the process of tumour development may span an entire lifetime.
Our findings raise the possibility of early detection, if cells carrying early mutations can be detected and distinguished from cells not progressing further. The discovery of distinctive, early mutations in certain tumour types, such as gains of chromosome 7, losses of chromosome 10, and EGFR mutations in glioblastoma, and isochromosome 17q in medulloblastoma, begins to unveil possible candidate lesions.
Individual tumour types show characteristic sets of evolutionary trajectories, reflecting differences in the underlying biology of tumorigenesis (Fig. 6; Supplementary Fig. 1). Where applicable, these trajectories agree with previous studies of genomic aberrations acquired at different stages of tumour progression (e.g. in colorectal cancer3). Unlike most other cancers, high grade serous ovarian adenocarcinomas typically acquire chromosomal gains within the first half of clonal evolution (Fig. 6d). Our findings are consistent with these tumours being the most genomically unstable of all solid cancers39, and with their high frequency of TP53 and homologous recombination repair defects40. Both across and within cancer types, these typical evolutionary trajectories and their correlations with clinical features may provide an opportunity to develop prognostic markers and more effective therapies.
Our findings provide insight into the process of selection acting on tumours throughout their development. The genetic canalization in early tumour development, and increased diversity of driver mutations later in tumour evolution, is striking. It suggests a strong epistasis of fitness effects constraining evolution initially to a small set of mutational events that are able to initiate neoplastic transformation. Over time, as tumours evolve, the small- and large-scale somatic changes they subsequently accumulate propel them towards increasingly specialised developmental paths driven by individually rare, atypical driver mutations.
In summary, we present the first pan-cancer analysis of the evolutionary history of tumours. The timelines we derive from this analysis show that in a wide range of cancer types, tumour evolution often follows a typical pattern. This can begin decades before diagnosis, thus providing a window for early diagnosis and clinical intervention.
Methods
Timing of gains
We used three related approaches to calculate the timing of copy number gains (see Supplementary Methods, section 1). In brief, the common feature is that the expected variant allele frequency of a mutation is related to the underlying number of alleles carrying the mutation according to the formula
$$E[X] = n \, \frac{m f \rho}{N(1-\rho) + C\rho}$$
Here X is the number of reads reporting the mutation, n denotes the coverage of the locus, the mutation copy number m is the number of alleles carrying the mutation (which is usually inferred), and f is the frequency of the clone carrying the given mutation (f = 1 for clonal mutations). N is the normal copy number (2 on autosomes, 1 or 2 for chromosome X and 0 or 1 for chromosome Y), C is the total copy number of the tumour and ρ is the purity of the sample.
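As an illustration of the relation between read counts and mutation copy number, the sketch below computes the expected allele fraction and inverts it to estimate m·f from observed reads. It assumes that the expected fraction of mutant reads is mfρ / (N(1 − ρ) + Cρ), which follows from the definitions given here; the function names and example numbers are hypothetical and this is not the consortium's code.

```python
def expected_vaf(m, f, purity, tumour_cn, normal_cn=2):
    """Expected fraction of reads carrying a mutation present on m alleles
    in a clone of cancer cell fraction f."""
    return m * f * purity / (normal_cn * (1 - purity) + tumour_cn * purity)

def mutation_copy_number(alt_reads, coverage, purity, tumour_cn, normal_cn=2):
    """Invert the relation above: estimate m * f from observed read counts."""
    vaf = alt_reads / coverage
    return vaf * (normal_cn * (1 - purity) + tumour_cn * purity) / purity

# Hypothetical example: clonal mutation on 2 of 3 tumour copies, 80% purity.
print(expected_vaf(m=2, f=1.0, purity=0.8, tumour_cn=3))
print(mutation_copy_number(alt_reads=40, coverage=70, purity=0.8, tumour_cn=3))
```

In practice the inferred value is noisy, so assigning a mutation to a discrete state (m = 1, m = 2, or a subclone) requires a statistical model of read sampling rather than simple rounding.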
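The number of mutations at each allelic copy number then informs about the time when the gain has occurred. The basic formulae for timing each gain are, depending on the copy number configuration:
$$T_{2+1} = \frac{3 n_2}{2 n_2 + n_1}, \qquad T_{2+0} = T_{2+2} = \frac{2 n_2}{2 n_2 + n_1}$$
Here 2+1 refers to major and minor copy number of 2 and 1, respectively, n_1 and n_2 denote the numbers of mutations present on one and on two copies, and T is the time of the gain as a fraction of mutational time. Methods differ slightly in how the number of mutations present on each allele is calculated and how uncertainty is handled.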
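A minimal sketch of how such timing estimates can be computed is given below. It assumes a constant mutation rate per allele, under which the time of a gain (as a fraction of clonal mutational time) is 3·n2 / (2·n2 + n1) for a 2+1 configuration and 2·n2 / (2·n2 + n1) for 2+0 and 2+2. The names `gain_time`, `n1` and `n2` (counts of clonal mutations on one and on two copies) are introduced here for illustration only.

```python
def gain_time(n1, n2, major, minor):
    """Estimate when a gain occurred, as a fraction of clonal mutational time.

    n1, n2: counts of clonal mutations carried on one and on two copies.
    major, minor: allele-specific copy numbers of the segment.
    """
    config = (major, minor)
    if config == (2, 1):
        factor = 3  # three alleles accumulate mutations after the gain
    elif config in ((2, 0), (2, 2)):
        factor = 2
    else:
        raise ValueError(f"unsupported copy number configuration {major}+{minor}")
    total = 2 * n2 + n1
    if total == 0:
        raise ValueError("no mutations on the segment; timing is undefined")
    return factor * n2 / total

# Hypothetical counts: a 2+1 segment with 60 single-copy and 20 two-copy mutations.
print(gain_time(n1=60, n2=20, major=2, minor=1))
```

Because the counts are subject to sampling noise, raw estimates can fall outside the interval from 0 to 1, which is one reason why a treatment of uncertainty is needed in practice.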
Timing of mutations
The mutation copy number m and the clonal frequency f are calculated according to the principles indicated above. Details can be found in Supplementary Methods, section 1.2. Mutations with f = 1 are denoted as clonal, and mutations with f < 1 as subclonal. Mutations with f = 1 and m > 1 are denoted as early clonal (co-amplified). In cases with f = 1, m = 1 and C > 2, mutations were annotated as late clonal if the minor copy number was 0, and otherwise as clonal [unspecified] (Supplementary Methods, section 1.2).
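The assignment rules above can be written as a small decision function. This sketch assumes that f and m have already been resolved to discrete values for each mutation; the function name is hypothetical, and treating every remaining clonal case (for example m = 1 with C ≤ 2) as clonal [unspecified] is an assumption made here.

```python
def timing_category(f, m, total_cn, minor_cn):
    """Assign a point mutation to one of the four timing categories.

    f: clonal frequency (1 for clonal mutations), m: mutation copy number,
    total_cn: total tumour copy number C, minor_cn: minor allele copy number.
    """
    if f < 1:
        return "subclonal"
    if m > 1:
        return "early clonal"  # co-amplified with the gained allele
    if m == 1 and total_cn > 2 and minor_cn == 0:
        return "late clonal"
    return "clonal [unspecified]"  # assumption: all other clonal cases

# Hypothetical mutations on a 3-copy segment.
print(timing_category(f=1.0, m=2, total_cn=3, minor_cn=1))
print(timing_category(f=1.0, m=1, total_cn=3, minor_cn=0))
```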
Timing of driver mutations
A catalogue of driver point mutations was provided by PCAWG working group 2 (ref. 17). The timing category was calculated as above. From the four timing categories, odds ratios of early/late clonal and clonal (early, late or unspecified clonal)/subclonal were calculated for driver mutations against the distribution of all other mutations in the samples with each particular driver. The background distribution of these odds ratios was assessed with 1000 bootstraps (Supplementary Methods, section 3.1).
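The odds ratio calculation can be sketched as follows for the early/late clonal comparison. This is an illustration under stated assumptions rather than the procedure of Supplementary Methods, section 3.1: the pseudocount of 0.5, the resampling of mutation labels with replacement, and all function names are choices made here.

```python
import random

def odds_ratio(driver_early, driver_late, other_early, other_late, pseudo=0.5):
    """Early/late odds of driver mutations relative to all other mutations.

    `pseudo` is an assumed pseudocount that avoids division by zero.
    """
    driver_odds = (driver_early + pseudo) / (driver_late + pseudo)
    background_odds = (other_early + pseudo) / (other_late + pseudo)
    return driver_odds / background_odds

def bootstrap_odds_ratios(driver_labels, other_labels, n_boot=1000, seed=0):
    """Resample 'early'/'late' labels with replacement and recompute the odds ratio."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        d = rng.choices(driver_labels, k=len(driver_labels))
        o = rng.choices(other_labels, k=len(other_labels))
        ratios.append(odds_ratio(d.count("early"), d.count("late"),
                                 o.count("early"), o.count("late")))
    return sorted(ratios)

# Hypothetical labels for one driver gene and the remaining mutations.
driver = ["early"] * 9 + ["late"] * 1
others = ["early"] * 50 + ["late"] * 50
ratios = bootstrap_odds_ratios(driver, others)
print(ratios[25], ratios[975])  # approximate 95% interval
```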
Integrative timing
For each pair of driver point mutations and recurrent copy number variants, the ordering of the pair (earlier, later or unspecified) was established. The information underlying this decision was derived from the timing of each driver point mutation, as well as from the timing status of clonal and subclonal copy number segments. These tables were aggregated across all samples and a sports statistics model was employed to calculate the overall ranking of driver mutations. A full description is given in Supplementary Methods, section 3.2.
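One way to turn aggregated pairwise orderings into a ranking is a Bradley-Terry model, which is one of the two approaches compared in Supplementary Figure 2. The sketch below fits it with a standard iterative (minorisation-maximisation) update; it is not the implementation described in Supplementary Methods, section 3.2. The function name, the input layout and the counts are hypothetical, and pairs with unspecified ordering are simply ignored.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from pairwise ordering counts.

    wins[a][b] = number of samples in which lesion a was timed before lesion b.
    A higher strength means a lesion tends to occur earlier.
    """
    items = sorted(set(wins) | {b for row in wins.values() for b in row})
    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            total_wins = sum(wins.get(i, {}).values())
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get(i, {}).get(j, 0) + wins.get(j, {}).get(i, 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {i: s / norm for i, s in updated.items()}
    return strength

# Hypothetical ordering counts for three lesions.
wins = {
    "APC": {"KRAS": 8, "TP53": 9},
    "KRAS": {"APC": 2, "TP53": 7},
    "TP53": {"APC": 1, "KRAS": 3},
}
strength = bradley_terry(wins)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```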
Timing of mutational signatures
Mutational trinucleotide substitution signatures, as defined by PCAWG working group 7 (ref. 20), were refit to samples with observed signature activity, after splitting point mutations into the four timing categories. Time-resolved exposures were calculated using non-negative linear least squares. Full details are given in Supplementary Methods, section 4.
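The refitting step can be illustrated with a small non-negative least squares solver. To keep the example free of dependencies, a multiplicative update rule stands in here for a dedicated routine such as `scipy.optimize.nnls`; it is valid for non-negative signature profiles and counts. The four-channel signatures and the counts are toy values, not real mutational signatures.

```python
def nnls_multiplicative(signatures, counts, n_iter=500):
    """Minimise ||S x - counts|| subject to x >= 0 by multiplicative updates.

    signatures: list of signature profiles (one list of channel weights each).
    counts: observed mutation counts per channel for one timing category.
    Returns the estimated exposure (number of mutations) per signature.
    """
    k, c = len(signatures), len(counts)
    # Precompute S^T b and S^T S.
    stb = [sum(signatures[i][ch] * counts[ch] for ch in range(c)) for i in range(k)]
    sts = [[sum(signatures[i][ch] * signatures[j][ch] for ch in range(c))
            for j in range(k)] for i in range(k)]
    x = [sum(counts) / k] * k  # start from an even split of all mutations
    for _ in range(n_iter):
        x = [x[i] * stb[i] / max(sum(sts[i][j] * x[j] for j in range(k)), 1e-12)
             for i in range(k)]
    return x

# Toy example: counts generated from exposures of 30 and 70 mutations.
toy_signatures = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]
toy_counts = [28, 10, 10, 52]
exposures = nnls_multiplicative(toy_signatures, toy_counts)
print(exposures)
```

Running the fit separately on the mutations of each timing category and comparing the resulting proportions gives the fold changes in relative signature activity between epochs.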
Real-time estimation of copy number gains
For tumours with multiple time points, the set of mutations shared between diagnosis and relapse (nD) and the set specific to the relapse (nR) were calculated. The rate acceleration was calculated as a = nR / nD × tD / tR, where tD is the time from fertilisation to diagnosis and tR the time from diagnosis to relapse. This analysis was performed separately for all substitutions and for C>T changes in a CpG context.
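The acceleration formula amounts to a ratio of two mutation rates, as the following sketch shows. The function name and the example values are hypothetical; tD and tR are taken to be the time from fertilisation to diagnosis and from diagnosis to relapse, in the same units.

```python
def rate_acceleration(n_shared, n_relapse, t_diagnosis, t_relapse):
    """Ratio of the mutation rate after diagnosis to the rate before diagnosis.

    n_shared: mutations shared between diagnosis and relapse (nD).
    n_relapse: mutations specific to the relapse (nR).
    t_diagnosis: time from fertilisation to diagnosis (tD).
    t_relapse: time from diagnosis to relapse (tR).
    """
    return (n_relapse / n_shared) * (t_diagnosis / t_relapse)

# Hypothetical donor: 3,000 shared and 600 relapse-specific mutations,
# diagnosed at 60 years, relapse 2 years later.
print(rate_acceleration(n_shared=3000, n_relapse=600, t_diagnosis=60, t_relapse=2))
```

In this example the relapse-specific mutations accumulated at roughly six times the average rate seen before diagnosis.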
The correction for transforming an estimate of a copy number gain in mutation time into chronological time depends not only on the rate acceleration, but also on the time at which this acceleration occurred. As this is generally unknown, we performed Monte Carlo simulations of rate accelerations spanning an interval of 0.66 to 1.0 of relative time and averaged the results. Subclonal mutations were assumed to occur at full acceleration. The proportion of subclonal mutations was divided by the number of identified subclones, thus conservatively assuming branching evolution. Full details are given in Supplementary Methods, section 5.
Supplementary Figure Legends
Supplementary Figure 1. Summary of all results obtained per cancer type
(a) Clustered heatmaps of mutational timing estimates for gained segments, per patient. Colours as indicated in main text: green represents early clonal events, purple represents late clonal. (b) Relative ordering of copy-number events and driver mutations across all samples per cancer type. (c) Distribution of mutations across early clonal, late clonal and subclonal stages, for the most common driver genes per cancer type. A maximum of 10 driver genes are shown. (d) Clustered mutational signature fold changes between early clonal and late clonal stages, per patient. Green and purple indicate, respectively, a signature decrease and increase in late clonal from early clonal mutations. Inactive signatures are coloured white. (e) As in (d) but for clonal vs. subclonal stages. Blue indicates a signature decrease and red an increase in subclonal from clonal mutations. (f) Typical timeline of tumour development, per cancer type.
Supplementary Figure 2. Correlation between league model and Bradley-Terry model order of events.
The two approaches for determining the order of recurrent somatic mutations and copy number events are compared directly for each tumour type. We show how the order derived from the league model compares to that derived from the Bradley-Terry model, quantified by Spearman's rank correlation coefficient.
Supplementary Figure 3. Timing of signatures
(a) Fold changes in signature exposures between clonal and subclonal stages in all tumours, sorted by the ratio of tumours with a positive signature change. The violin plot shows the distribution of changes in exposures across tumour types. (b) Heatmap of fold changes in signature exposures (clonal vs. subclonal). Within cancer types, tumours are ordered according to hierarchical clustering. White indicates inactive signatures.
Footnotes
# These authors jointly directed the work