ABSTRACT
Mutations in cancer-associated genes drive tumour outgrowth. However, the timing of driver mutations and dynamics of clonal expansion that lead to human cancers are largely unknown. We used 448,553 somatic mutations from whole-genome sequencing of 843 clonal haematopoietic colonies to reconstruct the phylogeny of haematopoiesis, from embryogenesis to clinical disease, in 10 patients with myeloproliferative neoplasms which are blood cancers more common in older age. JAK2V617F, the pathognomonic mutation in these cancers, was acquired in utero or childhood, with upper estimates of age of acquisition ranging between 4.1 months and 11.4 years across 5 patients. DNMT3A mutations, which are associated with age-related clonal haematopoiesis, were also acquired in utero or childhood, by 7.9 weeks of gestation to 7.8 years across 4 patients. Subsequent driver mutation acquisition was separated by decades. The mean latency between JAK2V617F acquisition and clinical presentation was 31 years (range 12-54 years). Rates of clonal expansion varied substantially (<10% to >200% expansion/year), were affected by additional driver mutations, and predicted latency to clinical presentation. Driver mutations and rates of expansion would have been detectable in blood one to four decades before clinical presentation. This study reveals how driver mutation acquisition very early in life with life-long growth and evolution drive adult blood cancer, providing opportunities for early detection and intervention, and a new paradigm for cancer development.
INTRODUCTION
Human cancers harbor hundreds to hundreds of thousands of somatically acquired DNA mutations. Whilst the majority of such mutations do not affect the cancer’s biology, a minority drive tumour initiation, growth and progression1. These so-called driver mutations occur in recurrently mutated cancer genes, and stimulate the cell acquiring it to expand into a clone. With a large enough clonal expansion, typically abetted by acquisition of further driver mutations, a cancer emerges. Little is known about what ages driver mutations occur, the timelines of clonal expansion over a patient’s lifetime, or how these relate to clinical presentation with overt cancer. Some mutational processes accrue at a constant rate across life, representing a ‘molecular clock’2,3. Knowing this tissue-specific rate of mutation accumulation, it has been possible to infer broad estimates for the timing of driver mutations for some cancers4,5.
In patients with blood cancers, the observation of normal blood counts months to years prior to diagnosis has led to the prediction that tumour development occurs quickly, and therefore driver mutations must occur late in life. Estimates from cancer incidences in Japanese atomic survivors who developed chronic myeloid leukaemia have suggested a mean latency time of only 8 years between BCR-ABL1 induction and clinical presentation6. However, the presence of driver mutations in normal tissues7–12, including blood from healthy individuals who harbor age-related clonal haematopoiesis (CH)13–17, some of whom subsequently develop malignancies, supports a longer multi-hit evolutionary trajectory of cancer. Understanding the absolute timelines of cancer evolution is critical for efforts aimed at early detection and intervention, especially if a given cancer takes decades to emerge after its first driver mutation.
Myeloproliferative neoplasms (MPN) are blood cancers driven by somatic driver mutations in haematopoietic stem cells (HSC) that result in increased mature myeloid cell production18. Most patients harbor JAK2V617F, and this can either be the only driver mutation or occur in co-operation with driver mutations in other genes such as DNMT3A or TET219. The clinical course often spans decades, with phenotypic disease progression occurring upon acquisition of additional driver mutations20. MPNs provide a unique opportunity to capture the earliest stages of tumourigenesis through to disease evolution which are otherwise inaccessible in other malignancies. Here, we undertake whole-genome sequencing (WGS) of individual single-cell derived haematopoietic colonies, and targeted resequencing of longitudinal blood samples from patients with MPN, to assess the absolute timing of driver mutations, tumour evolutionary dynamics, and the fundamental nature of driver mutation mediated clonal selection in vivo.
RESULTS
Using somatic mutations for haematopoietic lineage tracing in patients with MPN
The mutations present in a somatic cell’s genome have accumulated throughout its ancestral lineage, passed from mother to both daughter cells with each cell division. We identified somatic mutations in individual HSCs from patients with MPN, using them to reconstruct the lineage relationships among both malignant and normal blood cells in each patient21. Given that somatic mutation burden does not differ between HSCs and myeloid progenitors21,22, we undertook WGS of in-vitro expanded single-cell derived haematopoietic colonies as faithful surrogates for the genomes of their parental HSCs. We then ‘recaptured’ these somatic mutations in bulk peripheral blood cells using targeted sequencing in order to longitudinally track clones and infer population estimates (Fig.1a, b).
Our cohort comprised 10 patients with MPN diagnosed between ages 20 and 76 years (3 essential thrombocythemia (ET), 5 polycythemia vera (PV) and 2 post-PV myelofibrosis (MF)) (Fig.1b, Extended Table 1). We obtained 952 colonies from 15 timepoints spanning both diagnosis and disease course for WGS to a mean depth of ~14x (Fig.1b). Colonies with low sequencing coverage or evidence of non-clonality were excluded, and 843 were included in the final analyses (Extended Table 2). We identified 448,553 somatic single nucleotide variants (SNV) and 14,851 small insertions and deletions. Variant allele fractions (VAFs) clustered around 0.5 (Extended Fig.1a), confirming colonies derived from a single cell. There were no additional subclonal peaks in the VAF distribution confirming that few mutations were acquired in-vitro (Extended Fig.1b). All patients harbored mutated-JAK2 (9 JAK2V617F, 1 JAK2exon 12), and 9 patients had additional driver mutations, most commonly in DNMT3A (n=8), TET2 (n=4) and PPM1D (n=3) (Fig.1c).
Phylogeny of haematopoiesis and patterns of driver mutations
Phylogenetic trees were reconstructed from the presence or absence of SNVs across colonies (Extended Fig.1c, Methods). The phylogenetic trees of 3 patients with stable disease are shown in Fig.2 and for the remaining 7 patients in Fig.3 – these trees essentially depict the family relationships among cells currently contributing to blood production in each patient. Although the shapes and structures of the trees are unique to each patient, many common themes emerge. In patients with multiple driver mutations, the other drivers occurred both prior to or following JAK2V617F, as well as in independent HSCs, as previously reported23–25.
All patients had mixtures of some colonies with known driver mutations and some colonies without these drivers, suggesting that malignant haematopoiesis (both JAK2- and non-JAK2 mutant clones) co-exist with normal blood production in MPN patients. The colonies without driver mutations shared few, if any, mutations with one another, evident as long, isolated branches in the phylogenetic trees – this demonstrates that the residual non-malignant blood production in MPN patients remains highly polyclonal, as seen in normal individuals21. Colonies with driver mutations typically shared tens to hundreds of mutations, including the driver mutation – this is evident as a ‘clade’ in the phylogenetic tree, namely a set of lineages descending from a shared ancestral branch, and confirms their clonal origin. Immediately beneath the shared branch containing the driver, we observe many short branches, each containing only tens of mutations – this represents a ‘clonal burst’ in which the original mutated HSC expands to a sizeable population of cells. The shortness of these initial branches implies that the clonal burst occurs rapidly – this is especially evident in the clones with multiple driver mutations (Fig.3).
We observed a number of instances in which similar genetic changes were acquired by unrelated clones, so-called ‘parallel evolution’. Chromosome (chr) 9p copy-neutral loss-of-heterozygosity due to uniparental disomy (UPD) was observed as multiple occurrences within the same patient (PD6646, PD5182, PD5117 and PD9478), often with unique breakpoints (PD5117 in Fig.2, Extended Fig.2a-e) as observed before26. Two separate acquisitions of chr1q+ and 9q-were noted in PD5179, affecting different parental chromosomes in each instance (Extended Fig.3a-c). Multiple mutations affecting the same oncogene were also observed within individual patients – 2-3 DNMT3A mutations in PD5847, PD6629, and PD9478; 2 CBL mutations in PD6646, and 2 NF1 and PPM1D mutations in PD5182. (Fig.3). Many of these driver mutations were not part of the MPN clone (defined as the lineage harbouring mutated-JAK2). We also noted two independent acquisitions of JAK2V617F in PD4781 (Fig.3) on different parental chromosomes (Extended Fig.3d-e). Taken together, the parallel evolution of similar genetic aberrations within patients suggests that there are patient-specific factors shaping the evolutionary trajectories of MPNs.
We did not identify novel coding or non-coding driver genes in our patients but noted a clonal expansion in PD6646, aged 80, with no known cancer-associated driver mutation in the ancestral shared branch (Fig.3 Extended Table 3). We also noted smaller wildtype expansions in PD5117, who was 82 years old (Fig.2). These findings provide evidence of the single cell origin of previous reports of CH lacking driver mutations 27, although what drives these clonal expansions and how they relate to old age remain unclear.
Mutation acquisition in adulthood and impact of driver mutations
The number of somatic mutations in individual colonies was corrected for the size of the sequenced genome (affected by both copy-number aberrations and gender) and depth of sequencing. This adjusted SNV burden strongly correlated with patient age (Fig.4a) in keeping with a constant background rate of mutation acquisition throughout life2,21,22. Using a Bayesian approach, we modelled clade-specific background mutation rates in individual patients and accounted for excess SNV accumulation in early life due to rapid proliferation (Methods)28. In wildtype lineages, the median mutation acquisition rate was 17.9 per year (95% confidence interval (CI) 17.2-18.6, Fig.4b). Our estimates were confirmed by orthogonal mixed-effect modelling (Methods), and are consistent with previous studies18,19.
In 7 of 10 patients, the mutant clades exhibited a significantly elevated mutation burden (1.5-5.5 more mutations/year) compared to matched wildtype counterparts (Fig.4b), most noticeably for JAK2-mutated clades. This increase could reflect greater background mutagenesis or increased cell division rates. Telomeres, the protective sequences at chromosome ends that progressively shorten with age and cell division, were significantly shorter in JAK2-mutated colonies (p=0.002, Fig.4c). Given that telomere lengths in individual mutant colonies are not fully independent measures due to their shared clonal origin, we corrected for phylogenetic distance, and still found that JAK2-mutated colonies had shorter telomeres (−864bps, CI 679-1035bp, p<0.001, Methods, Extended Fig.4). There was also a significant increase in C>T transitions at CpG dinucleotides in JAK2-mutated clades compared with wild-type colonies (Extended Table 4, Extended Fig.5) compatible with their increased cell division history3. We used the tops of the trees to capture the earliest stages of life and estimated the number of mutations that are acquired during cell division, as has been previously reported21. From our estimate that 1.7 mutations are acquired per cell division (Extended Fig.6), we calculated that mutant HSCs undergo an additional 0.7-3.9 cell divisions per year relative to wildtype counterparts to account for their higher mutation burden.
Early acquisition of driver mutations during the lifetime of patients with MPN
Using these mutation rates, we then calculated estimates for the age at which driver mutations occurred in MPN patients. Specifically, we estimate the patient’s age at the beginning and end of branches containing driver mutations, as depicted on phylogenetic trees (Fig.2,3, Extended Fig.7) – this provides an age range within which the driver mutation or copy-number aberration most plausibly occurred. Copy-number aberrations were also independently timed by assessing the proportion of heterozygous and homozygous SNVs in affected regions, assuming clade-specific mutation rates (Methods).
In 5 patients in whom mutated-JAK2 was the first driver event in the MPN clade, mutated-JAK2 was acquired very early in life. PD5182, diagnosed at age 32yrs, acquired JAK2V617F between 9.1 weeks post conception (pc) and 4.1 months age (CI 4.2 weeks (pc)-1.3yrs). PD7271, diagnosed at age 20.8yrs, acquired JAK2V617F between 6.2 weeks (pc)-1.3yrs (CI 1 week (pc)-2.2yrs). The remaining 3 patients acquired mutated-JAK2 during childhood at the latest by 8.6yrs (CI around latest age estimate of 7.3-10.1yrs, PD5163), 9.2yrs (7.7-10.8yrs, PD9478) and 11.4yrs (9.1-12.4yrs, PD5117). In these JAK2-‘first’ patients, the mean latency between mutated-JAK2 acquisition and MPN diagnosis was 34yrs (range 20-54yrs). In a further two patients, JAK2V617F occurred as the second driver within a mutated-DNMT3A clade, with disease latencies of 12.1yrs (PD6646) and 27.4yrs (PD6629) from JAK2V617F acquisition. Latency to disease diagnosis from mutated-JAK2 acquisition, irrespective of the ordering, was 31yrs (range 12-54yrs). In the remaining three patients (PD4781, PD5847, PD5179), we were unable to precisely time JAK2V617F due to the presence of additional driver events, eg., 9pUPD, on the same branch. However, timing estimates of 9pUPD acquisition fell to decades before disease presentation (Fig.3) implying that JAK2V617F acquisition occurred even earlier than this.
Mutations in DNMT3A, the gene most commonly detected later in life in the context of age related CH13–15, were also acquired in utero or childhood (Fig.3). PD5182 acquired DNMT3Aess.splice between 19.4-22.2 weeks (pc) (CI 5.8 weeks (pc)-3.8 months), precisely the 20th, 21st or 22nd mutation from the start of life in that lineage. In PD5847, DNMT3AY660F was already acquired by the 23rd mutation, which was between 1.2-7.9 weeks (pc) (CI 4 days (pc)-12.8 weeks gestation). The canonical mutation DNMT3AR882H was acquired by 2.6yrs (CI 1.6-3.8yrs) in PD6629, and PD5163 acquired DNMT3AT275fs*41 by 7.8yrs (CI 6.5-9.5yrs). Acquisition of subsequent drivers was common in patients and separated by decades (Fig.3).
Clonal expansion rates are variable and influence disease latency
The pattern of branching, specifically, the timing of ‘coalescences’ in the mutant clade, as well as the final clonal fraction reached, reflect the rate of clonal expansion from the time of driver mutation acquisition to the time of sampling. We modelled HSC population dynamics with a forward-time simulator using a continuous birth-death process. We used approximate Bayesian computation to generate patient- and clone-specific estimates of the selective advantage, or fitness, of mutant clones, that is, the additional proportion (S) by which each clone expanded per year (Fig.5a, Extended Table 5a and Fig.8, Methods).
We observed variable rates of clonal expansion following acquisition of JAK2V617F (Fig.5a). PD7271, the youngest MPN patient in our cohort, had a rapidly expanding JAK2V617F clone at S=0.68/yr (CI 0.41-0.95). However, PD5117, one of the eldest patients in the cohort with diagnosis 54yrs after JAK2V617F acquisition, had a much slower growing clone (S=0.18, CI 0.13-0.23, Fig.5a), not far above recent growth rates estimated for JAK2V617F-CH29. When JAK2V617F was acquired on a mutated-DNMT3A backdrop, rates of expansion remained variable with S=1.28 (JAK2V617F/DNMT3Ap.? clone in PD6646, CI 0.92-2.66) and S=0.40 (JAK2V617F/DNMT3AR882H clone in PD6629, CI 0.27-0.56, Fig.5a). Overall, rates of clonal expansion following the acquisition of JAK2V617F were inversely proportional to the latency between mutation acquisition and disease diagnosis (Fig 5b).
Additional driver mutation acquisition corresponded with more rapid clonal growth. Within individual patients, we observed successive increases in expansion rates with sequential driver mutation acquisition (PD6629 DNMT3AR882H S=0.26 (CI 0.19-0.35); DNMT3AR882H/JAK2V617F S=0.40 (CI 0.27-0.56); DNMT3AR882H/JAK2V617F/TET2Q744fs*10 S=0.83 (CI 0.36-1.66), Fig.5a). The most rapidly growing clades in the cohort all harboured additional driver mutations, with S ranging from 1.28-2.33/yr (Fig.5a), which translates to the clone doubling in size every 7-10 months. We also observed very slow growing clones. In PD5847, the DNMT3AY660F clone was expanding at a rate of S=0.09 (CI 0.05-0.25, Fig.5a) since its acquisition in the first few weeks of life. This expansion rate is in line with selection estimates of DNMT3A-mutated CH from healthy aged individuals29 and demonstrates how mutation acquisition at the start of life, followed by a lifetime of slow expansion, can underlie CH observed in the elderly. Indeed, at very low selective coefficients, the probability of stochastic extinction of a clone was also higher, as would be expected (Extended Fig.9).
We considered whether selection bias for certain lineages during in vitro culture could have skewed our estimates of growth rates. Somatic mutations from the phylogenetic trees were deep sequenced using targeted recapture in bulk mature blood cells from the same patients (Fig. 1, Methods) to infer population estimates. We observed that clonal fractions of clades from phylogenetic trees were broadly concordant with population fractions measured in bulk blood samples (Extended Fig.10a). We estimated the proportion of lineages that may have diverged from shared branches harbouring driver mutations but which were not captured in our trees and also did not find any significant presence of lost lineages (Extended Fig.10b). This suggests that minimal sampling bias occurred from in vitro culture in our cohort, and also makes it unlikely that different clades are preferentially represented in different mature blood cell types, in accordance with previous observations21. We also used the population fractions of mutant clades in longitudinal bulk samples to corroborate the growth rates inferred from trees. We observed that clonal trajectories modelled from phylogenetic trees were in line with observed changes in clonal fractions in bulk longitudinal blood samples. Any differences in selected patients were explained by clinical interventions, in particular, treatment with interferon-alpha, either before or after sampling for phylogenetic trees (Extended Fig.11).
Early detection of MPN driver mutations and rates of clonal expansion
Given the latency between mutation acquisition and clinical presentation, we asked how much earlier in life one might have detected the mutant clone prior to diagnosis. The growth trajectories for the different mutant clones were estimated from population simulations and correlated with patient age (Extended Table 5b). For the slowest growing JAK2V617F clone (PD5117), we estimated that it took ~40 years for the mutant cell fraction to reach a 1% cell fraction, which was ~10 years before diagnosis. With detection limit of 0.01% cell fraction (1 in 10,000 cells), JAK2V617F would have been detectable ~40 years before diagnosis (Fig.5c). With more rapid JAK2V617F clonal expansion (PD7271), mutant cells would have been detectable at age 8yrs at 0.01% cell fraction (13 years prior to diagnosis, Fig.5c). Overall, with sensitive techniques detecting up to 1 in 10,000 aberrant cells16,30, we estimate it would have been possible to detect JAK2V617F across all patients >10 to 40 years before disease presentation, irrespective of the timing of JAK2V617F acquisition or the final clonal fraction reached (Fig.5d).
A complex and dynamic scenario can arise in patients with several mutant clades. In PD5847, clonal evolution has already occurred at diagnosis when the patient presented with life-threatening portal vein syndrome (Fig.3). In this patient, the in utero acquired DNMT3AY660F clone was growing modestly, taking ~30 years to reach a 1% clonal fraction. However, their JAK2V617F clone then acquired both 9pUPD and TET2N281fs*1 which then rapidly expanded (S=2.33/year, Fig.5e) to reach clonal dominance within a decade prior to diagnosis. This illustrates the rapidity with which clonal progression can occur and emphasises the current unmet need for early detection of driver mutations and estimation of clonal trajectories for risk stratification of patients.
Discussion
Our knowledge of the absolute timing of genomic aberrations and rates of expansion that drive human cancers is rudimentary. Sequencing of bulk cancer tissues, the final snapshots of a complex multi-step process of tumourigenesis, has captured the landscape of intra-tumour heterogeneity31, the relative ordering of genomic events24,32,33, and have provided broad estimates indicating that some chromosomal gains may occur several years to a decade or more before cancer presentation4,5. By using somatic mutations for lineage tracing, we re-trace life histories from early embryogenesis through to the single cell HSC origin of MPN and CH, and delineate driver mutation timing, clonal selection and clonal evolution over the life of 10 patients with MPN.
MPN are a common chronic blood cancer prevalent in up to 1 in every 1000-2000 individuals, with an increasing incidence with age34,35. Our study reveals, however, that regardless of age of diagnosis JAK2V617F is usually acquired early in life (earliest appearance few weeks post conception). DNMT3A mutations, most commonly associated with age related CH13–15, were also acquired in utero and during childhood with slow and steady growth over life. Our 10 patients were not pre-selected other than to capture a broad range of ages, yet to find early driver mutation acquisition in all 10 consecutively sequenced patients makes it highly probable that this would be a feature of the majority of MPN patients. Our model of in vivo MPN development is also consistent with previous observations of JAK2V617F detection prior to MPN diagnosis and in cord blood 36,37. Furthermore, the early acquisition of mutations associated with CH that slowly grow over life, provides an explanation for the very low burden CH detected in younger adults in some studies16,30. We identified that clonal expansions spanned the lifetime of MPN patients, often with sequential driver mutation acquisitions, including JAK2V617F as a second driver event, as observed before23. The lifelong trajectories to MPN provide a new model for cancer development which may be relevant to other organs, given the abundance of mutations under selection across histologically normal tissues7,9–11,13–16, and the experimental tools developed here could be applied to other cancers.
MPNs are unique in that 40% of patients only harbour a single somatically acquired genomic event, eg. in JAK219. However, we find the rate of clonal expansion upon JAK2V617F acquisition to be variable, which may reflect differences in cytokine homeostasis, iron stores, bone marrow microenvironment, HSC lineage bias, inflammatory insults and germline influences38–42. Such factors may dynamically contribute over the lifetime of the individual. Factors influential in early life, due to altered embryonic HSC properties or stochastic drift, could also shape the future trajectories of nascent clones. Overall, growth rates associated with JAK2V617F in MPN patients were greater than that inferred for JAK2V617F-CH29, which may account for the relative lack of clinical manifestations in the latter group. Rates of expansion for CH clones harbouring mutated-DNMT3A were 0.09-0.38/yr, in line with estimates from population based cohorts with CH29. We were unable to identify environmental or germline ‘triggers’ for either the start, or the speed of driver-mediated “clonal bursts” in our small cohort. Specifically, there were no prior exposures to chemotherapy. Concurrent potential exposures during life included smoking, obesity, infections (eg hepatitis C, mumps and whooping cough) and pregnancy.
MPN diagnosis is currently defined phenotypically, by blood count parameters and bone marrow histomorphology43. Our results indicate that the point at which a clinical diagnosis is made, represents one time-point on a continuous trajectory of lifelong clonal outgrowth, at which blood counts have reached certain thresholds, or clinical complications have already occurred. Current diagnostic criteria do not best capture when patients begin to have disease as life-threatening thromboses often trigger diagnosis. Furthermore, diagnostic criteria do not capture when individuals begin to be risk as those harbouring JAK2V617F in the general population have increased risk of thrombosis, altered blood count parameters and gravely increased risk of future MPN44. Our data show that mutant clones will generally have been present for 10 to 40 years before diagnosis and would have been detectable for much of this time using sensitive assays. Clonal fractions of 1%, the common cut-off used for population screening studies, already reflect decades of clonal outgrowth, and the rate of expansion of JAK2V617F strongly influences latency to disease presentation, more so than age at JAK2V617F acquisition or clonal fraction at diagnosis. Taken together, the key to early detection and prevention may lie in both detecting low burden mutant clones early and in establishing their rate of growth, by repeated sampling, to capture those individuals on a future trajectory to clinical complications. The cornerstone of MPN management currently is aimed at normalising blood counts and reducing risk of thrombotic or haemorrhagic events – such treatments are mostly safe and well-tolerated, and could be offered to individuals with high-risk molecular profiles. Our data also provide a strong rationale for the ongoing evaluation of measures45–48 that target the JAK2V617F clone in order to curb clonal expansion and subsequent clonal evolution.
Author contributions
JN, ARG and PJC conceived and directed the study. NW performed genomic, phylogenetic and population dynamics analyses with JN. JL assisted with signature and telomere analyses. LM assisted with low-input sequencing and mutation signature analysis. ALG assisted with clinical information. JN and EJB prepared samples. JN wrote the manuscript with input from coauthors. Authors reviewed and approved the manuscript.
Competing Interest Declaration
The authors declare no competing interests.
METHODS
No statistical methods were used to predetermine sample size. The experiments were not randomised and the investigators were not blinded to allocation during experiments and outcome assessment.
Patients and Samples
Patients were selected from JAK2-mutated MPN patients attending Cambridge University NHS Trust Hospital, UK, that had undergone previous whole exome sequencing. Apart from ensuring a wide representation of ages, patients were chosen at random. In doing so, we found that we captured different MPN subtypes, clinical presentations varying from asymptomatic blood count abnormalities to life threatening thrombosis, different MPN therapies, stable and progressed disease, and a wide mutation spectrum. This allowed us to capture a broad cross-section of MPN. Peripheral blood and bone marrow samples were obtained from patients with myeloproliferative neoplasms attending Cambridge Universities NHS Trust following written informed consent and ethics committee approval. Recapture samples were peripheral blood derived granulocytes apart from two samples which were whole blood and peripheral blood derived mononuclear cells. Constitutional samples were obtained from either buccal or T-cell DNA.
In-vitro colony culture and next generation sequencing
Peripheral blood mononuclear cells were isolated from patient blood or bone marrow samples, cultured for 14 days in MethoCult 4034 (Stemcell) and single erythroid haematopoietic colonies (burst forming unit-erythroid, BFU-E) were plucked and lysed in 50ul of RLT lysis buffer (Qiagen). Library preparation for whole genome sequencing used enzymatic fragmentation and the NEBNext Ultra II low input kit (NEB) with 150bp paired end on either Illumina HiSeqX or Novaseq machines. Reads were aligned to the human reference genome (NCBI build37) using BWA-MEM.
Somatic variant identification, genotype matrix of samples and variant loci level filtering
Single nucleotide variants (SNV) were identified using CaVEMan49 for each colony by comparison with both a matched germline sample, as well as an unmatched normal sample. Short Insertions and deletions were called using cgpPindel50. Copy-number aberrations (CNA) were identified using a matched normal ASCAT analysis. The union of colony SNVs and Indels was taken and reads counted across all samples belonging to the patient (colonies, recapture samples, buccal and T-cells) using VAFCorrect. The genotype at each locus within each sample is either 1 (present), 0 (absent) or NA (unknown). We inferred the genotype in a depth sensitive manner. Using a binomial model, we inferred genotype as either mutant or wildtype. The genotype was set to the most likely of the two possible states provided one of the states is at least 20 times more likely than the other. Otherwise the genotype is set to missing (NA). The VAF is usually 0.5 for autosomal sites, but for Chromosomes X, Y and CNA sites, we conservatively set it to 1/ploidy. For loss-of-heterozygosity (LOH) sites the genotype is overridden and set to missing if it is originally 0. A germline SNV Filter allowed for somatic mutation variant identification in the presence of modest levels of contamination in the germline sample and removal of germline variants. Further loci were removed if within 10bp of an indel, 10bp of each other, where more than a fifth of the colonies have depth<6, where more than a fifth of colonies had a missing genotype; where all samples harboured the variants; where the total mutant read count was significantly less than 0.9*Expected VAF*total depth across all colonies that have genotype=1 (Binomial Test); where the total mutant read count was significantly less than 0.9*0.5*total depth across all colonies that had at least 2 mutant reads (Binomial Test); and where the total mutant read count at sites with genotype=0 was significantly greater than 0.01*Depth.
Colony Level Filters
Colonies were initially removed if the CaVEMan detection sensitivity was below 60%. In addition, colonies were tested for cross-contamination with other colonies from the same patient after tree construction. We required that the mean VAF of variants from a colony that mapped to private branches were not significantly less than 0.45 (Bonferroni adjusted one-sided binomial test) as this is evidence of some cross-contamination. For colonies that were not part of the same clade, cross-contamination was tested for by identifying colonies that exhibited a VAF significantly more than 5% VAF on non-ancestral branches (Bonferroni adjusted one-sided binomial test).
Construction of phylogenetic tree topology
We constructed trees using Maximum Parsimony with MPBoot51. The input for the method is alignments based on the genotype matrix with missing values. This method also enables the rapid generation of bootstrap trees that correspond to the Maximum Parsimony trees that would be obtained by resampling the genotypes columns with replacement. Only SNVs were used to infer the topology but both SNVs and Indels were subsequently assigned to the branches. We developed an Expectation Maximisation method (R package “treemut”) to soft assign mutations to tree and to simultaneously estimate branch length. Having estimated the maximum likelihood probability that each mutation belongs to each branch we then hard assign mutations for ease of interpretation. Simulations indicated that this approach did not exhibit obvious biases in branch length estimation vs true branch length and that using the edge length implied by the hard assignment has a minimal effect on the deviation from the true edge length. Driver mutations and CNAs were assigned to the corresponding branches of the tree. Independent CAN events were confirmed by distinctness of breakpoints and haplotypes.
Branch Length Adjustment for SNV Calling Sensitivity
The length of private branches of low depth colonies are likely to be underestimated because of the limited sensitivity of the variant calling. The per colony sensitivity was estimated in a non-parametric fashion from the identified germline SNVs by measuring the proportion of germline sites that were called by CaVEMan and private branches were scaled by 1/sensitivity. A similar approach is taken for shared branches where the sensitivity is estimated as the proportion of germline sites in which at least one of the colonies that share the branch has the variant called.
Testing genes under selection
Genes that exhibit a deviation of ratio of non-synonymous to synonymous variants were evidence of non-neutrality. The SNVs and Indels were grouped by individual branches across the cohort and R package “dndscv” was used.
Targeted sequencing of bulk peripheral blood (recapture) samples
We used Agilent SureDesign to design a baitset that captured all unmasked shared mutations. For private branches, we assayed up to 4 randomly selected unmasked variants per year of approximate branch length, capping the total number per private branch at 80. Variants that passed the SureDesign “most stringent” filter were preferentially selected. Sequencing on Illumina Novaseq was undertaken to depth of roughly 300-400x across all recapture samples.
Mutational signatures
The SNVs that were mapped to the 10 patient trees were divided into per patient driver/wild-type clade-based groupings, where the mutations mapped to the clades were treated as an individual sample for the purposes of mutational signature extraction. De Novo Signature extraction was carried out using SigProfiler. Additional analysis was carried out using MutationalPatterns and custom scripts.
Mutation rate estimations
We estimated mutation burden in the wildtype cells using two methods. We used linear mixed effects modelling to estimate the wildtype mutation rate using mutation burdens adjusted for depth of sequencing and excluded CNA regions, with age as a random effect and depth as a fixed effect. Only the timepoint with the most wild-type colonies is included in the model (this only affects PD5182). This fitted model has intercept 98.3 (95% CI 19.2-177.5) and the per patient wild type mutation rate in mutations per year is drawn from a distribution with mean 16.5 (95% CI: 14.8-18.2) and a variance of 1.1. The second method uses Bayesian modelling to jointly fit wild type rates, mutant rates and absolute time branch lengths under the assumption that the observed branch lengths are Poisson distributed with Mean = Duration × Sensitivity × Mutation Rate. The model incorporates an excess mutation rate in early life that will add an average of 33.5 mutations during the first 6 months post conception. The mean branch timings are directly sampled from the posterior distribution and by construction the resulting trees are guaranteed to have a root to tip distance that matches the sampling age of the colony. Models were fitted across four chains each with 20,000 iterations including 10,000 burn-in iterations. The model was coded in R and Rstan and inferred using the Rstan implementation of Stan’s No-U-Turn sampler variant of Hamiltonian Monte Carlo method52. The resulting posterior mean and standard error of each patient’s wild type rate were combined in a random effects meta-analysis using the R package metafor.
Methods for timing LOH and amplifications
The timing methods require a rough approximation for the number of expected detectable mutations, L, in the LOH/CNA region for the duration of the branch. Firstly, we estimate a local relative somatic mutation rate for mutations detectable by CaVEMan in autosomal regions. The rate is measured by counting distinct mutations across a panel of samples consisting of those colonies in the 10 patients that do not exhibit copy number aberrations (350,371 mutations across 594 colonies). The genome is divided into 100Kb bins and the number of passed somatic mutations is counted across all samples in the panel, to give a count ci for bin bi. The probability that a given mutation occurs in bin i is estimated by and with standard error in that estimate of For a given copy number region H then . For a branch of duration t and with global mutation rate λ then E(L) = λtp(mut) ∈ C) and where it should be noted that errors in t and λ are not included in the variance estimation. All somatic mutations that occur prior to the LOH event, occurring a fraction x along the branch, will be homozygous with detection sensitivity SHOM and those after will be heterozygous with detection sensitivity SHET. We model the mutations as arriving at a constant rate along the branch and fit the following model for x : NHET ~ Poission(1 − x)LsHET) and NHOM ~ Poission(xLsHOM) with priors x ~ Uniform(0,1) and L ~ N(E(L, Var(L) and where SHOM = 0.5 (assuming perfect detection of homozygous mutant variants) and SHET is estimated from germline SNPs as previously discussed. Somatic mutations that occur prior to the CNA event, occurring a fraction N along the branch, have an equal chance of exhibiting VAF=1/3 or VAF=2/3, whereas those occurring after the event will always have VAF=1/3. and where the priors are as in the LOH model above. The detection sensitivities s1/3 and s2/3 are similar to shet because of the additional sequencing depth afforded by the duplication. Unlike the LOH case, the value of x will be relatively unaffected by L because of the similarity of s1/3 and s2/3.
Telomere analysis
The mean telomere lengths of the colonies were estimated using Telomerecat with batch correction using cohort wide information to correct the error in F2a counts as detailed. Given the slight discrepancies in length found on multiple readings, each telbam was analysed 10 times and the average length was used. Given that colonies sharing a driver mutation share a more recent common ancestor, and are therefore, not truly independent measures of telomere lengths, we measured telomeres adjusting for phylogenetic distance between colonies. To specifically ask how much shorter telomere lengths become as a result of a JAK2-mutation, we fit a phylogeny aware mixed model for the mean telomere length with a patient specific intercept using the MCMCglmm library in R (Iterations = 100001:1099001; Thinning interval = 1000; Sample size= 1000; DIC: 11187.69). This identifies an average loss of 21bp of telomere length per year of life and that JAK2-mutant clades exhibit ~864bp shorter telomeres.
Continuous Time Birth Death Model
Each cell has a rate of symmetric division and a rate of symmetric differentiation (or death). Asymmetric divisions do not affect the HSC genealogy and are therefore not explicitly included in the model. Let l be the wild type rate of symmetric division, measured in divisions per day. We model selective advantage s as an increased rate of symmetric division αmut = α(1 + s). We assume during the growth phase that the cells population grows unrestrained by death. Once the specified equilibrium population size, N, is reached then the death rate β, which is the same for every cell, matches the average division rate. Following the acquisition of a driver the mutant cell population grows stochastically until the population is sufficiently large, when the growth becomes essentially deterministic following a logistic growth function where in the early stages the exponential growth process exhibits an annual rate of growth S, given by: S = exp(αs) − 1. The above model is implemented using the Gillespie algorithm where the waiting time until the next event is exponentially distributed with a rate given by the total division rate + total death rate, this event is then division with probability=total division rate/(total division rate + total death rate). If the event is Division then the choice of which cell is given by a probability proportional to the cell’s division rate whereas if the event is Death then all cells are equally likely to be chosen. Implementation was in C++ with an R based wrapper. The simulator maintains a genealogy of the extant cells, together with a record of the number of symmetric divisions on each branch, the absolute timing of any acquired drivers and the absolute timings of branch start and end. The package also provides mechanisms for sub-setting simulated genealogies whilst preserving the above per branch information.
Approximate Bayesian Computation
Approximate Bayesian Computation (ABC) was used to reproduce the timing and shape of clonal expansions in our phylogenetic trees using our simulator to generate sampled trees. The procedure was as follows: Infer mutation rate, λ, as the mean root to tip distance divided by the age of sampling. The observed mutation count at the start and end of the branch carrying the driver in the experimental tree is denoted by Mstart and Mend respectively. Fix symmetric division rate at 1 division per year. Sample N from Log10(N) ~ Uniform(3,6.5). Sample age of driver acquisition by resampling mutation counts from a Poisson distribution: . Sample S from S ~ Uniform(0.05,5). Simulate the population: Simulate the tree with initial division rate of 0.1 per day until population has grown to the equilibrium population size. Simulate neutral evolution until time Tdriver. Save the state of the simulation (*). Introduce the driver with the specified selection coefficient. If the driver lineage dies out before the sampling age is reached then return to the saved state (*) and try again. A tree with the observed number of mutant samples is subsampled from the population of extant cells. For each simulation the following summary statistics were calculated (i) Total deviation in the simulation mutant clade’s number of lineages through time (LTT) with respect to the patient clade of interest, and (ii) deviation of the simulation-based population clonal fraction with respect to the patient clade’s aberrant cell fraction. In all cases but two the aberrant cell fraction was calculated as the proportion of sampled mutant clades. For PD5163_JAK2 and PD5182_JAK2 the aberrant cell fraction at diagnosis is used, prior to the commencement of interferon-alpha treatment which led to clone size reduction. The total distance score is then calculated as a combination of the above two scores. For each clade between 726,678 and 959,453 simulations were run. In each case the posterior distribution was approximated by the top 0.02% simulations.
Aberrant Cell Fraction
In recapture samples, a per branch aberrant cell fraction can be calculated as twice the aggregate mutant read fraction where only autosomal variants that map to the branch and are outside of the copy number aberrant regions are included. For each patient the trajectory of the aberrant cell fraction was retrieved from the top 0.2% of simulations. The simulator periodically takes snapshots at approximately daily intervals and measures the aberrant cell fraction as the current fraction of cells that carry the driver mutation. Subsequently, for each simulation these daily snapshots are binned into a common sequence of 10-day periods over each of which the aberrant cell fraction is averaged. We present the 2.5%, 50% and 97.5% quantiles for each binned period calculated across the simulations.
Stochastic Extinction
The probability of extinction in a homogenous birth death process is the ratio of death rate to birth rate53 which in our case is or equivalently . In the ABC simulations we record the number of attempts required to introduce the drivers. We then verified that the simulator behaved as expected by gathering all simulations across the analysed mutant clades and then restricted to the 13,048,861 simulations where the driver was introduced after development (>1 year post conception). The simulations were binned into Selection Coefficient bins of width=0.05 and log10(N) bins of width=0.1. The extinction probability was estimated in each bin using the maximum likelihood estimator where à is the number of simulations in the bin and each simulation, i, fixes after ai attempts.
Start of Life Polytomies
The polytomies were used to estimate lower and upper bounds for the mutation rate per symmetric division during embryogenesis21, whereby the number of edges with zero mutation count at the top of the tree (up to the first 10 mutations) is inferred from the number and degree of polytomies assuming an underlying tree with binary bifurcations. The mutations per division are assumed to be Poisson distributed. A maximum likelihood range is then calculated in two steps first using the 95% confidence interval of the proportion, p, of zero length edges and with this leading to a maximum likelihood estimate for the Poisson rate as −log p.
Acknowledgements
We thank Cambridge Blood and Stem cell biobank, funded by the Cambridge Cancer Centre and Wellcome Trust Cambridge Stem Cell Institute, CASM and DNA pipelines for their assistance. We thank Sam Behjati and Claire Harrison for comments. The study was supported by Cancer Research UK (JN), EHA Research Award (JN), MPN Research Foundation (JN) and the Wellcome Trust (PJC, ARG). PJC is a Wellcome Trust Senior Clinical Fellow. Work in the ARG Lab is supported by the Wellcome Trust, Bloodwise, Cancer Research UK, the Kay Kendall Leukaemia Fund, and the Leukaemia and Lymphoma Society of America. JN is a CRUK Clinician Scientist fellow. We thank the patients for their participation in the study.
Footnotes
↵† senior authors