Lineage tracing of human embryonic development and foetal haematopoiesis through somatic mutations

To date, ontogeny of the human haematopoietic system during foetal development has been characterized mainly through careful microscopic observations. Here we used whole-genome sequencing (WGS) of 511 single-cell derived haematopoietic colonies from healthy human foetuses of 8 and 18 post-conception weeks (pcw) coupled with deep targeted sequencing of tissues of known embryonic origin to reconstruct a phylogenetic tree of blood development. We found that in healthy foetuses, individual haematopoietic progenitors acquire tens of somatic mutations by 18 pcw. Using these mutations as barcodes, we timed the divergence of embryonic and extra-embryonic tissues during development and estimated the number of blood antecedents at different stages of embryonic development. Our analysis has shown that ectoderm originates from a smaller set of blood antecedents compared to endoderm and mesoderm. Finally, our data support a hypoblast origin of the extra-embryonic mesoderm and primitive blood in humans.


Introduction
Human embryonic development starts with the division of the zygote to two daughter cells called blastomeres. After several further divisions, at approximately the 100-cell-stage, the cells physically separate into the inner cell mass (ICM) which is fated to become the embryo, and the trophoblast which develops into the chorionic sac and foetal component of the placenta. The ICM then undergoes further differentiation into the epiblast and hypoblast, which primarily contribute to embryonic and extra-embryonic tissues respectively. At 3 weeks of development, the epiblast differentiates into the three germ layers (ectoderm, mesoderm and endoderm). Ectoderm gives rise to the epidermis and nervous system; mesoderm generates the heart, kidney, skeleton, muscles and connective tissues; while endoderm forms the liver, lungs and inner layer of the digestive tract 1 .
The first haematopoietic cells (mainly nucleated erythrocytes and, to a lesser extent, macrophages and megakaryocytes 2-4 ) start to form within blood islands in the yolk sac from 16 post-conception days (pcd). However, the developmental origin of these so-called primitive haematopoietic cells in humans is still debated. Yolk sac blood islands arise from the extraembryonic mesoderm 5 . The precise origin of this extra-embryonic tissue remains unclear with some evidence in mice pointing to the primitive streak 6,7 , while histological studies in humans suggest it derives from the hypoblast 7,8 . Interestingly, human foetal histological studies have shown that at 8 pcw more than 90% of erythroid cells in circulating blood are still of primitive origin, while more than 80% of those found in liver derive from the aorta-gonad-mesonephros (the so-called 'definitive' wave of haematopoiesis). Thus, it has been speculated that definitive erythrocytes are temporarily retained in the liver 9 .
tracing studies 12 . By analysing patterns of mutations that have accumulated in somatic cells over time, it is possible to identify clonal relationships between cells and thereby reconstruct the phylogeny of a population 13 . In this study, we performed whole-genome sequencing (WGS) of single-cell-derived blood colonies from human foetal HSPCs, coupled with deep targeted sequencing of matched tissues of known embryonic origin, to build a phylogenetic tree of foetal development (Fig. 1a).
We first collected haematopoietic organs from two karyotypically normal male human foetuses  Table 1). The obtained sequencing depth (mean 22.6× for the 8 pcw colonies and mean 12.2× for the 18 pcw colonies) was sufficient to reliably call single-nucleotide variants (SNVs) (Extended Data Fig. 2). We applied stringent filtering criteria to remove germline mutations, recurrent artefacts from library preparation and sequencing, and mutations that were likely to have been acquired in vitro.
To validate the filtering strategy, we performed targeted resequencing of 501 of the 511 colonies to estimate the proportion of mutations that passed our filtering but were in fact subclonal or artefactual. Over 96% of mutations shared by more than one colony in the 8 pcw foetus and 100% of those in the 18 pcw foetus had variant allele fractions (VAFs) consistent with somatic mutations acquired in vivo (Extended Data Fig. 3a, b). Mutations passing filtering criteria were assigned to each cell. Mutations that were shared by multiple colonies were used to reconstruct phylogenetic trees of blood development (Fig. 1b, c). We constructed a tree in which the length of each branch is proportional to the number of somatic mutations accumulated over time within the inferred ancestral lineage. For clarity, here we refer to lineage as the direct descendants of a given ancestral cell. In addition to HSPCs, we collected non-haematopoietic tissues of mesodermal, ectodermal, endodermal and extra-embryonic origin from the same two foetuses. We dissected the placenta, gut, kidney, heart and skin from the 8 pcw foetus; and kidney, heart and skin from the 18 pcw foetus. All tissues were fixed, paraffin-embedded, sectioned and cut using lasercapture microdissection (LCM) to obtain well-defined tissue structures (which we refer to as microbiopsies from now on) of specific extra-embryonic and embryonic origin. Microdissected tissues included the mesenchymal core from the placenta (deriving from the extra-embryonic mesoderm), syncytiotrophoblast (from the trophoblast), circulating blood (from the primitive wave of haematopoiesis), gut epithelium (from the endoderm), heart muscle (from the lateral plate mesoderm), kidney tubules (from the intermediate mesoderm), limb deep undifferentiated tissue (from mesoderm) intervertebral disc (from the paraxial mesoderm), epidermis (from the surface ectoderm) and skin peripheral nerve (from the neural ectoderm) (Extended Data Fig. 4). We then performed deep targeted sequencing (Extended Data Table   2) using custom DNA bait sets designed to include all shared and a majority of private somatically-acquired SNVs (please see Methods) detected in HSPC colonies by WGS.

Acquisition of somatic mutations in early embryonic and foetal development
We estimated mutation burdens following adjustments for sensitivity and removal of in vitro mutations. In the 8 pcw foetus, the mean corrected SNV burden was 25 mutations per colony (standard deviation (sd) 6.1 mutations per colony) and in the 18 pcw foetus, the mean SNV burden was 42.6 mutations per colony (sd 7.7 mutations per colony) (Extended Data Fig. 5a).
Interestingly, the observed standard deviations are broadly equivalent to that expected for a Poisson distribution, suggesting that all the lines-of-descent sequenced within each foetus have experienced the same mutation rates during development.
We detected a mean of 1.82 indels (range: 0 -6) in the 8 pcw colonies and 1.23 (range: 0 -5) in the 18 pcw colonies (Extended Data Fig. 5b). Due to the low numbers of indels, an equivalent sample-specific correction approach was not feasible. However, we used a Bayesian Poisson regression model to calculate a corrected mean indel burden of 2.09 indels per colony (95% confidence interval 1.90 -2.29) for the 8 pcw foetus and of 2.13 indels per colony for the 18 pcw foetus (95% confidence interval 1.91 -2.39). On average, the proportion of SNVs shared between multiple HSPC colonies was 0.62 for the 8 pcw foetus and 0.49 for the 18 pcw foetus (Extended Data Fig. 5c).
To further examine the genomic distribution of mutations identified by WGS we assessed the proportion of SNVs mapping to different genomic features. We identified a higher than expected proportion of SNVs in intergenic regions of the genome (Extended Data Fig. 6a). In contrast, we detected a lower than expected proportion of mutations in the introns, exons, and regions 5 kb upstream/downstream of the gene body (Extended Data Fig. 6b-c). This is consistent with previous reports of lower mutation rates in regions of transcription and accessible chromatin 14-16 .
The 96-profile mutation spectrum, which incorporates the trinucleotide context of the substitution, revealed a predominance of C>T transitions, primarily at CpG sites, which are associated with spontaneous deamination of methylated cytosine 17 . In both foetuses, the mutation spectrum of clonal mutations seen only in single colonies closely matched those of mutations shared by two or more colonies, confirming that the mutation spectrum does not noticeably change across the first half of gestation (Extended Data Fig. 7).

Defining clonal relationships between foetal HSPCs
To determine clonal relationships between HSPCs, we constructed a phylogenetic tree for both 8 and 18 pcw foetuses (Fig. 1b, c). The root of a somatic lineage tree is, by definition, the zygote -it is likely that the initial bifurcation of the phylogeny therefore marks the first zygotic division to two blastomeres. Interestingly, the reconstructed phylogenetic trees revealed unequal contribution of each blastomere to the blood compartment (5:1 for 8 pcw and 59:1 for 18 pcw) indicating even higher asymmetry of embryonic contribution is possible for the first zygotic cell division than previously observed 13,18 . From this point onwards in our two foetuses, there was a profusion of high-degree polytomies, defined as branch-points in the phylogenetic tree generating more than two descendants. In the context of a somatic lineage tree, a polytomy can only occur when there are cell divisions that generate no detected somatic mutations. Of the 5 cells from the 8-cell stage that contributed significantly to the HSPC population there are four polytomies (12-, 10-, 5-, 5degree) and one dichotomy. Modelling of these polytomies suggests that the mutation rate decreases to less than 0.9 mutations/cell division at this stage of embryonic development (approximate Bayesian computation, see Methods). This coincides with zygotic genome activation in humans 20 , which may enable expression of DNA repair machinery.

Timing the divergence of gut and blood during foetal development
In order to precisely determine the timing of individual mutation acquisition relative to key developmental events (i.e. trophectoderm formation, gastrulation and organogenesis), we deep-sequenced non-haematopoietic tissues for the somatic mutations that had been identified in HSPCs (mean tissue sequencing depth 450×, Extended Data Table 2). As we have seen, the mutation rate during embryogenesis is sufficiently informative that the phylogenetic tree essentially represents a high-resolution lineage tree defining the initial cell divisions of the embryo. A given mutation occurs in a specific embryonic cell, which can be placed accurately on the lineage tree -this mutation then acts as a permanent clonal mark inherited by all descendants of that cell. Using targeted sequencing of different tissues, with different developmental origins, we can identify mutations that are present in multiple tissue layers, organs or regions -such mutations must have occurred in cells that preceded the developmental split between those tissues. Therefore, the tissues that shared the fewest mutations with HSPCs were likely to be the first to separate from ancestral lines-of-descent that gave rise to the haematopoietic cells (blood antecedents from now on) and greater sharing of mutations in a given tissue suggests it diverged from blood antecedents later in development.
Using laser-capture microdissection, we generated multiple microbiopsies of nonhaematopoietic tissues, such as the epithelium of the gut, which derives from the endoderm (Fig. 2a). DNA from the microbiopsies underwent targeted sequencing to characterise the presence or absence of each mutation, as well as the fraction of cells carrying each mutation in that microbiopsy.
For gut epithelium, we observed that mutations close to the root of the HSPC phylogenetic tree were universally present in the different microdissections, and contributed to high fractions of cells in the individual biopsies (Fig. 2a). From this, we can infer that the earliest cells in the embryo, such as those in the 4-cell and 8-cell cleavage stages, contributed large numbers of descendants to each of the gut microbiopsies (as well as blood HSPCs). As we traverse the phylogenetic tree from the root towards the tips, we found that the mutations became progressively less well represented in individual microbiopsies, before eventually becoming undetectable. Since each microbiopsy only samples a few hundred to thousand cells, and the average sequencing coverage was 70x per microdissection, we aggregated data from all 12 microbiopsies of gut epithelium to improve our statistical power for detecting mutations occurring later in embryogenesis (Fig. 2b). The aggregated data for gut revealed that many lineages (≥60) made substantial contributions to both gut epithelium and blood HSPCs. These lineages must have predated gastrulation, when endoderm and mesoderm split from one another. Following individual lines-of-descent from root towards tip, different patterns emerged -for some, the cellular contribution of successive generations in that line-of-descent decreased steadily, suggesting that both daughter cells of each successive cell division contributed to gut epithelium. For others, the cellular contribution to gut epithelium stayed relatively constant over several generations, suggesting that only the daughter cells in that line-of-descent contributed meaningfully to gut epithelium.

Timing specification of extra-embryonic and embryonic tissues
We proceeded in the same way, using LCM and deep targeted sequencing, to characterise all the embryonic and extra-embryonic tissues available to us (Extended Data Table 2). This enabled us to time the progressive divergence of different tissues from blood antecedents during embryonic development ( Fig. 3a, b).
In the mouse, the molecular distinction between trophoblast and ICM occurs at the 8-16 cell stage, but the exact timing in humans remains unknown 21 . In the 8 pcw foetus, we found that the trophoblast-derived tissue was the first to diverge from blood antecedents at the 4-to 16cell stage embryo (corresponding to 2 to 4 generations from the zygote; generation is defined here as a round of cell division), thus following a timeline similar to mouse development. We can directly infer the presence of five distinct blood antecedents present at the time of trophoblast divergence in the embryo (Fig. 3c). The next major event in embryogenesis is epiblast specification. We can infer that a minimum of 20 blood antecedent lineages already existed prior to epiblast specification (Fig. 3c). A recent study of in vitro human development up to the pre-gastrulation stage reported that specification of trophoblast, hypoblast and epiblast was evident by immunostaining of specific markers at 8 days post fertilisation. At this stage, the embryo comprised ~500 cells, with ~400 cells in the trophoblast, ~60 in the epiblast and ~40 in the hypoblast 22 . Therefore, there is a strong concordance between our estimates and the estimates based on immunostaining of epiblast-committed cells in embryos at the specific developmental stage. Our analysis, however, has the property that we can ascertain their ultimate contribution to organogenesis, not just their expression pattern at the time of specification. The concordance in numbers by the two methods does indeed suggest that the majority of cells expressing epiblast markers do in fact go on to contribute to detectable fractions of cells during organogenesis.
Proceeding further into embryonic development, we infer that at least 70 blood antecedents existed prior to mesoderm specification. Interestingly, the number of mutations shared between HSPCs and lateral, intermediate and paraxial mesoderm-derived tissues compared to endoderm-and ectoderm-derived tissues was similar, indicating that specification of each mesodermal component occurred within a few generations of mesoderm formation (Fig. 3a).
The origin of extra-embryonic mesoderm in humans has been debated for the last five decades, with mouse studies showing it derives from the primitive streak, while histological observations in humans point towards a hypoblast origin 6,77,8 . To address this issue, we compared the scope of shared mutations in the mesenchymal core of the placental villi and peripheral blood to that of tissues of known trophoblast and post-gastrulation origin. We observed that tissues derived from extra-embryonic mesoderm clustered separately from trophoblast-derived and post-gastrulation tissues ( Fig. 3d and Extended Data Fig. 8a). In addition, 90% of trophoblast lineages were already lost in the tree by the fourth generation, in contrast to 25% and 50% of peripheral blood and placenta mesenchyme respectively and 10% of post-gastrulation lineages (Fig. 3e). Our data therefore support the hypothesis that the extra-embryonic mesoderm in humans emerges from the hypoblast 8 .

Relative contributions of cells to germ layers at gastrulation
It is not known whether equal numbers of cells at gastrulation contribute to the tissues of all three germ layers. Interestingly, we found that the pattern of shared mutations in ectodermal tissues was distinct from that observed in mesodermal and endodermal tissues (Fig. 4). In non-haematopoietic mesodermal and endodermal tissues, we detected mutations from the majority of the blood antecedents up until 5-6 generations from the zygote, though each antecedent gave rise to a small proportion of cells from sequenced tissues at 8 pcw ( Figure   4a, 4e and 4g). This implies that many epiblast cells contributed to mesoderm and endoderm development, and made broadly similar, but individually small, fractional contributions to these germ layers.
In ectoderm, however, mutations from fewer blood antecedents were detected, though with each giving rise to a relatively high cellular fraction of sequenced ectodermal microbiopsis ( Figure 4b). Importantly, the individual lines-of-descent for ectoderm were detectable down to the same number of generations as for endoderm and non-haematopoietic mesoderm (typically 6-8 generations): there were just fewer of them. Moreover, following individual linesof-descent from root to tip in the ectodermal deep sequencing showed that cellular contributions frequently stayed constant over several generations. This implies that the different pattern evident for ectoderm was not explicable by varying statistical power of mutation detection or the problem of lineages missing from our phylogeny. Instead, it suggests that relatively fewer cells from the epiblast contributed to the ectodermal tissues we sequenced than to endoderm or non-haematopoietic mesoderm.
A comparable pattern was seen in the 18 pcw foetus with two blood antecedents from around the ninth generation giving rise to a quarter of cells from sequenced ectodermal tissues. A similar limited set of blood antecedents predominated in all three ectodermal tissues that we sequenced (epidermis, hair follicle and peripheral nerve). While we cannot exclude the possibility that ectoderm antecedents migrate to a given anatomic region of the body, prior to their specification to distinct ectodermal derivatives, the more plausible conclusion would be that ectoderm originates from a smaller set of antecedents compared to endoderm and mesoderm.

Trafficking of HSPCs across different haematopoietic sites
The colonisation of the long bones by haematopoietic cells from the liver starts around the 11th week of development 9 . From then on and up until birth, both the liver and the bone marrow are active sites of haematopoiesis. To investigate the kinetics of HSPC migration across early haematopoietic organs in 18 pcw foetus, we collected HSPCs from three separate sites of blood production. Out of the 234 HSPCs analysed, 78 were isolated from the liver, 77 from the bone marrow of femur 1 and 79 from femur 2. We did not observe statistically significant clustering (P=0.24) by organ of origin in the phylogeny, suggesting extensive traffic of haematopoietic stem and progenitor cells between liver and bone marrow (Extended Data Fig.   9a,b).

Discussion
Our data reveal an initial high rate of somatic mutation acquisition, which drops markedly after 3 generations, coinciding with maternal-to-zygotic transition. Although at this stage of development, zygotic transcription is still inactive, the extensive chromatin accessibility 23,24 and dilution of maternally-produced DNA-repair factors in dividing blastomeres could temporarily leave the genome more prone to accumulating mutations in the first generations.
The reconstructed phylogenetic trees revealed an unequal contribution of each blastomere to the blood compartment which is in line with the stochastic nature of early cell fate decisions.
This was particularly striking in the 18 pcw foetus, where just four out of 234 examined blood cells originated from one of the blastomeres. These four cells did not share any mutations with non-haematopoietic tissues of mesodermal and ectodermal origin (Extended Data Fig. 8) suggesting that the progeny of that blastomere had modest contribution to the embryonic tissues by either committing preferentially to trophoblast and hypoblast or dying out.

Ethics and tissue acquisition
Tissues used in this study were isolated from two karyotypically normal male human foetuses

Generation of single-HSPC colonies
Foetal liver tissue was passed through a 70 m strain with cold PBS (Gibco) to generate a single-cell suspension. Bone marrow was isolated by flushing cold PBS into the cavity of the two femurs (for the 18 pcw sample) and passed through a 70 m strain to generate a single cell suspension. Cells were centrifuged at 300 g for 5 minutes at 4°C and resuspended in Red

Tissue processing for LCM
Tissues from the 8 pcw (placenta, heart, kidney, gut, skin and vertebrae) and 18 pcw (skin, kidney and heart) samples were fixed in PAXgene Tissue FIX (Qiagen) for 4 and 24 hours respectively and embedded in paraffin using a Tissue-Tek VIP machine (Sakura). Sections of 5-10 m were cut using a microtome (Leica-RM2265), placed on PEN MembranSlide glass slides with a 2.0 m membrane (Leica) and dried in a laboratory oven (Genlab) at 37°C.
Sections were dehydrated with a series of ethanol (VWR) washes and stained with Mayer's haematoxylin (Sigma) / Accustain Eosin Y (Sigma) before a final xylene (Sigma) cleaning step.
Slides were dried overnight. High resolution imaging was performed using a slide scanner (Hamamatsu). Areas of interest were annotated on tissue scans and isolated with a microscope equipped with a laser-capture microdissection system (Leica LMD7).
Microbiopsies were collected in 96-well plates and lysed using the Arcturus PicoPure Kit (Applied Biosystems) according to the manufacturer's instructions.

Library preparation and sequencing
Library preparation for both the HSPC colonies and LCM microbiopsies was performed as previously described 26 . Paired end sequencing reads (150bp) were generated using either the Illumina NovaSeq® 6000 platform (8 pcw foetus) or Illumina HiSeq® 4000 (18 pcw foetus) resulting in ~22.6× coverage and ~12.2× coverage per colony respectively. BWA-MEM was used to align sequences to the human reference genome (NCBI build37).

Mutation discovery and tree-building from whole-genome sequencing data
Whole-genome sequencing data from single-cell colonies demonstrated variable levels of contamination by mouse DNA from the mouse feeder cell layer. Therefore, sequencing reads were filtered using the 'Xenome' algorithm which tests each read for mapping against the human and mouse reference genomes, and removes reads of clear mouse-origin, or that are ambiguous 27 . Remaining reads were aligned to the human reference genome and singlenucleotide variants (SNVs) and indels were called against an unmatched reference genome using the in-house pipelines CaVEMan and Pindel 28,29 . For all mutations passing quality filters in at least one sample, in-house software (cgpVAF, https://github.com/cancerit/vafCorrect) was used to produce matrices of variant and normal reads at each site for all HSPC colonies from that foetus.
Multiple post-hoc filtering steps were then applied to remove germline mutations, recurrent library prep and sequencing artefacts, and probable in vitro mutations, as detailed below: Filter threshold levels for filter steps 1-4 were set by using a combination of past experience, review of threshold parameter histograms, and experimentation. Mutational signatures and consistency with the phylogenetic tree were used for validation. In general, more stringent thresholds were used than in analyses on adult samples, due to the low number of anticipated true positives.

Phylogenetic tree construction
For purposes of tree building, samples were assigned a genotype for each remaining mutation.
All samples with ≥2 variant reads and a probability > 0.05 that counts came from a distribution expected for a somatic mutation were genotyped as present. Samples with 0 variant reads and a depth of ≥6 at a given site were genotyped as absent. Samples not meeting either of these criteria were genotyped as 'unknown'.
A genotype matrix of shared mutations (a positive genotype ≥1 sample [i.e. that would be informative for tree-building]) was fed into the MPBoot program 30 . This can incorporate 'unknown' genotypes, and constructs a maximum parsimony phylogenetic tree with bootstrap approximation.

Mutation assignment back to the phylogenetic tree
All mutations retained post-filtering were then assigned back to individual branches on the phylogenetic tree using the in-house developed R package 'treemut' (https://github.com/NickWilliamsSanger/treemut). This goes back to the original count data, and uses a maximum-likelihood approach to assign each mutation to a specific branch. Edge lengths are then made proportional to the number of mutations on the branch.

Inference of the mutation rate after the 8-cell-stage embryo
After the third round of cell division (corresponding to the 8-cell-stage embryo), the 8 pcw HSPC phylogeny shows a profusion of high degree polytomies. Specifically, of the five blood antecedents from the 8-cell stage for which we have significant numbers of descendants, there are 12-, 10-, two 5-degree polytomies and a single dichotomy. The true degree of these polytomies may be underestimated if descendant lineages are not captured in the HSPC phylogeny. We ran 1000 simulations of 5 independent cells dividing and acquiring mutations randomly according to a Poisson distribution with lambda corresponding to a specific average mutation rate, and recorded the proportion of simulations that resulted in polytomies at their roots of at least the degrees observed. We repeated these simulations for average mutation rates ranging from 0.5 to 1.5 mutations per cell division. These simulations showed that once the mutation rate rose to 0.9 mutations per cell division, ≤5% of simulations gave results consistent with the data, and once the mutation rate was 1.0, this proportion dropped to ≤1%.

Targeted sequencing
Individual custom DNA capture panels were designed for each foetus according to the manufacturer's guidelines (SureSelect XT Custom DNA Target Enrichment Probes, Agilent). For the 18 pcw foetus, 2137 private SNVs were included in the bait set. Of these, 1770 achieved a depth of ≥8× in the colony in which they were originally called (the "minimum depth" set); 273 achieved a depth of ≥40× in the colony in which they were originally called (the "high depth" set). The read counts from each set were used as input to a binomial mixture model to ascertain the proportion of mutations that were clonal (and therefore likely to be true somatic mutations), sub-clonal (and therefore in vitro mutations), or absent. The binomial mixture model uses an expectation maximisation algorithm to calculate the best combination of between 1 and 4 binomial distributions to fit the observed data. For each individual mutation it assigned the most likely binomial distribution for it to be derived from i.e. was it most likely clonal or sub-clonal. Similar results were obtained with both the "minimum depth" and "high depth" mutation sets (Extended Data Fig. 3).
For the 8 pcw foetus, 3274 private SNVs were included in the bait set. Of these, 1431 achieved a depth of ≥8× (minimum depth set) in the colony in which they were originally called, and 34 a depth ≥40× (high depth set). Otherwise analysis proceeded similarly as per the 18 pcw HSPC colonies.
Indels were included in the 8 pcw capture panel only. There were 246 indels in total: 196 private and 50 shared. Of these, 73 private and 33 shared indels on autosomes achieved a total depth of ≥8× in colonies in which they were called. Applying the binomial mixture model to these read counts showed that counts for both private and shared mutations were consistent with a single binomial distribution with probabilities of 0.49 and 0.475 respectively. Mutation calling was performed using a depth-aware naive Bayesian classifier comparing the two competing models:

Detection of somatic mutations discovered in HSPC colonies in non-
H0: Observed VAF is consistent with sequencing error; versus HA: Observed VAF is consistent with the mutation being truly present in the sample.
The data comprised counts of total depth, $,& , and number of reads, $,& , for the th mutation in the th sample. We estimated the sequencing error rate for the th mutation as coming from a ( 3 , 3 , $,& ) distribution, using deep sequencing data from colonies that were known to be negative for the mutation by their position on the phylogenetic tree. Estimates were derived using the method of moments, empirically weighted for differences in sequence depth across colonies as described by Kleinman 31 . The data for a true mutation were assumed to be drawn from a ( 6 , 6 , $,& ) distribution with 6 = 1.5; 6 = 10. The prior probability of a mutation being present in a tissue, $,& , was set as the proportion of HSPC colonies from the WGS phylogeny harbouring that mutation. Using this and the usual betabinomial probability density function, then, the posterior probability for HA is: For aggregated tissues, the posterior probability of the mutation being present in the tissue as a whole was calculated as one minus the product of the probabilities of the mutation being absent in each individual microbiopsy. The knowledge of the phylogeny structure was then used to clean up calls that did not concord with mutation calls from neighbouring branches.
Mutations that were called as present from distal branches when their ancestral branches were absent were removed; mutations which had been called as absent, but had confident mutation calls from both distal and proximal branches (or from the same branch) had their posterior probabilities elevated. This latter scenario usually occurred for mutations where the depth was unusually low and no positive reads had been found in a given tissue. Trees for figures 3 and 4 were then plotted with the VAF of individual mutations called as present overlayed on the underlying HSPC phylogeny.

Inference of numbers of lineages at different developmental stages
The presence of mutations from the HSPC phylogeny in non-haematopoietic tissues demonstrates that the mutation was acquired in a blood antecedent cell that still had the capacity to differentiate into both blood and the non-haematopoietic tissue in question. By considering a phylogeny including only those branches shared with a specific nonhaematopoietic tissue you can visualize the phylogeny of these multipotent cells. The number of descendant non-shared branches stemming directly from this shared phylogeny is a direct read of the minimum number of more committed HSPC antecedents arising from these multipotent antecedents.
For the early lineage commitment events (ICM commitment and epiblast commitment) our HSPC phylogeny captures the vast majority of antecedents from these time points, and therefore the estimates are likely close approximations of the true values. For the later commitment events, the phylogeny only captures a small fraction of antecedents, and therefore the numbers represent lower bounds on the true value.

Clustering of targeted sequencing results from non-haematopoietic tissues
To evaluate the similarity of the pattern of shared mutations and their VAFs in the different non-haematopoietic tissues we used the soft cosine similarity statistic, a measure used extensively in natural language processing. In this measure, the similarity between two sets of observations is defined as: Here the i and j represent an iteration over all mutations; the ai and bj represent the natural log of the VAF of mutation i in tissue a versus mutation j in tissue b respectively; the sij represents a similarity score for mutation i and j which for our analysis we considered to be the inverse of the 1 + the minimum total branch distance between the two mutations.
The euclidean distance between tissues was then calculated, and the tissues clustered by hierarchical clustering using the hclust function from the "stats" package within R. The calculated dendrogram is displayed alongside the heatmap in Figure 3d.

Measure of lineages lost up to the 4th generation from the zygote
In the 8pcw foetus, we attempted to assign individual mutations to the round of cell division during which they originated. For the first two generations from the zygote (up to the 4-cell stage embryo) it was straight-forward to directly infer this with some confidence because (1) divisions are invariably associated with mutations due to the relatively high mutation rate i.e. Once mutations were assigned to each generation, the total cell fraction captured by our HSPC phylogeny for mutations acquired in the 1st -4th cell divisions was calculated by summing the cell fractions for each branch in a given generation. The uncertainty in these cell fractions was calculated by bootstrapping the read counts from each branch 1000 times, repeating the sum of cell fractions across branches for each bootstrap, and obtaining the 0.025 and 0.975 quantiles from the resulting distribution for a 95% percentile interval.

Construction of the phylogenies with branch lengths scaled by generations
As described above, assignment of branches to specific generations is relatively straightforward for the first three generations. However, this becomes more challenging once there are frequent polytomies and fewer descendant cells from each branch.
To create an approximation of a "generation" tree where branches are placed along the y axis according to the generation in which the mutations arose we did the following: (1) Branches were manually assigned to generations as previously for the first 3 generations (2) Branches from polytomies were extended according to log2 of the number of descendants e.g. for an 8-degree polytomy, each branch would be extended to length 3, the average number of generations. In reality, in this example, individual branches may result from 1 to 7 generations from the antecedent, depending on the true phylogeny structure underlying the polytomy.
(3) Terminal branches were extended to make the tree ultrametric.
For the analysis of the lineages represented through different generations (Figure 4a-b), single branches from polytomies were therefore considered as belonging to multiple generations.
This approach means that the absolute proportions of represented lineages by generation are inaccurate as different assumptions will falsely increase/ decrease these proportions.
However, for comparison between tissues it remains a useful measure.

Analysis of mutation distribution in the genome
To evaluate whether the mutation distribution across different genomic features was comparable to that expected by chance we compared our data to a simulation of random mutagenesis within the human genome.
In this simulation the bedtools random command was used to generate 160,000 random sites within the human genome. From these, 5000 were subsampled such that the 32 possible trinucleotide contexts were represented in the same proportions as in the data. The central pyrimidine within each trinucleotide was then virtually "mutated" to an alternative base, again according to probabilities matching the mutational signatures of the data to generate a vcf file of 5000 mutational-signature matched mutations. The resulting genomic locations were annotated using the VAGrENT algorithm (https://github.com/cancerit/VAGrENT) whereby each mutation is assigned a Sequence Ontology term to classify its consequence.
This simulation was repeated 500 times and the distribution of the resulting proportions of mutations in different Sequence Ontology categories was compared to those observed in the true mutation sets.

Analysis of molecular variance to look for phylogenetic clustering by anatomic location
Analysis of molecular variance (AMOVA) was used to test for phylogenetic clustering by anatomical location in the 18 pcw foetus. We compared HSPCs that had been isolated from the liver, femur 1, and femur 2.
The phylogeny was first made ultrametric using a bespoke method that gives more weight to the lengths of early branches. The mutational distance between any two HSPC colonies was then calculated (this is the minimum mutational "walk" to get from one sample to the other) for all possible colony-colony pairs. The sum of squared distances for "within location" and "between location" pairs was calculated, divided by their degrees of freedom, and this was used to obtain an estimate of the proportion of variation explained by differences in cell locations (the F statistic). P values were obtained by randomly re-labelling the cell locations 30,000 times and calculating the same statistic. The given p value is therefore a proportion of the random re-assignments that had a more extreme F statistic than that observed in the data.   Extended Data Fig. 9 | High level of intermixing of 18