Integrative Single-cell RNA-Seq and ATAC-Seq Analysis of Human Developmental Haematopoiesis

Regulation of haematopoiesis during human development remains poorly defined. Here, we applied single-cell (sc)RNA-Seq and scATAC-Seq analysis to over 8,000 human immunophenotypic blood cells from foetal liver and bone marrow. We inferred their differentiation trajectory and identified three highly proliferative oligopotent progenitor populations downstream from haematopoietic stem cell/multipotent progenitors (HSC/MPPs). Along this trajectory, we observed opposing patterns of chromatin accessibility and differentiation that coincided with dynamic changes in the activity of distinct lineage-specific transcription factors. Integrative analysis of chromatin accessibility and gene expression revealed extensive epigenetic but not transcriptional priming of HSC/MPPs prior to their lineage commitment. Finally, we refined and functionally validated the sorting strategy for the HSC/MPPs and achieved around 90% enrichment. Our study provides a useful framework for future investigation of human developmental haematopoiesis in the context of blood pathologies and regenerative medicine.


Introduction
During embryonic development, haematopoietic stem cells (HSCs) need to rapidly differentiate into mature blood cells. Our current knowledge of foetal haematopoietic stem and progenitor cells (HSPCs) has been mainly advanced by murine and in vitro model systems. It has been demonstrated that foetal haematopoiesis consists of several, separate waves of specification, migration, and differentiation of rare HSCs at distinct organs during development (Ivanovs et al., 2017). In humans, definitive haematopoiesis starts with the appearance of HSCs within haematopoietic clusters, in the dorsal aorta, at 27 days post-conception. These definitive HSCs first colonise the foetal liver at 4 post-conceptional weeks (pcw) where they expand in numbers. At 10.5 pcw, the haematopoietic site shifts once more to the cavities of bones (i.e., bone marrow), where adult haematopoiesis is established permanently. The first HSCs that seed the bone marrow are thought to continue to rapidly increase in numbers before undergoing a dramatic change in their proliferative and differentiation properties to accommodate the need for high production of differentiated progeny (Mikkola and Orkin, 2006).
Historically, differentiation processes in the haematopoietic system have been depicted as a series of intermediate steps, defined by panels of cell surface markers (i.e., cluster of differentiation, CD). In this model, often represented as a "haematopoietic tree", HSCs give rise to increasingly lineage-restricted cell types, eventually leading to mature blood cells (Akashi et al., 1999) (Weissman, 2000). This paradigm shifted in the last five years with several studies reporting the transcriptomes of thousands of single haematopoietic cells, isolated by cell surface markers, both in the mouse model and in adult humans (Paul et al., 2015) (Velten et al., 2017. These reports showed that progenitor populations, previously thought to be homogeneous, are actually very heterogeneous on the transcriptional level. The mechanisms underlying early fate decisions in HSCs are largely unknown. It has been postulated that the stochastic expression of lineage-specific transcription factors (TFs) above the noise threshold can "lock" a cell into a distinct cell fate (Graf and Enver, 2009). In line with this, co-expression of genes associated with antagonistic lineages, including key TFs, have been observed in multipotent haematopoietic cells, albeit at low levels (Hu et al., 1997) (Miyamoto et al., 2002). This points towards the presence of sub-populations of cells within the multipotent compartment that are permissive for opposing cell fates prior to their lineage commitment, a phenomenon referred to as priming (Nimmo et al., 2015). More recently, singlecell RNA sequencing (scRNA-Seq) of human HSPCs introduced a different concept of priming.
Studies of the adult bone marrow and foetal liver haematopoiesis identified sub-populations of haematopoietic stem cells and multipotent progenitors (HSC/MPPs) with a coordinated expression of marker genes, specific for distinct uni-lineage differentiation programmes, that gradually increased along all differentiation branches (Velten et al., 2017) (Popescu et al., 2019. In addition, there are some indications that lineage priming in the HSC compartment might be happening not only on the transcriptional but also at the epigenetic level (Nimmo et al., 2015). Data from single-cell Assay for Transposase Accessible Chromatin sequencing (scATAC-Seq) of phenotypic HSPCs from the adult human bone marrow show that phenotypic multipotent progenitors have variations in chromatin accessibility consistent with a bias towards erythroid and lymphoid lineages (Buenrostro et al., 2018).
Here we performed an integrative analysis of scRNA-Seq and scATAC-Seq of more than 8,000 immunophenotypic HSPCs, from 17-22 pcw human foetal liver, femur, and hip to define transcriptional and epigenetic changes during blood differentiation. We explored lineage priming at the transcriptional and chromatin level in HSC/MPPs and refined the sorting strategy for the isolation of a highly enriched HSC/MPP population.

Single-cell transcriptome of the haematopoietic compartment in human foetal liver and bone marrow
To capture the full repertoire of haematopoietic cells during foetal development, we single-cell sorted phenotypically defined blood populations, from matched (i.e., from the same individual) foetal livers, femur, and hip (iliac) bones, between 17 and 22 pcw ( Figure 1A). Cells from the liver, hip, and femur were sorted and processed independently in all experiments. Thus, each cell can be traced back to the foetus and organ it came from. We used a hierarchical approach, where we first isolated non-committed (Lin-[CD3, CD8, CD11b, CD14, CD19, and CD56] CD34+ CD38-) progenitors, that contain all immature haematopoietic populations and are present at the frequency of less than 0.1% of the total foetal bone marrow (Golfier et al., 2008), followed by a more restrictive panel to capture differentiated and mature cell types. We next isolated committed (Lin-, CD34+ CD38+) progenitors as well as phenotypic HSCs, multipotent progenitors (MPPs), common myeloid progenitors (CMPs), megakaryocyte-erythroid progenitors (MEPs), granulocyte-monocyte progenitors (GMPs), and common lymphoid progenitors (CLPs). In addition, based on broad phenotypic markers, we sorted T cells, NK cells, innate lymphoid cells (ILCs), monocytes, dendritic cells, mast cells, basophils, neutrophils, eosinophils, erythroid progenitors, erythrocytes, immature megakaryocytes  Figure 1B). Finally, we identified a cluster of mature B cells, expressing high levels of IGHM and decreased levels of IGLL1, compared to pro/pre B clusters ( Figure 1B). We did not detect any T cells or ILCs in the liver or in the femur in spite of sorting phenotypic T cells and ILCs using broad cell surface markers for these populations. Unlike B cells that mature in the BM, T cells derive from lymphoid progenitors that migrate from the BM to the thymus, where they complete their maturation. The development of ILCs is less understood but there have been suggestions that ILC precursors migrate early on from BM into non-haematopoietic tissues, e.g., gut (Cichocki et al., 2019). Since we only sorted BM and not thymus or gut, we might have captured only progenitors but not T cells and ILCs. By using a Deep Neural Network (DNN) (LeCun et al., 2015) and the top 30 marker genes for each cluster, we were able to correctly classify the cells to the prospective clusters with 90.46% accuracy, confirming that our manual annotation of clusters separated well the distinct cell types/states (see Methods, Supplementary figure 4A).
In the last decade, human HSCs and other progenitor populations have been isolated and used in functional assays based on specific sets of cell surface markers. It has been suggested that the foetal haematopoietic progenitor compartment differs substantially from its adult counterpart (Notta et al., 2016). Our approach allowed us to compare the extent to which the phenotypic identity of cell populations (as defined by CD markers) matched their transcriptional state, i.e., our manually curated clusters and thus to critically examine the use of CD markers in the context of foetal bone marrow haematopoiesis.
Single-cell analysis revealed substantial transcriptional heterogeneity within all immunophenotypically-defined stem and progenitor populations, with some phenotypic progenitor populations such as HSCs, MPPs, CMPs, GMPs, MEPs, and CLPs being comprised of more than ten different transcriptionally-defined populations. ( Figure 1C, Supplementary figure 4B). This observation is in agreement with recent research showing a high level of heterogeneity of the progenitor compartment of human cord blood (Knapp et al., 2018). Taken together, our comparative analysis shows that currently used cell-surface markers are a poor predictor of the transcriptional state of human foetal haematopoietic progenitors.

Inference of differentiation trajectories during foetal haematopoiesis
Next, we used a Force-Directed Graph drawing algorithm, ForceAtlas2, to infer the differentiation trajectory of haematopoietic cells during human foetal development (Jacomy et al., 2014). We initialised a ForceAtlas2 layout with Partition-based Approximate Graph Abstraction (PAGA) coordinates from our annotated cell types (Wolf et al., 2019). This initialisation generated an interpretable single-cell embedding that is faithful to the global topology. The obtained global topology revealed HSC/MPPs at the tip of the trajectory ( Figure   2A-B, Supplementary figure 3). HSC/MPPs showed high expression of MLLT3, a crucial regulator of human HSC maintenance (Calvanese et al., 2019), HLF, a TF involved in preserving quiescence in HSCs (Komorowska et al., 2017) and MEIS1, a TF involved in limiting oxidative stress in HSCs, which is necessary for quiescence (Unnisa et al., 2012) (Wang et al., 2018. Cells in this cluster also expressed high levels of surface markers of HSPCs such as CD34, (Morisot et al., 2006), SELL (Ivanovs et al., 2017), and PROM1 (de Wynter et al., 1998) (Saha et al., 2020 (Figure 1B and 2C). Downstream of HSC/MPPs, we identified three distinct, highly proliferative, oligopotent progenitor populations. We used Scanpy's dpt function to infer the progression of the cells through geodesic distance along the graph. Then, we used Scanpy's paga_path function to show how the gene expression and annotation changes along the three main paths (MEMPs, GP, and LMPs) that are present in the abstracted graph ( Figure 2C).
MEMPs connected HSC/MPPs with megakaryocytes, erythroid, and mast cells. In line with this, differentially regulated genes in the HSC/MPPs transition to MEMPs included megakaryocyte/erythroid/mast cells lineage-specific genes such as GATA1, ITGA2B, PLEK, KLF1,HDC,and MS4A3,(Figure 1B,2C,Supplementary figure 5B). Presence of MEMPs in our dataset is consistent with studies in mouse models proposing a common trajectory between erythroid, megakaryocytic, and mast cell lineages (Franco et al., 2010). This concept was more recently supported by a study in human foetal liver showing a shared progenitor of megakaryocyte, erythroid, and mast cells (Popescu et al., 2019). In addition, we identified a proliferative population of MEMPs-Cycle of which ~92% were in the G2M/S phase compared to 65% of MEMPs ( Figure 2E). MEMPs-Cycle population further upregulated erythroid-specific genes such as KLF1, BLVRB, and TFRC compared to MEMPs suggesting their gradual commitment towards erythroid lineage (Supplementary figure 5C).
GPs connected the HSC/MPP cluster with granulocyte clusters. Cells in this cluster differentially expressed myeloid lineage-specific genes (e.g., AZU1, LYZ, and MPO) compared to HSC/MMPs ( Figure 2C, Supplementary figure 5D and were highly cycling, with 73% of cells in the G2M/S phase ( Figure 2E). Finally, our data pointed towards the existence of a common progenitor population for B cells, monocytes, pDCs, and NK cells, here annotated as lymphoid-myeloid progenitors (LMPs). Cells in this cluster expressed genes specific to those lineages, including IGLL1, HMGB2, and CD79B (lymphoid) ( Figure  Our findings support previous studies on early lymphoid commitment in human cord blood, both in vitro and in vivo, which identified a shared lineage progenitor between lymphoid, NK, B, and T cells, monocytes, and dendritic cells (Doulatov et al., 2010) (Collin et al., 2011).
Interestingly, the LMP cluster had higher expression of MPP-related genes such as SPINK2, CD52, and SELL compared to MEMPs, suggesting that these progenitors represent a more immature population compared to MEMPs (Supplementary figure 5F).
Next, we used the Python implementation of Single-Cell rEgulatory Network Inference and Clustering (SCENIC) (Aibar et al., 2017, Van de Sande et al., 2020 to identify master regulators and gene regulatory networks (GRN) in HSPC and mature blood cells across differentiation trajectories. We found 162 regulons of which some were enriched across many different cell types, often as a part of the particular differentiation branch, and some were celltype specific ( Figure 2D). We identified HLF and HOXA9 as main regulons in HSC/MPPs, whereas GATA1, GATA2, and TAL1 were identified in the MEMP branch of the haematopoietic tree ( Figure 2D). FOXO3 was highly specific for erythroid cells whereas EOMES, OLIG2, and IRF8 for NK cells, monocytes, and pDCs, respectively. Importantly, the regulons confirmed the inferred differentiation trajectory.
To further explore heterogeneity within the HSC/MPP population we examined whether HSC/MPP cells simultaneously primed several different lineage-affiliated programs of gene activity. While HSC/MPPs sporadically expressed lymphoid, myeloid, or megakaryocyteerythroid differentiation genes, we did not observe consistent expression of antagonistic lineage-affiliated genes in individual cells. In addition, after further sub-clustering the HSC/MPPs, there was no evident consolidation of lineage-affiliated transcriptional programs in any of the sub-populations (Supplementary figure 6). Our scRNA-Seq data, thus, do not support recently reported transcriptional lineage priming in the foetal HSC/MPP compartment (Popescu et al., 2019) and suggest that, transcriptionally, our HSC/MPP cluster represents a highly immature population of cells. Previous research showed that, contrary to adult blood progenitors that are mainly unilineage, foetal liver blood progenitors maintain multilineage potential (Notta et al., 2016). Our data are consistent with this observation and point towards the existence of three oligopotent progenitor populations downstream of the HSC/MPP compartments: MEMPs giving rise to erythroid, megakaryocytes, and mast cells, GPs differentiating into granulocytes and LMPs generating lymphoid, monocytes and dendritic cells.

scATAC-Seq of foetal non-committed progenitors (CD34+ CD38-)
Detection of low abundant transcripts, such as TFs, might be difficult in scRNA-Seq data due to technical limitations of the approach, leading to false negatives (so-called drop-outs). The activity of these TFs can be inferred, however, from chromatin accessibility, emphasizing the importance of approaches integrating scRNA-Seq and scATAC-Seq data. In addition, chromatin accessibility at regulatory regions might precede gene activity and thus have predictive value for future transcription of a gene. Therefore, to further investigate the regulatory events in the very immature cell populations, we examined the single-cell chromatin accessibility landscape (using scATAC-Seq) of human foetal Lin-CD34+ CD38-cells (see Methods). We sequenced 4,001 cells from the liver and femur of three foetuses, 18, 20, and 21 pcw (see Methods). Based on our scRNA-Seq data, we expected that 90% of captured cells would be associated with one of the six populations: HSC/MPPs, HSC/MPPs-Cycle, MEMPs, MEMPs-Cycle, GPs, and LMPs, with HSC/MPPs(Cycle) constituting the majority (Supplementary figure 4B).
To capture peaks that are present in less abundant cell types such as MEMPs, MEMPs-Cycle, GPs, and LMPs, we employed an iterative peak-calling approach. We first defined open chromatin regions by pooling all the data and calling peaks in the pooled samples. Following dimensionality reduction with Diffusion maps (Haghverdi et al., 2015) and clustering using the Louvain community detection algorithm (Blondel et al., 2008), we performed a second round of peak calling in the clusters with more than 50 cells. Out of the initial ~474,000 reads, after

Motif accessibility dynamics along the inferred differentiation trajectories
In order to merge samples and remove the batch effects, we applied Harmony (Korsunsky et al., 2019;Luecken et al., 2020) on the first 50 Latent Semantic Indexing (LSI) components, excluding the first one because it was highly correlated to the sequencing depth (Supplementary figure 2P-Q). By using a shared nearest neighbour (SNN) modularity optimization based clustering algorithm, we obtained seven distinct clusters of differentially accessible peaks ( Figure 3A).
To explore the chromatin accessibility profiles across the seven clusters, we examined the accessibility of selected marker genes from our scRNA-Seq data ( Figure 3B). We observed higher accessibility of marker genes associated with stem cells (e.g., MLLT3, PROM1, FLI1, and GATA2) and lower accessibility of genes associated with distinct lineages (e.g., MPO, ALAS2, MPEG1, and CD19), keeping in line with the undifferentiated nature of sorted cells ( Figure 3B). Interestingly, we observed a clear separation of clusters in terms of their overall accessibility of marker genes, with clusters 1, 2, 4, and 7 being more accessible and clusters 3 and 5 being less accessible. Cluster 6 had a mixed signature ( Figure 3B).
Extensively open chromatin in multipotent cells has been previously associated with a permissive state to which multiple programmes of gene regulation may be applied upon differentiation and is considered important for the maintenance of pluripotency (Gaspar-Maia et al., 2011). To further investigate if there were global dynamic changes in accessibility patterns associated with the differentiation of foetal HSC/MPPs, we inferred differentiation pseudotime from our scATAC-Seq data using the same approach as with scRNA-Seq described above. Briefly, we built a Force-directed Graph from our seven scATAC-Seq clusters by initialising a ForceAtlas2 layout with PAGA coordinates ( Figure 3C-D). The generated trajectory revealed two branches with a clear trend between chromatin accessibility and differentiation in each branch ( Figure 3D-E). We observed the highest accessibility in clusters 1, 2, and 4 that gradually decreased towards the tips of the two branches (i.e., clusters 1-2 -3 on one side, and 1-4-5-6 and 1-6 on the other), ( Figure 3E). This result is compatible with the notion that clusters 1, 2, and 4 represent HSC/MPP population.
Control of gene expression is a dynamic process that involves both the cell-type-specific expression of TFs and the establishment of an accessible chromatin state that permits binding of TFs to a defined motif. Thus, to assess regulatory programs that are active in HSPCs, we used chromVAR (Schep et al., 2017) to calculate the most variable accessible TF sequence motifs in different clusters and examine their activity along the differentiation trajectory. Along the two branches identified by the trajectory inference, we observed dynamic changes in the accessibility of lineage-specific haematopoietic TF motifs such as GATA1, TAL1, KLF1, HTF4, ID4, IRF8, TFE2 ( Figure 3F-H).
GATA1 activity ( Figure 3G-H) and gene-body accessibility ( Figure 3B) were enriched in cluster 6. GATA1 is known to be an important regulator of erythroid, megakaryocytic, and mast cells differentiation (Katsumura et al., 2017) and was exclusively expressed in the MEMP cluster, in our scRNA-Seq dataset. Thus, the identified trajectories between clusters 1-6 and clusters 1-4-5-6 most likely represent the MEMP differentiation paths ( Figure 3D). Interestingly, in cluster 6, compared to clusters 2 and 3, we detected opposing patterns of motif accessibility for the two different TAL1 binding sites (TAL1.0.A and TAL1.1.A, respectively) ( Figure 3F-H). Substantial changes in occupancy by TAL1 during differentiation have been observed, which are dependent on its binding partners (Wu et al., 2014). It has been previously reported that TAL1.0.A was co-occupied by TAL1 and GATA1 (Kassouf et al., 2010) whereas TAL1.1.A by TAL1 and TCF3 (Hsu et al., 1994). Our analysis, thus, revealed that the two different TAL1 binding motifs are active in distinct haematopoietic progenitor populations during foetal haematopoiesis ( Figure 3F-H).
Clusters 2 and 3 also showed increased activity of CEBPD and IRF8, crucial for myeloid and dendritic cell differentiation and of ID4 and HTF4, which are involved in the establishment of the lymphoid lineage (Miyazaki et al., 2017) (Figure 3F-H). This points towards clusters 1, 2 and 3 forming a common initial trajectory between the myeloid and lymphoid fate, consistent with our observations in scRNA-Seq data. Cluster 1 and 4 were characterised by a high level of activity of TFs of the NF-kB pathway (i.e., NF-kB2, REL, and RELB), ( Figure

Integrating scRNA-Seq and scATAC-Seq data
Next, we wanted to map the cells from our scATAC-Seq data to specific cell types. Since, currently, no chromatin accessibility maps are available for human foetal HSPCs, we chose a strategy to integrate our scRNA-Seq and scATAC-Seq by mapping cells based on their gene body accessibility. We used a recently developed method which identifies pairwise correspondences (termed "anchors") between single cells across two different types of datasets, and their transformation into the shared space (Stuart et al., 2019). This approach allowed us to transfer scRNA-Seq derived annotations, learned by a classifier, onto scATAC-Seq data (see Methods).
The frequency of assigned cell types in the scATAC-Seq data set was highly concordant with the ones from scRNA-Seq data (Supplementary figure 4B) suggesting that overall the two modalities, i.e., chromatin accessibility and transcriptome are correlated. To validate the cell type assignment of scATAC-Seq cells, we examined the accessibility of selected lineagespecific TF motifs in each of the annotated cell types ( Figure 4B). In line with the predicted annotations, the GATA1 motif showed the highest accessibility in MEMPs and MEMPs-Cycle, whereas TEF2 (known to play a role in myeloid and lymphoid differentiation, (Miyamoto et al., 2002)) was most active in GPs and LMPs. Confirming our earlier observation, two distinct TAL1 motifs had anticorrelated accessibility. TAL1.0.A was preferentially active in MEMPs and MEMPs-Cycle, while TAL1.1.A in GPs and LMPs ( Figure 4B).
The Force Atlas representation of the classified scATAC-Seq cells revealed, however, considerable intermixing of different cell types across the trajectory with enrichment of MEMPs/MEMPs-Cycle in cluster 6 and to a lesser extent of GPs and LMPs in clusters 2 and 3 ( Figure 4C). HSC/MPPs(Cycle) were distributed across all seven clusters. This wide distribution of HSC/MPPs(Cycle) across multiple clusters within scATAC-Seq data suggested that, even though chromatin accessibility and the transcriptional state of foetal HSC/MPPs are correlated, there is extensive chromatin priming in the HSC/MPP population that results in their heterogeneity.
Next, we compared the accessibility of selected lineage-specific TF motifs in HSC/MPPs across the seven clusters ( Figure 4D-G). We observed a low level of activity of all examined TFs in cluster 1 followed by a statistically significant increase of HTF4, ID4, and TFE2 and decrease of GATA1, in HSC/MPPs in clusters 2 and 3. GATA1 activity, however, increased in HSC/MPPs in cluster 6. Our data suggest that, within the transcriptionally homogeneous population of HSC/MPPs, there are significant differences in the activity of specific TFs that may precede gene expression and mark initial priming of HSC/MPPs prior to their commitment to the specific lineage. To explore further this "time lag" between the chromatin accessibility and gene expression during differentiation we examined scRNA-Seq and scATAC-Seq data for the top GATA1-regulon target genes (ranked based on the AUCell score) identified by pySCENIC ( Figure 5). We looked at the accessibility of both gene promoters (+/-3 kb from TSS) and distal regulatory regions (+/-50 kb from the TSS) as well as the expression levels of the selected target genes along the MEMP differentiation trajectory ( Figure 5A). We observed that promoters of GATA1-regulon target genes were often open in HSC/MPPs prior to any noticeable gene expression ( Figure 5A). Thus, in line with our previous observation, chromatin accessibility in HSC/MPPs preceded transcriptional changes that were only present in more differentiated cells. Interestingly, promoter accessibility of GATA1 target genes was overall lower in cluster 6 (MEMPs) compared to cluster 1 (HSC/MPPs) ( Figure 5B, D, E), and coincided with lower promoter co-accessibility of the antagonistic genes (i.e. genes that are specific for distinct lineages), ( Figure 5F). In contrast, the accessibility of distal regulatory elements/enhancers had higher accessibility in cluster 6 compared to cluster 1 ( Figure 5C).
This may indicate that in GATA-regulon genes may be primed at promoters while it is the enhancers that contribute the cell-type-specific expression.

Validation of HSC/MPP identity and their differentiation capacity
Given the observed limitation of commonly used sorting markers to isolate pure progenitor populations, we devised a new FACS sorting strategy for HSC/MPPs based on cell-surface markers selected from the top 20 marker genes for this cluster in our scRNA-Seq dataset. The To assess the differentiation potential and robustness of the lineage output of CD-REF cells we sorted individual cells from three foetuses on either mouse MS5 feeder layer or on a more physiologically relevant, primary human foetal mesenchymal stem cells (fMSCs) ( Figure 6C, see Methods section). After two weeks, 80% of cells sorted on MS5 and 85% of cells sorted on human foetal fMSCs generated colonies. In total, we analysed 201 colonies for their size and lineage output (erythroid-Ery, myeloid-My, megakaryocytic-Mk, lymphoid-Ly) using FACS (see Methods and Supplementary figure 7A). Our FACS analysis revealed that 7% of colonies on MS5 and 8% on fMSCs were quadri-lineage, 43% and 31% were tri-lineage, 29% and 25% were bilineage, 20% and 28% were unilineage and 1% and 8% were undifferentiated colonies, respectively ( Figure 6D).

Comparative analysis of HSC/MPP cells from different haematopoietic organs
Cells in the HSC/MPP cluster originated from the liver, femur, and hip. This provided a unique opportunity to assess potential qualitative and quantitative differences in the HSC/MPP population that originated from foetal liver or bone marrow. We first applied Fisher's exact test on the number of liver and femur cells in the different cell cycle states to determine whether there are non-random associations between the cycle state and the organ of origin (see Methods for further details). Interestingly, there was a statistically significant difference (p-value=4.25⨉10 -9 ) in the cell cycle state of cells in the HSC/MPP cluster between femur and liver ( Figure 7A). Cells in the femur were predominantly in G1 (~70% of cells) compared to the same population in the liver (~52%) ( Figure 7B). These data suggest that HSC/MPPs become more quiescent as they migrate from the liver to the bone marrow during the second trimester of human development. In line with this, HSC/MPP cells were significantly less frequent in femur compared to the liver ( Figure 7D), as confirmed by Fisher's exact test on the total number of liver and femur cells ( Figure 7E). This is in agreement with the increased proportion of phenotypic non-committed progenitors (CD34+ CD38-) found in the liver compared to the bone marrow ( Figure 7C). Using Mki67 and DAPI staining we quantified the proportion of cells in the different stages of the cell cycle: G0 (Mki67-DAPI-), G1 (Mki67+DAPI-), S-G2-M (Mki67+DAPI+), as previously described (Kim and Sederstrom, 2015). Our analysis showed that the CD34+CD38-population is less cycling in both foetal liver and femur compared to CD34+CD38+ population ( Figure 6E-F). Furthermore, we further showed that the vast majority of CD-REF cells are in G0/G1 in both femur and liver but that nearly twice as many cells are in S-G2-M in the liver compared to the femur ( Figure 6G).
In order to evaluate if there is a statistically significant difference in the number of expressed genes between HSC/MPPs collected from the liver and femur, we used both the Kolmogorov-Smirnov (KS) and Mann-Whitney-Wilcoxon (MWW) test. We applied a subsampling strategy to downsample the cluster with more cells and balance the two distributions (see Methods).
KS and MWW revealed a statistically significant decrease in the number of expressed genes in HSC/MPPs in the femur compared to the liver ( Figure 7F-G). Gene-set enrichment analysis, using pathway databases, of differentially expressed genes between the liver and femur revealed that HSC/MPPs in the femur up-regulate genes involved in nucleosome assembly, chromatin assembly, and DNA packaging such as HIST1H1E and HIST1H2BN, possibly marking their entry into quiescence ( Figure 7H). Interestingly, DE analysis of genes that encode membrane proteins revealed statistically significant upregulation of genes related with actin cytoskeleton remodelling, cell adhesion, and migration (e.g., JAML, SELPLG, LCP1, MSN, RHOA) in HSC/MPPs in the liver compared to the femur ( Figure 7I). This would be in line with the higher propensity of liver HSC/MPPs to migrate to other tissues such as bone marrow. In addition, we detected higher expression of interferon-induced gene IFITM1 in foetal liver, known to play a role in the transduction of antiproliferative and adhesion signals ( Figure   7J). This shift of HSC/MPPs from highly proliferative to quiescent as well as downregulation of genes involved in actin cytoskeleton remodelling, as they migrate from foetal liver to bone marrow, signifies the role of the niche in the modulation of HSC/MPPs behaviour.

Discussion
Here, we present an integrative analysis of the single-cell transcriptome and chromatin accessibility of human foetal HSPCs. Our strategy involved plate-based sorting of well-defined immunophenotypic HSPCs from matched foetal liver and bone marrow. This approach enabled us to go beyond cataloguing heterogeneity of cellular states during foetal haematopoiesis and to: i) examine the extent to which phenotypic markers used over the last decade coincided with the true nature of the sorted foetal blood populations, ii) refine the sorting strategy for HSC/MPPs, iii) identify cell-cycle and gene expression differences between HSC/MPPs from foetal liver and bone marrow, iv) infer HSPCs differentiation trajectory, and v) explore lineage priming within the HCS/MPP population.
In doing so, we observed a striking level of heterogeneity in all immunophenotypic HSPCs, with more than ten transcriptionally-defined cell populations identified in each of the progenitor Overall, our study has provided a high resolution transcriptional and chromatin accessibility map of foetal HSPCs from the liver and bone marrow that will be essential for further exploration of HSC/MPPs in the context of blood pathologies and for the purpose of regenerative medicine.

Limitations of the study
In this study, we characterised human foetal liver and bone marrow haematopoiesis using a combination of single-cell transcriptomics/epigenetics and in vitro single-cell differentiation assays. In order to avoid perturbations caused by freezing and thawing cycles, all experiments were performed on freshly-isolated tissues. Such experimental design and the nature of analysed tissues come with a few limitations: 1) samples are rare and 2) the cellularity varies significantly between different stages of development and individual foetuses, especially in the bone marrow. As a result, the number of cells available for the analysis was limited. For this reason, we were not able to obtain enough cells to perform xenotransplantation experiments to confirm the stem cell identity and self-renewing potential of our CD-REF cells, collected from bone marrow. Instead, we used single-cell in vitro assays as an alternative, albeit not optimal, readout of the multilineage potential of a cell. In addition, we could only collect a limited number of distinct phenotypically defined populations from individual foetuses and, therefore, for any given population, the number of analysed samples was relatively low.

Declaration of Interests
The authors declare no competing interests.

Lead Contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Dr Ana Cvejic (as889@cam.ac.uk).

Materials Availability
This study did not generate new unique reagents.

Data and Code Availability
The raw RNA-Seq data (i.e., fastq files) and cell assignment are deposited at ArrayExpress with accession code E-MTAB-9067, while the raw ATAC-Seq data (i.e., fastq files) and cell assignment are deposited at ArrayExpress with accession code E-MTAB-9068.
All scripts, functions, and Jupyter Notebook developed for this study are freely available on GitLab: https://gitlab.com/cvejic-group/integrative-scrna-scatac-human-foetal. The repository also contains the gene expression and fragment matrices.

Ethics and Tissue acquisition
Human foetal bone and liver samples were obtained from 33 foetuses aged 17-22 pcw, following termination of pregnancy and informed written consent. The human foetal material

Tissue processing
Tissues were kept in cold DMEM medium (Invitrogen) until dissection and processed on the same day of collection. Single-cell suspensions were generated from matched foetal liver and bone tissues after rinsing them with cold PBS (Gibco). Liver samples were passed through a 70 µm strain into a falcon tube prefilled with cold PBS. Bone marrow from long bones was isolated by flushing cold PBS into the diaphysis and collected into a falcon tube. Bone marrow from hip bone was collected by dissecting the bone with a sterile scalpel and flushing cold PBS in the marrow cavity into a falcon tube. The suspension obtained from long bones and hip bones was then passed through a 70 µm strain into a new falcon tube. Cells were then centrifuged for 5 minutes at 300 g, 4°C and the pellet was resuspended into the RBC lysis buffer (eBioscience) for 2 minutes at room temperature, after which 20 ml of cold PBS were added to stop the lysis reaction. RBC step was not performed when sorting erythroid cells.

Fluorescence-activated cell sorting
Cells were stained with antibody cocktails in a total volume of 100 µl 5% FBS (Gibco) in PBS for 30 minutes at 4°C, centrifuged for 5 minutes at 300 g, 4°C, resuspended in a final volume of 500 µl of 5% FBS in PBS and subsequently filtered into polypropylene FACS tubes (ThermoFisher). For scRNA-Seq experiments, single cells were index sorted using a BD Influx Sorter into wells of 96-well plates (4titude) prefilled with 2 µl of lysis buffer consisting of 0.2% Triton X-100 (Sigma) and 1 U/µl RNAse inhibitor (Life Technologies) in nuclease-free water (Invitrogen). For scATAC-Seq experiments, 5,000 -20,000 cells were sorted using a BD Influx machine into 1.5 ml tubes (Eppendorf). Following bulk tagmentation with Tn5 (Chen et al., 2018), single nuclei were index sorted in wells of 384-well plates (Eppendorf) prefilled with 2 µl of lysis buffer consisting of 0.2% SDS, 20 µg/ml proteinase K (Ambion), 50 mM Tris-HCl (Gibco) and 50mM NaCl (Sigma) in nuclease-free water.

Library preparation
The Smart-Seq2 method (Picelli et al., 2014) was used for library preparation for the scRNA-Seq experiments, with some modifications as described in (Macaulay et al., 2016). The quality of libraries was evaluated with Bioanalyzer (Agilent). Good-quality libraries were subsequently quantified with KAPA Library Quantification Kit (Roche) and submitted for sequencing. Library preparation for the scATAC-Seq experiments was performed using a recently described method (Chen et al., 2018). Library traces were evaluated using Bioanalyzer.

Sequencing
Libraries for scRNA-Seq experiments were multiplexed using Nextera Index sets A, B, C, and D (v.2, Illumina) and sequenced on HiSeq4000 and NovaSeq6000 (Illumina) in pair-end mode, with an interquartile range (IQR) of 697,427 uniquely mapped reads (average: 666,632; standard deviation: 557,274). Libraries for scATAC-Seq experiments were sequenced on HiSeq4000 in pair-end mode, with a mean read count of 473,886 and IQR 341,210.

Upstream analysis of scRNA-Seq data
Smart-Seq2 sample demultiplex fastq files were quality checked, aligned and quantified by using the scRNA-Seq pipeline. This pipeline is based on STAR with default parameters (v.2.5.4a) (Dobin et al., 2013) index and annotation from the Ensembl release 91 of the GRCh38 human reference genome. Transcript and gene counts were quantified using the option quantMode GeneCounts provided by STAR. Since we used different sets of welldefined antibodies to isolate different cell types, we applied specific thresholds for each sample to filter out both the cells and genes (Supplementary table 3). We detected on average 3,642 genes per cell (IQR: 2,239; standard deviation: 1,621).

Downstream analysis of scRNA-Seq data
In what follows, for each function that we applied, we only report the parameter settings we modified. All other parameter settings of the functions are the default ones provided by the used computational libraries.
We performed the downstream analysis of scRNA-Seq using the Python (v.3.6.9) package SCANPY (v.1.4.5.1) (Wolf et al., 2018). Our pipeline included: 1) a QC step (number of identified counts and number of expressed genes using the filter_cells function, and the fraction of mitochondrial genes). We obtained the 4,504 cells that were used in the next steps (Supplementary table 2 DEGs were computed by using rank_genes_groups function (Wilcoxon rank-sum with adjusted p-values for multiple testing with the Bonferroni correction), which compares each cluster to the union of the rest of the clusters. The clusters that either did not express specific cell type genes or expressed marker genes of different cell types had been iteratively subclustered. Specifically, we applied the Leiden algorithm (leiden function with resolution equal to 0.5) to subcluster Endothelial cells, obtaining four distinct clusters: the first two clusters have been annotated as Monocytes 2, the third as NK cells and the fourth as Endothelial cells. Finally, we used the Leiden algorithm (leiden function with resolution equal to 0.5) to cluster the Unspecified cluster getting four clusters. We merged three clusters with the HSC/MPP cluster while one was annotated as Unspecified.

Dimensionality reduction of scRNA-Seq data
After the detection of the first 1,000 HVGs, we applied scAEspy to HVG space by setting alpha and lambda equal to 0 and 2, respectively, in order to obtain the Gaussian Mixture Maximum For a given subset of cells from the biggest group and the smallest one, we calculated the DEGs by applying the rank_genes_groups function (Wilcoxon rank-sum with adjusted p-values for multiple testing with the Benjamini-Hochberg correction). Then, we filtered out the obtained DEGs by using the filter_rank_genes_groups function (min_in_group_fraction equal to 0.3 and max_out_group_fraction equal to 1, so that a gene is expressed in at least 30% of the cells in one of the two tested groups; min_fold_change equal to 0). Following the aforementioned workflow, we compared cells from the liver and femur from the same cluster.
Finally, we analysed HSC/MPPs and HSC/MPPs-Cycle to see which genes contributed to the observed difference between cells from femur and liver.
We also carried out a DE test to compare the expression of cell surface proteins in HSC/MPPs from femur and liver cells. As a first step, we selected 1) genes that encode CD molecules, 2) transmembrane genes available in CellPhoneDB (Efremova et al., 2020), and 3) genes that encode plasma membrane proteins from Uniprot (key KW-1003). Then, we applied the subsampling strategy comparing HSC/MPPs from femur and liver cells. Finally, we calculated the DEGs by considering only the genes that are expressed in at least 30% of the cells in one of the two tested groups.

Differentiation pathway analysis
We performed a gene-set enrichment analysis, using pathway databases, comparing liver and femur HSC/MPP cells. Firstly, we calculated DEGs by comparing liver and femur applying the Specifically, we applied the profile function provided by the Python GProfiler package for both liver and femur cells. As a query set, we used the liver (or femur) DEGs while as background we used the genes that are expressed in at least 30% of liver (or femur) cells. We also set the following parameters required by the profile function: organism equal to homo sapiens; sources equal GO terms, KEGG, and Reactome; domain_scope equal to custom_annotated; significance_threshold_method (i.e., the correction method for the p-values) equal to bonferroni; user_threshold (i.e., the threshold for the corrected p-values) equal to 0.01.

Cell type classification
We trained both a Random Forest classifier (Pedregosa et al., 2011) and a DNN to predict the cell types by considering the top 5, 10, 20, 30, 50, and 100 marker genes for each cluster using the log-normalised expression. Since some marker genes are shared among the clusters, we considered them only once to avoid duplicated columns in the feature matrices.
We As a further test, we evaluated the ability of our DNN to generalise on unseen data. We split the dataset into a train set (80%) and a test set (20%) (Scikit-learn train_test_split function with test_size equal to 0.2). We then divided the train set into a train set (85%) and a validation set (15%, (train_test_split function with test_size equal to 0.15). We trained our DNN with the train set, validating it using the validation set. When we took into account the top 30 marker genes, we achieved an accuracy equal to 91.50% on the validation set. When considering the top 50 marker genes the accuracy was 91.68%. Finally, we predicted the labels of the test set by obtaining an accuracy equal to 90.46% (30 marker genes) and 90.23% (50 marker genes).

Upstream analysis of scATAC-Seq data
We performed the upstream analysis using the samtools ( Next, we transformed bam files to bed files using bamtobed bedtools function in bedpe mode and kept only fragments that are no bigger than 1000 bp using a custom script. We called peaks (for the clusters with more than 50 cells) using the SnapATAC approach (Fang et

Downstream analysis of scATAC-Seq data
The downstream analysis was done in R 3.6.1 applying Seurat (Butler et al., 2018)

Dimensionality reduction of scATAC-Seq data
We applied the UMAP algorithm to the first 50 LSI components corrected by Harmony (RunUMAP function with umap.method equal to uwot and n.neighbors equal to 10, FindNeighbors function setting annoy.metric to cosine). We identified seven distinct clusters by using the Seurat function FindClusters (resolution equal to 0.5).

Trajectory analysis of scATAC-Seq data
We inferred the development trajectories by applying PAGA and FDG. We recalculated the neighbourhood graph using the SCANPY neighbors functions (n_neighbors equal to 30) on the 50 LSI components corrected by Harmony. We computed the PAGA graph (SCANPY paga function with model equal to v1.0) and used it to initialise the FA2 algorithm (SCANPY draw_graph function using cluster 1 as root and maxiter equal to 1,000).

Integration of scRNA-Seq and scATAC-Seq data
We integrated scRNA-Seq and scATAC-Seq data using a recently developed method by Stuart et al. (Stuart et al., 2019). Namely, we used our scRNA-Seq data as reference dataset to train the classifier and automatically assign a cell type to each scATAC-Seq cell. The training of the classifier was performed using 511 CD34+ CD38-cells from our scRNA-Seq experiment. In order to have a suitable number of cells for each cell type to train the classifier, we considered scRNA-Seq clusters with at least 20 cells (i.e., HSC/MPPs, HSC/MPPs-Cycle, MEMPs, MEMPs-Cycle, GPs, and LMPs). We generated a gene expression matrix from our scATAC-Seq data set by assigning each peak to the gene by considering the genome coordinates of the gene body ± 3 kb. We applied the Seurat function FindTransferAnchors (query.assay equal to RNA_promoter, features equal to the counts of the RNA_promoter, and k.anchor equal to 6) on the Canonical Correlation Analysis (CCA) space because it was more suitable, compared to the LSI space, for capturing the shared feature correlation structure between scRNA-Seq and scATAC-Seq data. We assigned the cell types to the scATAC-Seq cells by applying the Seurat TransferData on the first 50 LSI components corrected by Harmony considering the calculated anchors (refdata equal to the six scRNA-Seq clusters). In order to avoid assignments based on a low score, all cells with the prediction score lower than 40% (the value of a uniform distribution of six clusters is 16,67%) were labelled as unknown.

Transcription factor regulons prediction
To run SCENIC workflow on our raw scRNA-Seq data, we used an in-house constructed Snakemake pipeline via combining Arboreto package GRNBoost2 and SCENIC algorithms with default parameters. To predict transcription factor regulons, we used human v9 motif collection, as well as both hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather and hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather databases from the cisTarget (https://resources.aertslab.org/cistarget/). The resulting AUC scores per each cell and adjacency matrix were used for downstream analysis and visualization.

Isolation of human foetal MSCs
Human primary fMSCs were isolated from the femur of a 19 pcw sample following an established protocol used for mouse bones (Perpétuo et al., 2019). Briefly, the bone was rinsed in PBS and the bone epiphyses cut with a scalpel. The bone marrow was flushed with 50 ml PBS, centrifuged at 300 g for 5 minutes, resuspended in alphaMEM medium (Thermo Fisher Scientific) supplemented with 2 mM L-glutamine (Thermo Fisher Scientific), 100 U/ml penicillin/streptomycin (Thermo Fisher Scientific) and 10% fetal bovine serum (Sigma) at a concentration of 5x10 6 cells/ml and cultured at 37°C at 5% CO2. After 24 hours, floating cells were removed by washing twice with PBS and medium was changed twice a week until the culture was 70% confluent. Cells were cryopreserved until use.

Differences across cell types
In order to assess qualitative and quantitative differences between the haematopoietic cells collected from the liver and femur, we implemented different statistical tests. For each cluster, we calculated if there is a statistically significant difference in the number of cells (Test 1), the number of expressed genes per cell (Test 2), and the cell cycle state of blood cells collected from liver and femur (Test 3).

Test 1. Since we used different gates to sort cells and we sorted a different number of cells in
each experiment, we first normalised the number of cells from liver and femur. We selected only the matched gates (i.e., the gates where we sorted haematopoietic cells from both liver and femur). Then, we selected cells from the liver (or femur) from each gate in each of the clusters. For each cluster, we normalised the number of cells inside the cluster in the range [0, 100] by dividing the number of cells for the total number of cells of the gate in order to obtain a number of cells equal to 100. Next, for each cluster, we calculated the median of the cells in the liver (or femur) among the different gates. In order to evaluate if there is a statistically significant difference between the number of cells in the liver and femur considering all the clusters, we applied the ChiSq test by normalising the distributions (i.e, the median of the gates of each cluster) of the cells from liver and femur among the clusters. We applied the chi2_contingency function provided by the Python SciPy. Since the obtained pvalue is equal to 1.02×10 -4 , we applied Fisher's exact test (SciPy fisher_exact function) to each cluster to find which clusters contributed to the difference.