Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion

Understanding complex tissues requires single-cell deconstruction of gene regulation with precision and scale. Here, we assess the performance of a massively parallel droplet-based method for mapping transposase-accessible chromatin in single cells using sequencing (scATAC-seq). We apply scATAC-seq to obtain chromatin profiles of more than 200,000 single cells in human blood and basal cell carcinoma. In blood, application of scATAC-seq enables marker-free identification of cell type-specific cis- and trans-regulatory elements, mapping of disease-associated enhancer activity and reconstruction of trajectories of cellular differentiation. In basal cell carcinoma, application of scATAC-seq reveals regulatory networks in malignant, stromal and immune cells in the tumor microenvironment. Analysis of scATAC-seq profiles from serial tumor biopsies before and after programmed cell death protein 1 blockade identifies chromatin regulators of therapy-responsive T cell subsets and reveals a shared regulatory program that governs intratumoral CD8+ T cell exhaustion and CD4+ T follicular helper cell development. We anticipate that scATAC-seq will enable the unbiased discovery of gene regulatory factors across diverse biological systems. A droplet-based single-cell ATAC-seq method enables rapid mapping of chromatin accessibility in tens of thousands of cells.


Introduction
Tissues are comprised of a complex collection of cell types, each type specialized to its functional role. Understanding this inherent functional complexity often requires technologies that measure properties of single cells, rather than of the system as a whole. This concept is exemplified in the immune system, where effective responses to a wide array of pathogens or cancer are orchestrated by the sequential activity of more than a hundred specialized cell types. While recent studies have developed a number of robust technologies to measure RNA or protein expression in single cells, technologies for assessing chromatin accessibility in single cells are comparatively lacking.
Cell type-specific gene expression in eukaryotic cells is regulated by millions of cis-acting DNA elements (e.g. enhancers and promoters) and thousands of trans-acting factors (e.g. transcription factors [TFs] and regulatory RNAs) 1 . Functionally distinct cell types exhibit distinct gene expression programs that are brought about by changes in the activity of cis-elements through their interplay with trans-factors. We previously developed the assay for transposase-accessible chromatin using sequencing (ATAC-seq), which enables the enumeration of active DNA regulatory elements by direct transposition of sequencing adapters into accessible chromatin with the hyperactive transposase Tn5 2 . This method can reveal several layers of gene regulation in a single assay, including the genome-wide identification of cis-elements, inference of TF binding and activity, and nucleosome positions [2][3][4] . Importantly, ATAC-seq is applicable to low-cell number samples 5 , and even single cells 6,7 , which has enabled epigenomic profiling of primary samples with newfound precision. In studies to date, single-cell ATAC-seq (scATAC-seq) has been used to map cell-to-cell variability and rare single-cell epigenomic phenotypes across diverse biological processes, including in healthy and malignant immune cells [8][9][10][11][12] . However, the widespread adoption of this technique has been hindered by the difficulty and cost of performing the assay at scale while maintaining high data quality.
Here we report a method to perform scATAC-seq in nanoliter-sized droplets, which enables the generation of high-quality single-cell chromatin accessibility profiles at massive scale. To demonstrate the performance and utility of this method, we analyzed primary cells in two biological contexts. First, we mapped the single-cell chromatin accessibility landscape of blood formation in bone marrow and blood samples from healthy humans. This landscape revealed diverse chromatin states of progenitor cells and the regulatory trajectories of their differentiation into effector cell types. Second, we performed scATAC-seq in primary tumor biopsies from patients with basal cell carcinoma (BCC) receiving anti-programmed cell death protein 1 (PD-1) immunotherapy (PD-1 blockade). Single-cell deconvolution of the tumor microenvironment (TME) revealed distinct types of immune, stromal, and malignant cells, and analysis of intratumoral T cells identified epigenetic regulators of therapy-responsive T cell subtypes, including CD8 + exhausted (TEx) and CD4 + T follicular helper (Tfh) cells. Altogether, we report scATACseq profiles of over 200,000 cells, demonstrating that this platform enables the unbiased discovery of cell types and regulatory DNA elements across diverse biological systems.

Droplet-based platform for measuring single-cell chromatin accessibility
We developed a method to perform scATAC-seq in droplets using the Chromium platform (10X Genomics) previously employed to measure single-cell transcriptomes 13 or paired single-cell transcriptomes and T cell-or B cell-receptor sequences 14 (Fig. 1a,  Supplementary Fig. 1a). In this scATAC-seq approach, nuclei are first isolated from a single-cell suspension and transposed in bulk with the transposase Tn5. Transposed nuclei are then loaded onto an 8-channel microfluidic chip for Gel bead in Emulsion (GEM) generation. Each gel bead is functionalized with newly-designed single-stranded barcoded oligonucleotides that consists of: (1) a 29 basepair (bp) sequencing adapter, (2) a 16bp barcode selected from ~750,000 designed sequences to index GEMs, and (3) the first 14bp of Read 1N, which serves as the priming sequence in the linear amplification reaction to incorporate barcodes to transposed DNA (Supplementary Fig. 1a). In each channel, ~100,000 GEMs are formed per experiment, resulting in the encapsulation of tens of thousands of cells in GEMs per microfluidic chip. Approximately 80% of GEMs contain a single gel bead, and nuclei are loaded at a limiting dilution to minimize cooccurrence of multiple cells in the same GEM. After GEM generation, gel beads are dissolved, and the oligonucleotides are released for linear amplification of transposed DNA. Finally, the emulsion is broken, and barcoded DNA is pooled for PCR amplification to generate indexed libraries, which are compatible with high-throughput sequencing. In the sequencing reaction, reads 1N and 2N contain the DNA insert, while the index reads, i5 and i7, capture the cell barcodes and sample indices, respectively.
To assess the performance of this method, we generated scATAC-seq libraries from species-mixing experiments, in which we pooled human (GM12878) and mouse (A20) B cell nuclei. The nuclei from each cell type were first transposed in bulk, and then an even mixture of each cell type was loaded at various loadings onto microfluidic chips. The resulting libraries were sequenced and processed through a single-cell analysis pipeline to de-multiplex reads, assign cell barcodes, align fragments to the human or mouse reference genome, and de-duplicate fragments generated by PCR amplification (Cell Ranger ATAC; Methods). We first evaluated the quantity and quality of all scATACseq data, regardless of species of origin, and used previously described cut-offs of 1,000 unique nuclear fragments per cell and a transcription start site (TSS) enrichment score of 8 to exclude low-quality cells from further downstream analysis (Methods) 15 . Single cells passing filter yielded on average 23.24 x 10 3 unique fragments mapping to the nuclear genome, and approximately 40.5% of Tn5 insertions were within peaks present in aggregated ATAC-seq profiles from all cells, comparable to previously published highquality scATAC-seq and bulk ATAC-seq profiles (Fig. 1b-c and Supplementary Fig.  1b) 6,10,15 . Accordingly, scATAC-seq profiles exhibited fragment size periodicity and a high enrichment of fragments at TSSs, and aggregate profiles from multiple independent experiments were highly correlated ( Fig. 1d and Supplementary Fig. 1c). Finally, analysis of mouse and human fragments in single cells confirmed a low-rate of estimated multiplets (12/1,159 cells, ~1%; Fig. 1e). A cell titration experiment with four different cell loading concentrations showed a linear relationship between the observed multiplet rate and the number of recovered cells (~1% multiplets per 1,000 cells recovered), consistent with Poisson loading of nuclei and similar observations for single-cell transcriptomes in droplet platforms (Fig. 1e) 13,16 . Therefore, to minimize potential multiplets, we typically aimed to capture <6,000 nuclei per channel.

Validation of rare cell detection and performance in complex archival samples
We next sub-sampled scATAC-seq data in silico to assess the technical performance of scATAC-seq with different data quantity (fragments/cell) and cell number (Supplementary Fig. 1d-e). A comparison of each sub-sampled profile to the aggregate profile from all data demonstrated that a relatively complete list of cis-elements could be identified by aggregating ~200 cells and at least 10,000 fragments/cell. Aggregate profiles from ~200 cells could achieve the confident discovery of ~80% of ATAC-seq peaks from total profiles, and an overall Pearson correlation of r~0.9 for all reads in peaks (Supplementary Fig. 1d). With this information in mind, we devised a data analysis workflow for peak calling and clustering (Supplementary Fig. 1f, Methods). In this workflow, single-cell libraries were first processed with Cell Ranger and filtered. Next, we performed an 'initial' clustering analysis by partitioning the genome into 2.5 kb windows and counting Tn5 insertions in each window, as described previously 7,9 . We then performed single-cell Latent Semantic Indexing (LSI) and clustered cells using Shared Nearest Neighbor (SNN) clustering (SNN; Seurat 17 ) with the top 20,000 accessible windows, requiring that each cluster contain at least 200 cells. These 'initial' clusters were then used to identify confident ATAC-seq peaks (using MACS2 18 ) in single-cell clusters and to generate a merged peak set that represented the full epigenetic diversity of all cells. Finally, a cell-by-peak counts matrix was created and used for 'final' single-cell clustering and downstream analysis, in which each cluster could contain any number of cells (Methods).
We tested this analysis approach with two quality-control experiments. First, we generated a series of synthetic cell mixtures, in which human monocytes were isolated from peripheral blood mononuclear cells (PBMCs) and mixed with sorted human T lymphocytes in various ratios spanning a 1,000-fold detection range ( Supplementary  Fig. 2a-b, Supplementary Table 1). We then performed scATAC-seq and determined whether we could resolve each population in an unsupervised analysis. As expected, scATAC-seq analysis of 50:50 mixtures identified two distinct populations of cells, which demonstrated high accessibility of open chromatin regions linked either to monocytespecific genes (i.e. CD14, CSF1R, TREML4), or to T cell-specific genes (i.e. CD3E, CD4, CD8A; Supplementary Fig. 2a and Methods). Importantly, this analysis could also resolve small populations, which represented either 1/100 or 1/1,000 of total cells, demonstrating that even rare cell-types could be identified in this manner (Supplementary Fig. 2b). Second, we compared the performance of scATAC-seq analysis in fresh versus frozen PBMCs (Supplementary Fig. 2c-f). We isolated nuclei from either fresh PBMCs, viably-frozen PBMCs, or viably-frozen PBMCs that were sorted for live cells and performed scATAC-seq (Methods). As seen in previous datasets, we confirmed that scATAC-seq profiles passing filter yielded approximately the same quantity and quality of ATAC-seq data, regardless of sample origin, and that aggregate profiles from fresh and frozen cells were highly correlated (Supplementary Fig. 2c-d) 11 . Importantly, cell type-specific clusters from each PBMC sample clustered together, and frozen samples recapitulated the majority of ATAC-seq peaks discovered in fresh samples (AUC: 0.809; Supplementary Fig. 2e-f). Altogether, these results quantify the high similarity of scATAC-seq data generated in different batches and across different sample preparation conditions and demonstrate the ability to discover rare single cells in heterogeneous mixtures.

Unbiased single-cell accessible chromatin landscape of human hematopoiesis
To further demonstrate this method in primary samples, we performed experiments in human immune cells. The immune system relies on the continuous differentiation of hematopoietic stem cells (HSCs) into functionally-specialized cell types, a process which is tightly controlled by the expression of cell-type-specific genes and maintained by epigenetic programs (Fig. 2a). Since this system has been extensively studied using single-cell methods, we reasoned that it might be an ideal system to validate the interpretation of scATAC-seq data at scale. We generated scATAC-seq libraries from peripheral blood (PB) and bone marrow (BM) cells from 16 healthy individuals and sampled cells both in an unbiased fashion, analyzing total PB and BM cells, or after cell sorting to enrich for certain cell phenotypes (Supplementary Fig. 3a and Supplementary Table 2). The purpose of this sampling strategy was two-fold: (1) to validate scATAC-seq-defined cluster identities by standard cell surface marker phenotypes, and (2) to uncover additional heterogeneity within surface-marker defined populations with single-cell measurements. In total, we generated high-quality scATACseq profiles from 61,806 cells, and cells passing filter yielded on average 15.6 x 10 3 fragments mapping to the nuclear genome, and approximately 40.5% of Tn5 insertions were within peaks identified in the aggregate ATAC-seq profile from all cells (Supplementary Fig. 3b-c). Importantly, the quality of scATAC-seq profiles was highly uniform across individuals, samples, and cell types, and the number of fragments per cell and TSS enrichment scores were on par with high-quality scATAC-seq datasets in primary immune cells generated with other technologies (Supplementary Fig. 3d-e) 11,12 .
We clustered scATAC-seq profiles with LSI followed by SNN clustering and visualized clusters with uniform manifold approximation and projection (UMAP), a nonlinear dimensionality-reduction technique that preserves local and aspects of global inter-cluster relationships 19 . We identified 31 scATAC-seq clusters, which we classified using three parallel approaches: (1) chromatin accessibility of individual cis-elements (ATAC-seq peaks), (2) gene activity scores, which were computed from the accessibility of several enhancers linked to a single gene promoter 20 , and (3) TF activity, as computed from the accessibility of thousands of TF binding sites genome-wide in each single cell 4 .
All three approaches are based on a 'bottom-up' analysis of scATAC-seq data and do not require prior knowledge from RNA-seq datasets or reference bulk ATAC-seq profiles.
Using the first approach, we identified a total of 571,400 equal-width nonoverlapping cis-elements across all scATAC-seq clusters, and approximately 20.4% of elements (116,713) exhibited significant cell-type specific accessibility in a single cluster (mean 6,208 peaks per cluster, FDR < 0.01; Methods). Annotation of cell types through the identification of neighboring genes to cluster-specific cis-elements demonstrated that scATAC-seq profiles spanned the continuum from early hematopoietic progenitors to endstage cell types (Fig. 2b-c). For example, Clusters 2-4 demonstrated accessibility at ciselements neighboring key myeloid progenitor genes, including KIT, GATA1, TAL1, and SPI1, while Clusters 14-16 demonstrated accessibility at cis-elements neighboring B cell genes, including CD19, EBF1, and LYN ( Fig. 2c and Supplementary Fig. S4a). Importantly, clustering of scATAC-seq profiles could identify relatively rare cell types, such as basophils, as well as known cell type distinctions and subsets, such as the distinction between CD4 + and CD8 + T cells, and the presence of phenotypically-distinct T cell subsets, such as regulatory CD4 + T cells (Treg; Fig. 2b-c). Moreover, scATAC-seq analysis identified unique cell-type specific regulatory elements even within a single gene locus. For example, although the transcription factor IRF8 is critical for the function of many immune cell types, we observed unique accessibility of the +85kb and +87kb enhancers in the IRF8 locus in myeloid cells, and of the +54kb and +56kb enhancers in plasmacytoid dendritic cells (pDCs), while the +37kb enhancer was accessible in nearly all immune lineages (Fig. 2d). These findings are in line with previously-identified Irf8 super-enhancers in dendritic cells 21 , and potentially inform the cellular impact of genetic variants associated with autoimmune disease present in this locus 22 .
Although cis-element analysis can be informative, this measurement is naturally sparse in single cells, as it is limited by the DNA copy number (2 alleles per element in a diploid genome). Therefore, in the second analysis approach, we asked whether we could further support cluster identities using gene activity scores (henceforth referred to as 'gene scores'), which are computed as the normalized aggregate accessibility of several enhancers linked to a single gene promoter 20 . We first identified all enhancer-promoter (E-P) connections genome-wide with Cicero, an algorithm that links pairs of DNA elements based on co-accessibility in scATAC-seq data 20 . This method identified 149,309 total E-P connections across all scATAC-seq clusters, with a median of 6 enhancers linked to each gene promoter (Methods). We independently validated E-P connections obtained from this analysis using two orthogonal datasets from primary human immune cells. First, we compared E-P connections identified with Cicero to chromosome conformation signal obtained from H3K27ac HiChIP experiments in T cells 23 and found that enhancers linked to gene promoters showed significant enrichment for HiChIP enhancer interaction signal (EIS), compared to neighboring genomic regions (Supplementary Fig. 4b). Second, we compared Cicero E-P contacts to expression quantitative trait loci (eQTLs; Genotype-Tissue Expression [GTEx] database 24 ) and found enrichment of eQTLs in linked contacts, particularly when eQTLs were also identified in immune cells (Supplementary Fig. 4c; Methods). Together, these results confirm that globally, E-P contacts identified with this method are supported by three-dimensional genome conformation and by functional perturbations of enhancers.
We next calculated single-cell accessibility at E-P connections for each gene and projected gene scores for immune lineage-defining genes onto scATAC-seq clusters ( Fig.  2e and Methods). Indeed, we found that gene scores supported prior cis-elementdefined cluster identities, and that many of these scores were relatively robust in each cell within a cluster. For example, the CD34 gene score identified hematopoietic progenitors, the CD14 gene score identified monocyte and myeloid dendritic cells (mDCs), and the CD20 gene score identified B cells ( Fig. 2e and Supplementary Fig.  S4d-e). Again, this analysis robustly identified rare cell types, for example demonstrating high IL13 gene scores in basophils, and immune cell subsets, for example identifying high FOXP3 scores in Tregs (Fig. 4e). Across all single cells, we identified 5,977 gene scores that exhibited scATAC-seq cluster-specific activity, reflecting known and novel markers for each cell type (Supplementary Fig. 4d).
Finally, in the third analysis approach, we measured chromatin accessibility at all cis-elements sharing a common feature (such as a TF binding motif) using chromVAR 4 . To demonstrate the viability of this method, we analyzed aggregate scATAC-seq cluster profiles for expected differences at binding sites for known cell type-specific TFs. Indeed, genome-wide accessibility at binding sites for GATA2, a lineage-determining factor for megakaryocyte, erythrocyte, and basophil lineages 25 , was increased in megakaryocyteerythroid progenitors (MEPs), in basophils, and in common myeloid progenitors (CMPs), as expected (Fig. 2f). Similarly, the accessibility of binding sites for EBF1, the lineagedetermining factor for B cells 26 , was increased in naïve, memory, and plasma B cells, as well as in early B cell progenitors (Fig. 2f). Since DNA bound by TFs is protected from transposition by Tn5, visualization of each TF profile showed local chromatin accessibility changes surrounding the binding 'footprint' (Fig. 2f and Supplementary Fig. 5a). Next, we computed the genome-wide accessibility for all TF motifs in each single cell using chromVAR (referred to as 'chromVAR TF deviation'), which revealed shared and unique regulatory programs across immune cell types (Fig. 2g-h and Supplementary Fig. 5b). For example, mDCs and B cells shared activity at sites containing BCL11A, SPI1, and IRF factor motifs, but demonstrated unique activity at sites containing motifs for CEBP factors and EBF1, respectively ( Fig. 2g-h). Similarly, TBX21-and EOMES-bound sites were active in both NK and T cell populations; however, only T cells showed accessibility at sites containing motifs for the T cell lineage-determining factor TCF7 ( Fig. 2g-h) 27 .
We asked whether we could similarly group cis-elements according to the presence of disease-associated genetic variants, rather than TF motifs, in order to nominate cellular determinants of autoimmune disease. To achieve this, we used a list of fine-mapped causal variants associated with 21 autoimmune diseases and 18 nonimmune diseases 22 and generated a feature set for each disease that consisted of variantcontaining ATAC-seq peaks. To increase the statistical power of this analysis, we also included all co-accessible elements to those containing causal variants (identified using Cicero; Supplementary Fig. 5c). We then measured chromatin accessibility change in all variant-associated regions genome-wide in each single cell to nominate potentially causal cell types in each disease (chromVAR; Supplementary Fig. 5c-d). Strikingly, this analysis revealed several distinct patterns of cell type-specific accessibility change among autoimmune diseases. As previously observed, several diseases, such as celiac disease, Type 1 diabetes, Crohn's disease, and juvenile arthritis, primarily showed high accessibility of variant-containing putative enhancers (henceforth termed 'variantenhancers') in T cell populations (Supplementary Fig. 5d) 22 . Other diseases, such as Kawasaki disease, multiple sclerosis, and systemic lupus erythematosus, showed high accessibility of variant-enhancers in B cells -either specifically, or in addition to accessibility in T cells -as previously described (Supplementary Fig. 5d) 22 . However, the comprehensive scale of scATAC-seq cell types enabled the discovery of novel patterns. For example, variant-enhancers associated with systemic sclerosis showed high accessibility in mature NK cells and pDCs, and variant-enhancers associated with ulcerative colitis showed high accessibility in mDCs and monocytes, consistent with the suggested impact of these cell types in murine models of each disease 28,29 . Additional diseases with high variant-enhancer signals in myeloid cells included a number of metabolic traits and diseases, such as fasting glucose and HDL cholesterol levels, and Type 2 diabetes, suggesting that cell-specific enhancers nominate regulatory roles for immune cells in these processes as well. The association of individual disease variants with cell type-specific enhancers could be confirmed by H3K27ac HiCHIP measurements in primary cells (Supplementary Fig. 5e). Altogether, these results demonstrate that scATAC-seq data may be analyzed without guidance from bulk data to identify cell types and their associated chromatin accessibility landscapes, and to examine the enrichment of these landscapes for polymorphisms associated with human disease.

Regulatory trajectories of immune lineages
Since this dataset included progenitor and effector cell types, we asked whether the density of scATAC-seq data could be used to reconstruct cellular developmental trajectories in an unbiased manner, without the use of pre-defined markers. As a test case, we aimed to reconstruct the lineage trajectory of plasma B cell differentiation, since: (1) the entirety of this developmental program occurs in the bone marrow and blood and thus ought to be captured in our dataset, and (2) the regulatory mechanisms of this process are relatively well-defined for comparison (Fig. 3a). To achieve this, we used a nearest-neighbor approach on existing cluster definitions (Fig. 3a). We started with the plasma B cell cluster (Cluster 16) and attempted to return to the HSC cluster (Cluster 1) by sequentially selecting precursor-cell relationships with the most epigenetic similarity to the cluster of interest (computed from Euclidean distances of ATAC-seq profiles; Methods). For example, the nearest neighbor to the memory B cell cluster (Cluster 15) was the naïve B cell cluster (Cluster 14). The nearest neighbor to naïve B cells was the Pre-B cell cluster, and so on (Fig. 3b). Indeed, this reverse reconstruction process correctly identified the well-established cellular trajectory of plasma B cell development as the most significant trajectory among all tested trajectories (p<0.0002; 5,000 permutations). Finally, we generated an ordering of single cells (henceforth referred to as 'pseudotime') along this trajectory by computing a pseudotime vector across lineage clusters and aligning each cell to the vector in the UMAP projection ( Fig. 3c and Methods).
We next utilized this single-cell pseudotime to identify stage-specific activities of cis-elements and trans-factors during plasma B cell differentiation. An analysis of ~10,000 cis-elements with dynamic accessibility patterns across the trajectory revealed specific dynamic regulatory elements near several known regulators of every stage of B cell development (Fig. 3d). For example, cis-elements that were accessible early in the trajectory included enhancers for EBF1, RUNX1, IL7R, RAG2, and MEF2C, factors that are critical for B cell lineage specification and diversity (Fig. 3d) 26,30,31 . Conversely, ~3,500 cis-elements that were accessible late in the trajectory included elements proximal to PRDM1, a critical transcription factor for plasma cell fate, and the plasma cell-specific marker SDC1 (CD138). We also determined whether examination of chromVAR deviation scores across this developmental ordering of trans-factors could identify critical TFs involved in B cell development. Since chromVAR TF deviation scores calculated from scATAC-seq data can reflect the potential activity of many TFs with similar DNA-binding motifs, we integrated chromVAR deviations with dynamic cis-element gene scores to prune the data for the activity of specific TFs within a motif family (Fig. 3e). Indeed, this method accurately identified several previously-identified TFs that are critical for B cell differentiation (Fig. 3e). More importantly, this method could resolve fine differences in the timing of TF activity. For example, MEF2C activity was observed early in B cell development -at the stage of common lymphoid progenitors (CLPs) -consistent with its demonstrated role in lymphoid versus myeloid fate specification 31 . Immediately after the induction of MEF2C activity, we observed the sequential activity of EBF1, PAX5, and IRF4, recapitulating the known order of their physiological functions in pro-B cells, pre-B cells, and naïve B cells, respectively ( Fig. 3f) 26 . Altogether, these results indicate that pseudotime ordering of scATAC-seq data can be used to accurately identify cis-and trans-regulatory factors controlling cellular differentiation.
We next applied trajectory analysis to the early stages of hematopoiesis to identify TF regulators of myeloid fate decisions, particularly of dendritic cells (DCs) -a relatively rare population of cells sparsely sampled in prior epigenomic studies. We first re-clustered 16,015 progenitor and DC scATAC-seq profiles (defined in Fig. 2) and 2,074 profiles of surface marker-defined progenitors, generated in a previous study (Fig. 3g) 11 . We identified 16 sub-clusters, and the projection of the sorted scATAC-seq profiles onto de novo-defined clusters revealed regulatory relationships between progenitors and significant heterogeneity in marker-defined states ( Fig. 3h-i). Globally, immune lineages appeared to diverge early via three distinct branches to: (1) megakaryocyte/erythroid (Meg/E) and basophil/eosinophil (Bas/Eo) fates, (2) lymphoid fates, or (3) neutrophil/monocyte/DC fates. However, sorted progenitors did not always occupy a single de novo-defined regulatory state. For example, classically-defined CMPs (Lineage -CD34 + CD38 + CD10 -CD45RA -CD123 mid ) were present in four de novo-defined clusters, including in committed pathways leading to either neutrophil/monocyte/DC fates (Clusters 2 and 11), Meg/E fates (Clusters 4-5), or Baso/Eo fates (Clusters 3-4; Fig. 3h-i). This result suggests that CMPs are heterogeneous, and that while CMPs as a population contain lineage potential for all myeloid fates, the vast majority of single cells within this population are already biased toward specific lineages. Similarly, granulocytemacrophage progenitors (GMPs; Lineage -CD34 + CD38 + CD10 -CD45RA + CD123 mid ) demonstrated a similar phenomenon and were present in four clusters downstream of the CMP (Clusters [11][12][13][14], including those leading to neutrophil differentiation, as well as previously unrecognized clusters leading to mDC and pDC fates ( Fig. 3h-i).
We used pseudotime ordering to identify TF trajectories in progenitor clusters, again with a specific focus on DC development (Fig. 3j). This analysis revealed both shared and unique TF programs across myeloid fates. For example, while Meg/E and Bas/Eo progenitors shared accessibility at GATA2 motifs, Bas/Eo commitment was characterized by SPI1 (PU.1) and CEBPA motif activity, while Meg/E commitment was characterized by MYB, GATA1, and KLF1 motif activity, as previously described ( Fig.  3k) 32,33 . Similarly, while neutrophil progenitors shared increased accessibility at SPI1 motifs with Bas/Eo progenitors, neutrophil commitment was accompanied by additional activity of AP-1 and CEBP motifs, and of RARA (Fig. 3k). Finally, the analysis of trajectories towards DC fates revealed three distinct possible developmental trajectories. The mDC pathway transitioned through CMP and GMP clusters, and then to Cluster 13 (monocyte-dendritic cell progenitor; MDP) and Cluster 14 (common dendritic progenitor; CDP), prior to terminal mDC differentiation. This trajectory showed accessibility at IRF8, IRF4, BCL11A, SPI1, KLF4, AP-1, and RBPJ motifs, consistent with critical roles of each of these factors in DC differentiation 34 . Importantly, there was a clear ordering of each factor's activity in early versus late differentiation; IRF8, BCL11A, and SPI1 motifs exhibited accessibility early in CDPs, while AP-1 and RBPJ factors increased in accessibility late in terminal differentiation (Fig. 3k). For pDCs, two possible trajectories could be observed, supporting previous reports that this lineage can arise from both myeloid-and lymphoid-committed progenitors [35][36][37] . A first pDC trajectory transitioned directly from lymphoid-primed multipotent progenitors to differentiated pDCs, while a second trajectory traversed CMP, GMP, MDP, and CDP stages prior to pDC differentiation (Fig. 3k). Analysis of TF deviations revealed that each pathway relied on the same regulatory program, which included RUNX, IRF8, SPIB, BCL11A, and TCF4 factors, as previously demonstrated in murine and human pDCs 34 . Importantly, we did not observe significant epigenomic heterogeneity within terminal pDCs, suggesting that divergent cellular trajectories can achieve nearly identical cell states through activation of a common regulatory program.

Single-cell chromatin landscape of intratumoral immunity
We next applied this method to primary solid tumor biopsies from BCC patients receiving PD-1 blockade. BCC is the most common cancer in humans worldwide, and recent studies demonstrated that a subset of patients with advanced BCC show significant clinical benefit from immunotherapies based on blocking the T cell inhibitory receptor PD-1 38 . However, as in many other cancers, PD-1 blockade is clinically ineffective in more than half of BCC patients 39,40 . Thus, our goal was to identify cell types that were responsive to therapy and the regulatory mechanisms controlling their activity in responder versus non-responder patients. In addition, these experiments demonstrated the feasibility of applying this method to sparse samples (as low as 500 sorted cells) from clinical biopsies.
We performed scATAC-seq on serial tumor biopsies pre-and post-PD-1 blockade (pembrolizumab) from 5 patients, plus post-therapy biopsies from 2 additional patients (total of 7 patients; 14 timepoints sampled; Fig. 4a and Supplementary Table 3). All patients had histologically-verified locally advanced or metastatic BCC and were poor candidates for surgical resection. To minimize non-therapy-related immune cell variation, we excluded patients with prior exposures to checkpoint blockade, or to systemic immune suppressants within 4 weeks of biopsy. We dissociated tumors into single-cell suspensions using enzymatic dissociation and sampled cells both in an unbiased fashion, analyzing total cells, or after cell sorting to enrich for T cells (CD45 + CD3 + ), non-T immune cells (CD45 + CD3 -) and/or stromal and tumor cells (CD45 -; Supplementary Fig. 6a and Methods). To enable pre-and post-therapy cell comparisons, biopsies were site-matched in each patient, and when possible, cells were sampled in the same manner at each timepoint. In total, we generated high-quality scATAC-seq profiles from 37,818 cells. Cells passing filter yielded on average 15 x 10 3 unique fragments mapping to the nuclear genome, and approximately 62.5% of Tn5 insertions were within peaks present in the aggregate ATAC-seq profile from all cells, demonstrating that we could obtain high-quality scATAC-seq profiles from solid tumor biopsies ( Fig. 4b and Supplementary Fig. 6b-d).
We clustered scATAC-seq profiles with LSI followed by SNN clustering and visualized clusters with UMAP, and identified 20 scATAC-seq clusters ( Fig. 4b-c). Classification of clusters using cis-element accessibility and gene scores revealed a diverse ecosystem of cell types in the BCC TME, including 9 T cell clusters (high accessibility of CD3E, CD8A, and CD4 enhancers), 2 natural killer (NK) cell clusters (high accessibility of KLRC1 and NCR1 enhancers), B cells and plasma cells (high accessibility of CD19 and SDC1 enhancers, respectively), myeloid cells that comprised mDCs and tissue macrophages (high accessibility of CD86, CSF1R, and FLT3 enhancers), stromal endothelial cells and fibroblasts (high accessibility of CD31 and COL1A2 enhancers, respectively), and 4 tumor cell clusters (high accessibility of KRT14 enhancers; Fig. 4bd and Supplementary Fig. 6e). Notably, stromal and immune cells from different patients clustered together, demonstrating that these clusters did not represent patient-specific cell states or batch effects. In contrast, tumor cell clusters were largely patient-specific, consistent with prior single-cell RNA-seq studies in melanoma and head and neck cancer 41,42 , and perhaps reflecting cell state changes driven by patient-specific genome alterations (Fig. 4c). To identify potential genome alterations driving each tumor cell state, and to further support the distinction of malignant and non-malignant cells, we estimated copy number variation (CNV) from scATAC-seq data ( Fig. 4e and Methods). This analysis revealed CNVs in Clusters 17-20, compared to other stromal cell populations. For example, tumor cells in patient SU010 showed ATAC-seq signal consistent with amplifications of regions of chromosomes 3 and 6, which were present in both pre-and post-therapy samples (Fig. 4e). Finally, we analyzed TF activity in each cell type and found distinct patterns of activity in immune cells, compared to stromal or tumor cells (Fig.  4f). Immune cells displayed high accessibility of TBX21, EOMES, RUNX, BCL11A, SPI1, and IRF motifs, while stromal and tumor cells displayed high accessibility of NFkB, CEBP, p63, and AP-1 motifs ( Fig. 4f and Supplementary Fig. 7a-b). Moreover, tumor cells showed high accessibility of GLI1 motifs, consistent with the critical role of the Hedgehog signaling pathway in BCC (Supplementary Fig. 7b) 43 .

Epigenetic landscape of T cell exhaustion after PD-1 blockade
We asked whether TME chromatin landscapes could identify epigenetic regulators of the anti-tumor T cell response. Since T cells can be activated by targeting either T cell inhibitory receptors (such as PD-1) or by targeting inhibitory receptor ligands (such as PD-L1) expressed on stromal cells, we examined this question in the context of both T and stromal cell populations. First, we analyzed cis-elements near genes encoding the known inhibitory ligands, CD47, TGFb, and PD-L1 (Fig. 4g) [44][45][46] . Analysis of scATAC-seq clusters identified distinct cis-element patterns for each gene across stromal and tumor clusters. For example, we identified 3 differentially-accessible cis-elements (-35kb, +97kb, and +103kb) in the CD47 locus, consistent with previously identified functional enhancers controlling CD47 expression (Fig. 4g) 47 . Importantly, the tumor necrosis factor-and NFkB-responsive +97kb and +103kb enhancers were specifically accessible in tumor cells, and not stromal cells, supporting previous reports that CD47 expression on tumor cells is responsive to inflammatory signals and may mediate escape from immune surveillance 47 . Similarly, in the TGFB1 locus, we identified 3 cis-elements (+29kb, +30kb, and +32kb) that were primarily accessible in stromal cells (endothelial cells and fibroblasts), consistent with the expression pattern of this gene in primary tumors ( Fig. 4g) 45 . We also identified 3 cis-elements in the PDL1 locus (+5kb, +9kb, and +43kb), which were previously shown to be active in 23 human cancers types in The Cancer Genome Atlas 48 . scATAC-seq data demonstrated the accessibility of these sites in tumors cells, stromal cells, and myeloid and B cells, supporting the broad expression pattern of this ligand, and suggesting that its expression may rely on the identical cis-regulatory elements in each cell type (Fig. 4g).
We next examined regulatory landscapes of intratumoral T cells. We first reclustered 23,274 T cells (defined in Fig. 4) and identified 19 sub-clusters of intratumoral T cell states, revealing a rich diversity of T cell phenotypes in the TME (Fig. 5a). Classification of clusters using cis-element accessibility and gene scores for CD8A and CD4 showed that 6 clusters represented CD8 + T cell states and 13 clusters represented CD4 + T cell states (Fig. 5b). CD8 + T cell states included naïve T cells (high accessibility of CCR7 and TCF7 enhancers), effector T cells (high accessibility of EOMES and IFNG enhancers), memory T cells (high accessibility of EOMES enhancers, but low accessibility of effector gene enhancers), and exhausted T cells (TEx; high accessibility of ciselements proximal to inhibitory receptor genes PDCD1, CTLA4, and HAVCR2, and to T cell dysfunction genes CD101 and CD38 49 ; Fig. 5b and Supplementary Fig. 8a-b). We also identified a cluster consistent with an intermediate TEx cluster (Cluster 16), that exhibited gene score patterns of both TEx and memory T cells (Fig. 5b). CD4 + T cell states included naïve T cells and several T helper cell phenotypes, such as Tregs (high accessibility of FOXP3 and CTLA4 enhancers), T helper 1 (Th1) cells (high accessibility of IFNG and TBX21 enhancers), T helper 17 (Th17) cells (high accessibility of IL17A and CTSH enhancers), and T follicular helper (Tfh) cells (high accessibility of CXCR5, IL21, and BTLA enhancers; Fig. 5b and Supplementary Fig. 8a-b).
We focused on CD8 + TEx cells because previous studies demonstrated that this population is enriched for clonally-expanded tumor-specific T cells 41,50 , and that irreversibility of the TEx epigenetic state may limit the re-invigoration of tumor-specific T cells after PD-1 blockade 51 . Indeed, a comparison of pre-and post-PD-1 blockade profiles in our dataset showed that TEx cells were the most highly-expanded T cell cluster (Cluster 17) after therapy (Fig. 5c). More than 90% of TEx cells were derived from post-therapy biopsies, whereas memory and effector CD8 + clusters (Clusters 12 and 14) were equally derived from both timepoints. Notably, compared to other CD4 + T cell types, we observed an expansion of CD4 + Tfh cells post-therapy -nearly to the same extent as TEx cellscompared to other CD4 + T cell types, suggesting that PD-1 blockade impacts both CD4 + and CD8 + cell states in the TME (Fig. 5c). We first analyzed cis-regulatory landscapes in TEx cells to: (1) measure global changes in chromatin accessibility in TEx cells, and (2) nominate individual cis-elements that may regulate TEx-specific inhibitory receptor expression. Across all T cell states, we identified 35,147 cis-elements that were specifically accessible only in a single cluster (mean: 3,361 peaks per cluster, range: 979 peaks (naïve CD8 + T)-10,808 peaks (Tregs), FDR < 0.01; Supplementary Fig. 8d). In TEx cells, we identified 4,598 such elements, demonstrating that human T cell exhaustion is accompanied by substantial global remodeling of the chromatin accessibility landscape, and that the epigenetic state occupied by this cell state is as distinct as most other recognized T cell states, consistent with prior studies in mice 49,51,52 . Analysis of individual TEx-specific enhancers identified novel regulatory elements in inhibitory receptor loci (Fig. 5d). For example, the PDCD1 locus (encoding PD-1) contained an intragenic ciselement (+5kb) with specific accessibility in TEx cells, suggesting that the persistent expression of PD-1 in exhausted human T cells is controlled by a single state-specific enhancer, and that the regulation of persistent PD-1 expression may be different in humans and mice 53 . In contrast, the CTLA4 and HAVCR2 loci showed TEx-specific activity of several distal cis-elements, compared to other CD8 + T cell states (Fig. 5d).
We used pseudotime ordering to identify TF trajectories in TEx differentiation, compared to effector or memory CD8 + T cells (Fig. 5e). Since we also observed an expansion of CD4 + Tfh cells post-therapy, we included this cell type in our comparison. As expected, the differentiation of naïve CD8 + T cells to either effector or memory cells identified the critical roles of EOMES and TBX21 (T-bet) motifs in effector and memory cell formation, once again supporting the validity of this analysis [54][55][56] (Fig. 5f). Effector cell pseudotime also demonstrated the late accessibility of other known regulators, including TFAP4 and YY1, among others 57,58 . Similarly, memory cell pseudotime also showed accessibility at HIF1A and E protein sites 59 . In contrast, TEx cells showed a distinct regulatory program, which progressed through two stages of differentiation (Fig.  5g). The first stage (intermediate TEx) showed new accessibility of cis-elements near inhibitory receptors, as well as elements near genes associated with tissue residency, such as ITGAE (CD103) 60 . Accordingly, this stage of differentiation was accompanied by accessibility of NR3C1 and NR4A1 motifs, factors immediately downstream of T cell receptor (TCR) signaling 61 , as well as the RUNX3 motif, a factor that programs tissue residency of CD8 + T cells in tumors (Fig. 5g) 62 . The second stage (terminal TEx) showed new accessibility of cis-elements near genes associated with terminal T cell dysfunction, such as CD101 and TOX 49,51 , as well as of new elements in stage 1 gene loci, such as CTLA4 (Fig. 5g). Importantly, this stage was accompanied by accessibility of a core set of TF motifs, which included NFKB1 and NFKB2, BATF, IRF4, and NFATC1, factors that are all directly downstream of TCR signaling, and three of which have been demonstrated to play crucial roles in T cell exhaustion in mice [63][64][65] .
Finally, we examined the epigenetic relationship between TEx and Tfh cells. Tfh cells have previously been observed in tumors and have been proposed as a prognostic indicator of patient survival and response to checkpoint blockade [66][67][68] .The inferred differentiation trajectory from CD4 + naïve T cells to Tfh cells showed new accessibility of cis-elements neighboring Tfh-specific genes, such as IL21 and BTLA, but also of elements near genes typically associated with TEx cells, such as inhibitory receptors, consistent with the known, but unexplained, expression of these genes in human Tfh cells ( Fig. 5h and Supplementary Fig. 8c-e) 69 . Strikingly, differentiation was accompanied by the accessibility of Tfh regulators, but also of the same core set of TF motifs associated with TEx differentiation, including NFKB2, BATF, IRF4, and NFATC1, suggesting a common program driving the development of TEx and Tfh cells downstream of PD-1 blockade (Fig. 5h-i and Supplementary Fig. 8f). Indeed, the abundance of TEx and Tfh cells was similar in all patients post-therapy, and in our small cohort, was associated with therapy response (Fig. 5j). Altogether, these results map the epigenetic landscape of intratumoral TEx cells in humans and suggest that chronic TCR signals drive a shared regulatory program in TEx and Tfh cells after PD-1 blockade (Fig. 5k).

Discussion
Recent advances in high-throughput single-cell profiling have driven new insights into cell types, RNA expression, and protein markers underlying biological processes. However, to date, single-cell chromatin accessibility profiling has been constrained by trade-offs between data quality, throughput in cell numbers, and cost per cell. Here we report a droplet-based solution for highly multiplexed single-cell chromatin accessibility profiling. The ATAC-seq profiles generated using this method are high-quality and enable an unbiased deconvolution of cis-and trans-regulatory elements underlying chromatin states with single-cell resolution. This massive scale of cell type and cell state reconstruction affords three key advantages: (1) comprehensive deconvolution of all cells in a tissue, including rare cell types and states, (2) unbiased reconstruction of cellular developmental trajectories, without the use of pre-defined markers, and (3) analysis of active regulatory DNA down to the level of individual genes and elements in single cells. The droplet-based scATAC-seq library cost is ~$0.4 per cell, has a lower multiplet rate compared to prior methods, and does not require cell sorting or non-commercial reagents.
Chromatin accessibility states of regulatory DNA encode and presage cell fates 8,70 . scATAC-seq is thus well-suited to track trajectories of cell differentiation, which can occur either as discrete steps or as a continuum of similar states. We developed a bottom-up, data-driven approach to iteratively group single cells together based on their accessible genomes, used groups of highly similar cells to reconstruct detailed cell type-specific cisand trans regulatory maps, and highlighted disease-associated enhancers that are uniquely active in specific cell types. Moreover, the dense single-cell clusters enabled unbiased computational inference of sequential cell state transitions that could reconstruct the developmental trajectories of individual cell types, for example recapitulating decades of research on B cell and DC development. Importantly, scATACseq of tumor-infiltrating lymphocytes from patient biopsies identified new regulatory programs controlling T cell exhaustion and a previously unrecognized shared program with T follicular helper cells. It is tempting to speculate that this shared program may reflect an evolutionarily conserved pathway to synchronize CD4 + and CD8 + T cell responses to chronic pathogen infection, such that CD4 + Tfh cells support antibody formation as well as long-term activation of CD8 + T cells, perhaps through IL-21 [71][72][73] . Nevertheless, future studies targeting these regulatory pathways may nominate therapeutic interventions that synergize with PD-1 blockade in cancer. In summary, we describe a method for generating large-scale single-cell chromatin accessibility profiles on a widely-distributed single-cell investigation platform, enabling unbiased discovery of cell types and regulatory DNA elements in complex tissues.

Acknowledgments
We thank members of the Chang and Greenleaf laboratories and 10x Genomics for helpful discussions. We thank the following people at 10x Genomics: Alaina Puleo for sorting cells, John Chevillet for training, Zachary Bent and Michael Dodge for reagents development, Rachel Gerver and Wei Wang for microfluidics, and Abby Gallegos, Alvaro Gonzales, Nikka Keivanfar, Shamoni Maheshwari, Patrick Marks, Jeff Mellen, Rudy Rico, and Kevin Wu for computational and software support. We thank Xuhuai Ji, Dhananjay Wagh, and John Coller at the Stanford Functional Genomics Facility, and Charles Bruce at 10x Genomics for sequencing support and Alexander Valencia for assistance with clinical specimen processing.

Human subjects
This study was approved by the Stanford University Administrative Panels on Human Subjects in Medical Research, and written informed consent was obtained from all participants.

Cell lines and PBMC/BM samples
Human (GM12878) and Mouse A20 (ATCC TIB-208) B Lymphocytes were acquired and cultured according to guidelines from Coriell and ATCC, respectively. Fresh PBMCs, GM12878, and A20 cells were frozen according to the instructions outlined here: https://assets.ctfassets.net/an68im79xiti/2ptJYphPcPGfSPisq0cVuu/c8a83f93383c2fd1c e7cc49abc837992/CG000169_DemonstratedProtocol_NucleiIsolation_ATAC_Sequenci ng_Rev_B.pdf. Briefly, PBMCs were cryopreserved in IMDM + 40% FBS + 15% DMSO. GM12878 and A20 cells were cryopreserved in RPMI + 15% FBS + 5% DMSO. For the monocyte and T cell mixing experiments, nuclei were first extracted and transposed, then mixed at indicated ratios. To avoid pipetting errors, a large number of nuclei were mixed after nuclei extraction and transposition, and a smaller number of nuclei were loaded onto the microfluidics chip for scATAC library generation. We also performed a similar mixing experiment using naïve and memory T cells (Supplementary Table 1), which performed similarly and is included in the available data section.

BCC sample collection and cell sorting
Fresh BCC biopsies were digested in 5 mL DMEM/F12 media + 250 µg/mL Liberase TL and 200 U/mL DNAse I with the gentleMACS Octo system at 37°C for 3 hours at 20 rpm. After tissue pieces were fully digested, 50 µL of 500 mM EDTA was added and samples were collected by centrifugation at 300xg for 5 minutes. Single-cell suspensions were filtered through 70 µm mesh and pelleted by centrifugation at 300xg at 4°C for 10 minutes. Finally, cells were then resuspended in 1 mL of RPMI media and cryopreserved in FBS supplemented with 10% DMSO.

Single-cell ATAC-Seq using the 10x Chromium platform
All protocols to generate scATAC-seq data on the 10x Chromium platform, including sample prep, library prep, instrument and sequencing settings, are described below and are also available here: https://support.10xgenomics.com/single-cell-atac.

Nuclei Isolation
The isolation, washing, and counting of nuclei suspensions were performed according to the Demonstrated Protocol: Nuclei Isolation for Single Cell ATAC Sequencing (10x Genomics). Briefly, anywhere from 100,000 to 1,000,000 cells were added to a 2 ml microcentrifuge tube and centrifuged (300 rcf for 5 min at 4°C). The supernatant was removed without disrupting the cell pallet and 100 µl chilled Lysis Buffer (10 mM Tris-HCl (pH 7.4); 10 mM NaCl; 3 MgCl2; 0.1% Tween-20; 0.1% Nonidet P40 Substitute; 0.01% Digitonin and 1% BSA) was added then pipette mixed 10 times. The microcentrifuge tube was then incubated on ice, with the length of time optimized for each cell type: GM12878 and A20 cell lines were 5 min; PB and BM cells were 3 min. Following lysis, 1 mL of chilled Wash Buffer (10 mM Tris-HCl (pH 7.4); 10 mM NaCl; 3 MgCl2; 0.1% Tween-20 and 1% BSA) was added and the resulting solution and pipette mixed 5 times. Nuclei were centrifuged (500 rcf for 5 min at 4°C) and the supernatant removed without disrupting the nuclei pellet. Based on the starting number of cells and desired final nuclei concentration, an appropriate volume of chilled Diluted Nuclei Buffer (10x Genomics; 2000153) was used to resuspend nuclei. The resulting nuclei concentration was determined using a Countess II FL Automated Cell Counter. Nuclei were then immediately used to generate single cell ATAC-seq libraries as described below.

Availability of Data Processing and Analysis Software
All data processing steps and methods used in the manuscript are described in detail below. We also have designed and made the following tools freely available: Cell Ranger ATAC: This software performs initial data processing of scATAC sequencing reads (including de-multiplexing, genome alignment, and read de-duplication), as described below and used in this manuscript. This software will also perform additional downstream analysis, including the identification of open chromatin regions, motif annotations, and differential accessibility analysis, similar to what was performed in this manuscript and described below. https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/what-is-cellranger-atac Loupe Cell Browser: This is an interactive visualization software that shows ATAC-seq peak profiles for scATAC-seq cell clusters, similar to the analysis done in this manuscript and described below. https://support.10xgenomics.com/single-cell-atac/software/visualization/latest/what-isloupe-cell-browser

Data Processing using Cell Ranger ATAC Software
The Cell Ranger (CR) Software (version 1.0; https://support.10xgenomics.com/singlecell-atac/software/pipelines/latest/algorithms/overview) was used for alignment, deduplication, and identification of transposase cut sites. First, the 16bp barcode sequence was processed to fix the occasional sequencing error in barcodes. Barcode sequences were obtained from the I2 index reads. An observed barcode not present in the whitelist of barcodes can be corrected to a whitelist barcode if it is within 2 hamming distance away, and has >90% probability of being the real barcode (based on the abundance of the barcode and quality value of incorrect bases). Then, the cutadapt tool was used to identify and trim any adapter sequence in each read. Third, the trimmed read pairs were aligned to a reference using BWA-MEM with default parameters. Reads less than 25bp are not aligned and flagged as unmapped. Fragments are identified as read pairs with MAPQ>30, non-mitochondria reads and not chimerically mapped. The start and end of the fragments are adjusted (+4 for + strand and -5 for -strand) to account for the 9bp region the transposase enzyme occupies during the transposition. Lastly, fragments with identical start and end positions were counted once. The most common barcode sequence is assigned to the fragments, with ties broken by picking the barcode sequence with the highest read counts. One of the read-pairs with that barcode sequence is labeled as the 'original' and the other read-pairs in the group are marked as duplicates of the fragment in the BAM file.

scATAC-seq Data Analysis Filtering Cells by TSS Enrichment and Unique Fragments
Enrichment of ATAC-seq accessibility at TSS was used to quantify data quality without the need for a defined peak set. Calculating enrichment at TSS was performed as previously described 48 , and TSS positions were acquired from the Bioconductor package from "TxDb.Hsapiens.UCSC.hg19.knownGene". Briefly, Tn5 corrected insertions were aggregated +/-2,000 bp relative (TSS strand-corrected) to each unique TSS genome wide. Then this profile was normalized to the mean accessibility +/-1,900-2,000 bp from the TSS and smoothed every 51bp in R. The calculated TSS enrichment represents the max of the smoothed profile at the TSS. We then filtered all single cells that had at least 1,000 unique fragments and a TSS enrichment of 8 for all data sets.

Generating a Counts Matrix
To make a cell by feature counts matrix, we first read each fragment into R using readr. Next, we converted these fragment GenomicRanges into Tn5 insertion GenomicRanges by concatenating GenomicRanges for each "start" and "end" of the fragments (1bp width). Next, we used "findOverlaps" to find all overlaps with the feature by insertions. Then we added a column with the unique id (integer) cell barcode to the overlaps object and fed this into a sparseMatrix in R. To calculate the fraction of Tn5 insertions in peaks, we used the colSums of the sparseMatrix and divided it by the number of insertions for each cell id barcode using "table" in R. The counts matrix was then log-normalized by using edgeR's "cpm(matrix , log = TRUE, prior.count = 3)" in R. The prior count is used to lower the contribution of variance from elements with lower count values as previously described. This normalization assumes that global differences in accessibility are minor to relative differences by depth normalizing within accessible regions.

Generating Union Peak Sets with Latent Semantic Indexing
We created a union peak set by adapting a previous workflow as follows 12 . Prior to calling peaks, we constructed 2.5kb windows that were tiled across the genome by using "tile(hg19chromSizes, width = 2500)" in R. Next, a cell by window sparse matrix was computed by counting the Tn5 insertion overlaps for each cell using "findOverlaps" in R as described above. This matrix was then binarized and pruned to the top 20,000 most accessible sites across all cells. We then reduced the dimensionality as previously described by computing the term frequency-inverse document frequency ("TF-IDF") transformation 9 . Briefly we divided each index by the colSums of the matrix to compute the cell "term frequency". Next we multiplied these values by log(1 + ncol(matrix) / rowSums(matrix)) which represents the "inverse document frequency". This normalization resulted in a TF-IDF matrix that was then used as input to irlba's singular value decomposition (SVD) implementation in R. We then retained only the 2nd-25 th dimensions (1 st dimension is associated with cell read depth 12 ) and created a Seurat object and identified crude clusters using Seurat's SNN graph clustering (v2.3) with "FindClusters" with a default resolution of 0.8. If the minimum cluster size was below 200 cells, the resolution was decreased until this criterion was reached leading to a final resolution of 0.8 N (where N represents the iterations until the minimum cluster size is 200 cells).
Peak calling for each cluster was performed independently to get high-quality, fixed-width peaks that represent the epigenetic diversity of all samples based on previous work 48 . For each cluster, peak calling was performed on Tn5-corrected single-base insertions (each end of the Tn5-corrected fragments) using the MACS2 callpeak command with parameters "--shift -75 --extsize 150 --nomodel --call-summits --nolambda --keep-dup all -q 0.05." The peak summits were then extended by 250bp on either side to a final width of 501bp, filtered by the ENCODE hg19 blacklist (https://www.encodeproject.org/annotations/ENCSR636HFF/), and then filtered to remove peaks that extend beyond the ends of chromosomes.
Overlapping peaks called within a single sample were handled using an iterative removal procedure as previously described 48 . First, the most significant peak is kept and any peak that directly overlaps with that significant peak is removed. Then, this process iterates to the next most significant peak and so on until all peaks have either been kept or removed due to direct overlap with a more significant peak. This was performed on each cluster's peak set and then the top 100,000 extended summits (ranked by MACS2 score) were kept to arrive at a "cluster-specific peak set" for each cluster. We then normalized the MACS2 peak scores (-log10(q-value)) for each sample and converted them to a "score quantile" by converting each individual score to a quantile using "trunc(rank(v))/length(v)" in R (where v represents the vector of MACS2 peaks scores). This normalization allows for direct comparison of peaks across clusters, enabling the generation of a union peak set for each dataset.
We next compiled a union peak set containing the important peaks observed across all clusters. First, all cluster peak sets were combined into a cumulative peak set and trimmed for overlap using the same iterative procedure mentioned above. Again, this procedure keeps the most significant (in this case, score quantile) peak and discards any peak that overlaps directly with the most significant peak. Lastly, we removed any peaks that spanned a genomic region containing "N" nucleotides and any peaks mapping to the Y chromosome.

Reads-in-peaks-normalized bigwigs and sequencing tracks
To visualize ATAC-seq cluster data genome-wide, we created ATAC-seq signal tracks that have been normalized by the number of reads in peaks, as previously described 48 .
Briefly, we created fragment files that contained all cells belonging to a specific cluster and then counted the number of Tn5 insertions in the corresponding peak set. The number of Tn5 insertions were computed in windows genome-wide using "slidingWindows(chromSizes,100,100)". Next, we created a run-length encode using "coverage" in R and normalized the total reads to a scale factor that normalizes the readsin-peaks to 10 million reads within peaks. This object was then converted into a bigwig using rtracklayer "export.bw" in R. For plotting tracks, the bigwigs were read into R using rtracklayer "import.bw(as="Rle")" and plotted within R or visualized with WashU Epigenome browser. All track figures in this paper show groups of tracks with matched normalized y-axis scales.
To visualize scATAC-seq data, we read the fragments into a GenomicRanges object in R. We then computed sliding windows across each region we wanted to visualize every 100 bp "slidingWindows(region,100,100)". We computed a counts matrix for Tn5corrected insertions as described above and then binarized this matrix. We then returned all non-zero indices from the matrix (cell x 100bp intervals) and plotted them in ggplot2 in R with "geom_tile".

ATAC-seq-centric Latent Semantic Indexing clustering and visualization
We clustered scATAC-seq data using an approach that does not require bulk data or prior knowledge. To achieve this, we adopted the strategy by Cusanovich et. al 9 , to compute the term frequency-inverse document frequency ("TF-IDF") transformation. Briefly we divided each index by the colSums of the matrix to compute the cell "term frequency." Next, we multiplied these values by log(1 + ncol(matrix) / rowSums(matrix)), which represents the "inverse document frequency." This resulted in a TF-IDF matrix that was used as input to irlba's singular value decomposition (SVD) implementation in R. We then used the first 50 reduced dimensions as input into a Seurat object and then crude clusters were identified by using Seurat's (v2.3) SNN graph clustering "FindClusters" with a default resolution of 0.8. We found that there was detectable batch effect that confounded further analyses. To attenuate this batch effect, we calculated the cluster sums from the binarized accessibility matrix and then log-normalized by using edgeR's "cpm(matrix , log = TRUE, prior.count = 3)" in R. Next, we identified the top 25,000 varying peaks across all clusters using "rowVars" in R. This was done on the cluster log-normalized matrix vs the sparse binary matrix because: (1) it reduced biases due to cluster cell sizes, and (2) it attenuated the mean-variability relationship by converting to log space with a scaled prior count. These 25,000 variable peaks were then used to subset the sparse binarized accessibility matrix and recomputed the "TF-IDF" transform. We used singular value decomposition on the TF-IDF matrix to generate a lower dimensional representation of the data by retaining the first 50 dimensions. We then used these reduced dimensions as input into a Seurat object and then crude clusters were identified by using Seurat's (v2.3) SNN graph clustering "FindClusters" with a default resolution of 0.8. These same reduced dimensions were used as input to Seurat's "RunUMAP" with default parameters and plotted in ggplot2 using R.
For sub-clustering analyses (Hematopoiesis: CD34 + BM and DCs; Tumor: T cells), we computed the cluster sums again and log-normalized using edgeR's "cpm(matrix , log = TRUE, prior.count = 3)" in R. We identified the top 10,000 and 5,000 varying peaks for CD34 + cells and T cells, respectively. These variable peaks were then used to subset the sparse binarized accessibility matrix and recompute the "TF-IDF" transform. We then used singular value decomposition on the TF-IDF matrix to generate a lower dimensional representation of the data by retaining the first 25 dimensions. We then used these reduced dimensions (1-25 and 2-25, respectively) as input into a Seurat object, and then crude clusters were identified by using Seurat's (v2.3) SNN graph clustering "FindClusters" with a default resolution of 0.8. These same reduced dimensions were used as input to Seurat's "RunUMAP" and plotted in ggplot using R.

Inferring copy number amplification
To infer DNA copy number amplifications from scATAC-seq data, we first tiled the genome into 10-Mb windows using "slidingWindows" of GenomicRanges for chromosome sizes in R with a step size of 2Mb. These window positions were then filtered against regions with known artefactual mapping issues using the ENCODE hg19 blacklist with the "setdiff" function in R. Then, a cell by window binarized matrix was constructed, as described above. Next, the insertions per bp was determined within each filtered 10-Mbp window. The percent GC content was computed for each filtered 10-Mbp window using the hg19 BSgenome in R. To estimate if a region is amplified, we identified the 100 nearest neighbors based on GC content and computed the average log2(fold change). If this was above 1 we considered this region as a candidate for amplification. By looking at the median log2(fold change) for each patient across each cluster we tried to identify amplified regions.

Transcription factor footprinting
We wanted to characterize relative TF occupancy through TF footprints, as previously described 48 . For each peak set, we used CIS-BP motifs (from chromVAR motifs human_pwms_v1) to calculate motif positions using motifmatchr "matchMotifs(positions = "out")". Next, we computed the Tn5 bias for each sample by constructing hexamer bias table using "oligonucleotidefrequency" function from Biostrings in R. Then, we calculated a hexamer table for each TF by counting the hexamers relative to each stranded motif position +/-250 bp from the motif center. Using the sample's hexamer frequency table, we could then compute the expected Tn5 insertions by multiplying the hexamer position frequency table by the observed/expected Tn5 hexamer frequency.
To assess the reproducibility of footprints, we subsampled each cluster fragments 2 times at a sampling rate of 60% to have maximum variability. To calculate the insertions around these sites, we converted the Tn5-corrected insertions GenomicRanges (see above) into a coverage run-length encoding using "coverage". For each individual motif, we iterated over the chromosomes, computing a "Views" object using "Views(coverage, motif positions)". This "Views" object was converted to a matrix using "as.matrix" which was then used to calculate the following and the colSums for "-stranded" motifs were reversed and the colSums for NOT "-stranded" motifs were summed. To better compare footprints across samples, we normalized these footprints by the mean values +/-200-250 bp from the motif center. Next, we divided the footprints by the expected Tn5 bias to attempt to account for the inherent Tn5 bias. While this strategy is effective, it does not fully account for all of Tn5's sequence bias. We then plot the mean and standard deviation for each footprint pseudo-replicate.

ChromVAR
In addition to TF footprinting, we measured global TF activity using chromVAR 4 . We used as input the raw insertion counts for all peaks and the CIS-BP motif (from chromVAR motifs "human_pwms_v1") matches within these peaks from motifmatchr. We then computed the GC bias-corrected deviation scores using the chromVAR "deviationScores" function. All plots used the "deviationScores" in R and variability was computed by using "rowVars" in R.

Computing Gene Activity Scores using Cicero Co-Accessibility
We calculated gene activities using the R package Cicero as previously described 20 . We first used the sparse binary matrix and created a cellDataSet, detectedGenes, and estimatedSizeFactors. Next, we created a "cicero_cds" with k=50 and the "reduced_coordinates" being the corresponding UMAPs. This function returns aggregated accessibility across groupings of cells based on nearest-neighbor rules. We then identified all peak-peak linkages that were within 250 kb by resizing the peaks to 250 kb and then overlapping them with the peak summits/centers. We removed all duplicates and same peak-peak links. Next, we calculated the pearson correlation for each peakpeak link and created a connections data.frame where the first column is peaki and the second column is peakj and third coaccessibility (pearson correlation). We then created a gene data.frame by getting genes from the TxDb "TxDb.Hsapiens.UCSC.hg19.knownGene" in R. We altered the start of "MEF2C" to 88014057 because this alternative transcript start site is where we see strong promoter accessibility. We then resized each gene to its TSS and created a window +/-2.5 kb from the TSS and then annotated our "cicero_cds" using "annotate_cds_by_site". We then calculated gene activities by using "build_gene_activity_matrix" with a coaccess cutoff of 0.35. Lastly we normalized the gene activities by using "normalize_gene_activities" and the read depth of the cells. We wanted to adapt these activity scores to be more interpretable as pseudo RNA-seq so we further log normalized them by computing "log2(GA*1000000 +1)" which we refer to as further in the text as log2(GA + 1) (analogous to logCPM transformation). This conversion made the scores more interpretable and allowed them to be used further in TF deduplication and cell annotations.

GWAS PICs Causal Autoimmune Variants using Cicero Co-Accessibility and chromVAR
We sought to characterize cell-type specific enrichments in known disease associated regulatory elements. To do this analysis, we downloaded the causal SNPs for 39 diseases from http://pubs.broadinstitute.org/pubs/finemapping/dataportal.php. We then converted these into a GenomicRanges object and overlapped them with ATAC-seq peaks. We wanted to increase our power beyond direct overlaps, thus we used our cicero coaccessibility links to extend direct overlaps to peaks that are also co-accessible. To do this analysis, for every peak that had an overlap, we took all peaks that were coaccessible above 0.35 and then marked them as overlapping. We then created an overlap matrix for every group of SNPs partitioned by disease. We then used this as input to chromVAR's "computeDeviations" with the scATAC-seq raw insertion counts for all peaks. We then used the "deviationScores" from chromVAR and plotted the median score across each cluster in R.

HiChIP MetaV4C support for Cicero Co-Accessibility Links
We wanted to further support predicted Cicero co-accessibility links using previously published 3D chromosome conformation data as previously described 48 . Briefly, we used published histone H3K27ac HiChIP data from primary T cell subsets 23 (Naive, Th17, and Treg) to support our predicted Cicero co-accessibility links. First, we converted our links to 10kb resolution by flooring each coordinate (gene start and peak center) to the nearest 10kb window and deduplicated. To make distance-scaled Meta-Virtual 4C plots, each chromosome was retrieved from the ".hic" interactions file using juicer dump at 10-Kb resolution and read into a "sparseMatrix" in R (each coordinate in the matrix corresponding to a 10-Kb interaction bin). Then, for each peak-to-gene link longer than 100kb, the upstream or downstream window (depending on the peak's location relative to the TSS) was identified and then interpolated linearly using the "approx" function to get the value at each 0.1%. This was then summed for each predicted interaction and divided by the total number of predicted interactions. Replicate reproducibility was visualized with the mean profile shown as a line and the shading surrounding the mean representing the standard deviation between replicates. Lastly, we wanted to test the specificity of our scATAC-seq T cell clusters in Naïve, Th17 and Treg H3K27ac HiChIP. We computed the cluster sums for each cluster from the binarized accessibility matrix for clusters 21 (Naïve), 24 (Memory) and 25 (Treg), log-normalized and computed the row-wise zscores. We then took the top 25,000 peaks by Z-score for each cluster and then overlapped these with the cicero links. We then plotted the metav4C for each of the 3 subtypes with replicate reproducibility as described above.
Overlap of Cicero Co-Accessiblity Links with GTEx eQTLs eQTLs from the Genotype-Tissue Expression project were used to support the ATAC-seq defined Cicero co-accessibility links as previously described 48 . First, we identified all gene starts from gencode v19 (https://gtexportal.org/home/datasets) and extended them +/-2.5kb, as we did when computing the gene activities, and then overlapped all peaks with these regions using findOverlaps. We then labeled peaks that overlapped with the extended gene starts as a promoter peak and identified all ATAC-seq peak-to-promoter links. We chose to do this analyses with gencode v19 because we wanted the genes to match with the eQTL data set. GTEx eQTL data (version 7) was downloaded from https://gtexportal.org/home/datasets and the *.signif_variant_gene_pairs.txt.gz files were used. All eQTLs located more than 250kb away from the predicted gene pair were removed to maintain consistency with the 250kb window used in predicting ATAC-seq peak-to-promoter links. Next, the nearest genes for each eQTL were determined using "distanceToNearest" with the eQTL regions to all gene starts in gencode v19 in R. Then, all eQTLs that were paired with the nearest gene were removed to better test the predictive power of the non-nearest gene peak-to-gene link predictions. All peak-to-gene links were then overlapped with these filtered eQTLs using "findOverlaps" and then matched based on the predicted linked gene. To assess the significance of these overlaps, we created 250 random peak-to-gene link sets by taking all peaks from the hematopoiesis union peak set and randomly assigning these peaks to any gene within 250 kb of the peak summit. Then, we calculated the z-score and enrichment of our determined peak-to-gene links compared to the randomized peak sets. We then calculated the adjusted p-value using the Benjamini-Hochberg correction.

Constructing ATAC-seq Pseudo-Bulk Replicates of Maximal Variance
We wanted to perform analyses that treated each cluster as a bulk ATAC-seq sample, but required a method that could create replicates that convey close to the true population variance within a cluster and potential batch effects. For each cluster, we first checked whether each cluster contained 2 or more individual 10x runs that had at least 100 cells within the cluster. If true, then individual 10x runs cells were summed from the binarized matrix (maximum of 500 cells randomly sampled) to create pseudo-bulk replicates. If this condition was not true for the cluster, and there was at least one 10x run that had at least 100 cells and the remaining cells were at least 100 cells then one 10x run was summed from the binarized matrix (maximum of 500 cells randomly sampled) to create a pseudobulk replicate. Then for the remaining cells a pseudo-bulk replicate was constructed by randomly sampling 100 cells from the remaining cells 250 times and we kept the sampling that produced the highest within cluster total log-variance (cpm(mat, log=TRUE, prior.count=3) then summed "rowVars" with the 1 10x run replicate). If neither of these conditions were met, then two replicates of 100 cells were randomly sampled 250 times and we kept the sampling that produced the highest within cluster total variance (cpm(mat, log=TRUE, prior.count=3) then summed "rowVars" with the both sampled replicates). Lastly, if the cluster was smaller than 150 total cells, the number of cells sampled was 2/3 the size of the cluster. This workflow was designed to construct pseudoreplicates from single-cell clusters that produced high variance to attempt to capture true biological variation. In general, this approach is still an underestimate of variation when fully simulating replicates, so it's important to be conservative when using these pseudoreplicates in further analyses.

Constructing Gene Activity Pseudo-Bulk Replicates of Maximal Variance
We wanted to perform analyses that treated each cluster as a bulk ATAC-seq sample, but required a method that could create replicates that convey close to the true population variance within a cluster and potential batch effects. For each cluster, we first checked if within each cluster there were 2 or more different 10x runs cells that had at least 100 cells within the cluster. If this condition was true, then these individual 10x runs cells were averaged from the log(GA + 1) matrix (maximum of 500 cells randomly sampled) to create pseudo-bulk replicates. If this condition did not satisfy for the cluster, and there was at least 1 10x run that had at least 100 cells and the remaining cells were at least 100 cells then the 1 10x run was averaged from the log(GA + 1) (maximum of 500 cells randomly sampled) to create a pseudo-bulk replicate. Then for the remaining cells a pseudo-bulk replicate was constructed by randomly sampling 100 cells from the remaining cells 250 times and we kept the sampling that produced the highest within cluster total log-variance (summed "rowVars" with the 1 10x run replicate). If that condition was not met then two replicates of 100 cells were randomly sampled 250 times and we kept the sampling that produced the highest within cluster total variance (summed "rowVars" with the both sampled replicates). Lastly if the cluster was smaller than 150 total cells the number of cells sampled was 2/3 the size of the cluster. This workflow was designed to construct pseudo-replicates from single-cell clusters that had high variance to attempt to capture real biological variation. In general, this approach is still an underestimate of variation when fully simulating replicates, so it's important to be conservative when using these pseudo-replicates in further analyses.

Identification of cluster-specific peaks and gene activities through feature binarization
Once we had determined clusters from scATAC-seq data, we sought to identify peaks that were uniquely present within each cluster or combination of clusters. We modified a previously described approach to binarize each feature and then identify unique features in a simplistic manner 48 . For ATAC-seq, we used bulk pseudo-replicates and lognormalized the matrix using "edgeR::cpm(mat,log=TRUE,prior.count=3)". For gene activities, we used bulk pseudo-replicates and re-normalized the matrix by first converting to gene activities and then converting back to log by computing "log2(edgeR::cpm(2^logMat-1)+1)". Next we computed the mean and sd for each cluster using "rowMeans" and "rowSds" respectively in R. Next, for each feature peak or gene we ranked the clusters by their intra-cluster mean. Then, we iterated from the second lowest cluster asking whether the mean of that cluster: (1) is greater than the maximum intra-cluster mean plus the 0.5 times the intra-cluster standard deviation of the nextlowest cluster (1 sd for sub-cluster binarization), and (2): is greater than log2FC to the maximum intra-cluster mean of 0.25. This process was continued and the last time this criterion was met, the break point is labeled and all clusters above this intra-cluster mean were marked with a "1" and below a "0". If a peak does not have a break point, it is discarded. This binarization will capture peaks that are unique to multiple groups. Next, all classified peaks/genes that correspond to more than a total of the floor of a third the clusters used as input. Next, for each peak/gene a t-test was computed comparing all "1's" and "0's" and the p-values were adjust for multiple hypothesis through the Benjamini-Hochberg correction by "p.adjust(method="fdr")". All peaks/genes that had an adjusted pvalue below 0.01 were kept. Lastly, we filtered out all binarization patterns that were classified less than 25 times. These were then plotted in R using the package ComplexHeatmap.

Pseudotime Analysis
To order cells in pseudotime, we sought to create a trajectory that we could align cells to. We chose to use UMAP for alignment if cells were part of a continuous substructure, since local distances are better preserved. First, we described candidate trajectories by ordering clusters. Next, for each cluster we calculated the mean coordinates in both dimensions and filtered cells that were in the top 5% euclidean distance to the mean coordinates. Next, we computed the distance for each cell from clusteri to the mean coordinates of clusteri+1 along the trajectory. We then computed a pseudotime vector by calculating the quantiles for each cell by their distance to the next cluster and added the current iteration. This allowed us for each of the cells (not filtered), have a UMAP coordinate and a time component. Next, we fitted a continuous trajectory to both UMAP coordinates using "smooth.spline" with dof =250 and spar = 1. Then, we aligned all cells to the trajectory by their euclidean distance to the nearest point along the manifold. We then computed then scaled this alignment to 100 and used this as pseudotime for further analyses.
To further support longer trajectories in pseudotime, we wanted to test whether the trajectory is significant by its ordering. To test our trajectories, we took the latest cluster and then ranked the top 10,000 accessible peaks and computed the euclidean distance to all other clusters (logCPM). We then continued in this reverse trajectory and computed the distance to all other clusters that do not include the previous clusters for directionality.
To determine the significance of the ordering, we then permuted the ordering of the trajectory 5000 times and computed the average rank of the ordering for the permuted and input trajectory. This allowed for an empirical p-value calculation that we can then assign signified to this reduced dimension trajectory from the original accessibility matrix.
We then sought to create matrices that convey the pseudotime trends across features.
To do this analysis, we ordered the cells by their pseudotime and fit a smoothed line for each feature by using "geom_smooth" with method "gam" and formula "y ~ s(x, bs = "cs")" and n = 100. For peaks, this was done on the binarized sparse accessibility matrix, gene activities done using the log-normalized (log(GA + 1)) matrix, and chromVAR deviation scores were used in pseudotime analyses. We wanted to de-duplicate chromVAR CIS-BP motifs by correlating the gene activities of a TF to the inferred activity of chromVAR. We correlated TFs and their corresponding gene activities and then using cor.test in R kept the associated p-value and adjusted for multiple hypothesis through the Benjamini-Hochberg correction by "p.adjust(method="fdr")". Next, we computed the quantile for each TFs gene activity average and variance. We then averaged these quantiles to equally weight the log-average gene activity and log-variance. We then filtered the top 25% of TFs by this criterion and then further by TF-motif pairs that were correlated above 0.35 and FDR < 0.001. These were then used to represent TF-motif pairs identified are more likely to be involved in gene regulation across the identified pseudotime.

Barnyard Mixing Analysis
We sought to assess the rate at which multiplets (more than one cell per droplet) occur at different cell loadings to estimate the relative multiplets in our data. To calculate this rate, we performed human (GM12878) and mouse (A20) mixing experiments at loadings of 500, 1,000, 5,000 and 10,000 cells. We aligned them using 10x Cell Ranger. We then filtered cells as described above for both hg19 and mm10 genomes (TSS enrichment of 8 and 1,000 unique fragments). Next, we determined the effective multiplet rate first by computing the fragments to each genome to the combined fragments and then labeled multiplets by assigning cells that have fractional fragments less than 0.95 to either hg19 or mm10.
We then wanted to determine the effect of different numbers of cells and unique fragments used for peak calling by down sampling cells and unique fragments. We did this analysis by merging all GM12878 and A20 identified fragments from the mixing experiments into 1 fragments file. Next, we down sampled the fragments file by first the number of cells and then the number of fragments to make the unique fragments per cell match the desired output. We then called peaks on each of these fragments by creating a bed file of the Tn5 insertions (ends of the fragments) with MACS2 callpeak command with parameters "--shift -75 --extsize 150 --nomodel --call-summits --nolambda --keep-dup all -q 0.05." The peak summits were then extended by 250bp on either side to a final width of 501bp, filtered by the ENCODE hg19 blacklist (https://www.encodeproject.org/annotations/ENCSR636HFF/), and then filtered to remove peaks that extend beyond the ends of chromosomes. We then took the top 100,000 non-overlapping extended summits, as previously described 48 . We repeated this on the total fragments file to get a GM12878 and A20 peak set. Next then computed the fractional of peaks recovered by using "countOverlaps" and dividing by the extended summits of the total fragments file fin R. Lastly, we then counted the number of Tn5 insertions for each down sampled fragment file within the GM12878 and A20 peak set, log-normalized the matrix with "edgeR::cpm(mat, log = TRUE, prior.count = 3)" and computed the pearson correlation. We then plotted the results in R using "ggplot".

Frozen vs Fresh Analysis
We wanted to compare the effect of sample preparation on scATAC-seq data quality. Thus, we performed scATAC-seq on PBMCs that were freshly isolated, frozen, and frozen sorted for live cells. We then filtered the cells as described above (TSS enrichment 8 and 1,000 unique fragments) and then used our TF-IDF cluster peak calling framework as described above to get a peak set for each experiment. We then subsequently used our TF-IDF cluster analysis as described above (top 25,000 variable peaks with SVD dimensions 1-50) and computed clusters for each experiment. We then computed the Receiver Operating Characteristic (ROC) curve for both frozen samples against the fresh sample by using "overlapsAny" and ranking the peaks by MACS2 score in R. We then computed PCA with "prcomp" and correlations between the identified clusters and entire experiment within the fresh samples ATAC-seq peaks.

Spike-Ins Analysis
We sought to test our sensitivity with our analysis workflow by performing scATAC-seq on monocyte and T cell mixtures at various loadings. We then filtered the cells as described above (TSS enrichment 8 and 1,000 unique fragments) and then used our TF-IDF cluster peak calling framework as described above to get a peak set for each experiment. We then subsequently used our TF-IDF cluster analysis as described above (top 25,000 variable peaks with SVD dimensions 1-50) and computed clusters for each experiment. We then used "RunUMAP" with default parameters from Seurat (v2.3) to compute a UMAP for each spike-in experiment. We then computed the gene activities scores as described above by using the full hematopoiesis peak set, accessibility matrix, and co-accessible links and added each sample individually for calculating the gene activity scores. We then computed a monocyte score by taking the log(GA+1) average for CD14, MAFB, HLA-DRB1, TREML4, CSF1R, CEBPA, TLR4, HLA-DRA, and CD74. Lastly, we computed a T cell score by taking the log(GA+1) average for CD3E, CD2, CD5, CD7, IL7R, IL2, TCF7, CD3D, and CD3G.

Data Availability
All single-cell sequencing data have been deposited in the Gene Expression Omnibus (GEO) and are awaiting accessioning.       Shown are the number of unique ATAC-seq nuclear fragments in each single cell (each dot) compared to TSS enrichment of all fragments in that cell. Dashed lines represent the filters for high-quality single-cell data (1,000 unique nuclear fragments and TSS score greater than or equal to 8). (e) single-cell ATAC-seq data quality control filters in profiles generated using the C1 microfluidic system 11 (Fluidigm; left) or sci-ATAC-seq 12 (middle and right panels). scATAC-seq data, clustered as indicated in Figure 4b. Arrows indicate the position and distance (in kb) of intragenic or distal enhancers in each gene locus.