Abstract
It is now easy to perform multiome single-cell analysis, including both RNA and ATAC readouts from the same cell. This enables a closer linkage between the two types of modalities, but it remains an open question what more information can be extracted from this type of data. ATAC-seq is normally only used to assay transcription factor binding to open regions. By reanalyzing several large datasets, and generating an atlas of B cells, we show that telomere accessibility can better pinpoint processes related to cell cycle and chromatin condensation. We provide Telomemore, a tool that can extract telomeric reads, and give examples of new findings it enables. Our new findings will aid in the annotation and analysis of single-cell ATAC or multiome datasets.
Introduction
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) is a widely used method for finding open chromatin regions1. With this readout also becoming available in the commercial 10x Genomics “multiome” single-cell platform, ATAC-seq can be expected to become even more popular. It is thus important to understand how such data should be interpreted. ATAC-seq is based on the enzyme Tn5 fragmenting and tagging (“tagmenting”) chromatin, which is then sequenced (Figure 1a). The common analysis pipeline results in a dimensional reduction and clustering of similar cells. The accessibility of regions can then be compared, and in particular, transcription factor (TF) activity can be quantified based on motif presence in accessible regions2. Thus upstream regulators of different cell types and fates can be discovered.
(a) ATAC-seq is a modern method in which accessible chromatin is “tagmented”, i.e. fragmented by the enzyme Tn5 that also adds adapters for sequencing. These fragments are normally used to analyze enhancers. (b) Not all ChIP-seq peaks are present in ATAC-seq, and the reads outside the peaks are normally ignored. (c) The telomere is tagmented, despite being protected by the shelterin complex4. Some potential sites of tagmentation have been highlighted. (d) Reads mapped to the T2T reference genome, showing a tightly closed centromeric region. (e) Abundance of telomeres in different ATAC-seq-containing datasets. The B cell multiome dataset is included in this study; fibroblast and T cell datasets are under preparation. (f) Most telomeric reads contain sufficient sequence differences that deduplication is possible. The red line represents no duplicates. Each library has a different offset, suggesting the need for batch correction.
There are however several open questions about the interpretation of ATAC-seq data. For example, fragments mapping outside peak regions usually remain uninvestigated, and ChIP-seq peaks also exist in places without ATAC-seq peaks3. This suggests that some TF-bound regions are inaccessible (Figure 1b). We thus set out to explore what other information can be extracted from ATAC-seq data. We show that telomeres are tagmented, despite being protected by the shelterin complex4 (Figure 1c). Furthermore, we demonstrate that telomere fragment abundance reflects chromatin condensation, and interpret this in terms of cell cycle progression. Because the cell cycle is linked to many biological processes, and confounds single-cell analysis, telomere accessibility is of wide utility in interpreting single-cell data. We provide a new tool to compute it, Telomemore, and precomputed telomere abundances for several large public datasets. Finally, we provide a new single-cell atlas of human tonsil B cells and show that telomeres help contextualize genes related to somatic hypermutation.
Results
Telomeric but not centromeric regions are tagmentable in cells
With the new complete genome sequence from the telomere-to-telomere (T2T) consortium5, we could assay the origin of ATAC-seq reads after 10x Chromium library preparation. Centromeres are largely inaccessible (Figure 1d). We did, however, note the presence of telomeric reads. To annotate telomeres at high precision, we implemented a telomere counter (Telomemore) for single-cell ATAC. We define normalized telomere accessibility (nTA) as the telomere read count of a cell divided by its total ATAC read count. A comparison of nTA for different datasets is shown in Figure 1e. Telomere coverage varies widely between datasets, and is highest in samples prepared using the 10x Chromium RNA+ATAC multiome kit v1. All single-cell methods capture more telomeres than one large bulk ATAC PBMC dataset6.
Because individual cells have few telomere fragments, and these are not pure repeats of [TTAGGG]n, it was possible to deduplicate the reads based on their sequence. In PBMC multiome data, with 3,800 UMI/cell and 26,000 peaks/cell (after filtering), we find approximately 40% unique telomere reads and a near-linear correlation with the unduplicated telomere count (Figure 1f). Thus in applications where the telomeric reads are of high importance, deeper sequencing may be beneficial. We also find global differences between PBMC datasets that may need to be batch corrected. We therefore use the within-library rank percentile to correct nTA. The telomere counts, for each cell of the Figure 1e datasets, are provided as Supplementary data. For brevity, only a subset of these datasets is described in this study.
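The normalization and batch correction described above can be illustrated with a minimal Python sketch. It is not the Telomemore implementation: the function names, the input layout (barcode/sequence pairs and per-cell count dictionaries) and the tie handling in the rank percentile are assumptions made for illustration only.

```python
from collections import defaultdict


def deduplicate_telomere_reads(reads):
    """Count unique telomeric read sequences per cell.

    `reads` is an iterable of (cell_barcode, sequence) pairs; identical
    sequences within one cell are treated as duplicates (assumption).
    """
    unique = defaultdict(set)
    for barcode, sequence in reads:
        unique[barcode].add(sequence)
    return {barcode: len(seqs) for barcode, seqs in unique.items()}


def normalized_telomere_accessibility(telomere_counts, total_counts):
    """nTA per cell: telomeric read count divided by total ATAC read count."""
    return {
        cell: telomere_counts.get(cell, 0) / total
        for cell, total in total_counts.items()
        if total > 0
    }


def rank_percentile(values):
    """Map each cell's value to a rank percentile in (0, 1]; ties share the mean rank."""
    cells = sorted(values, key=values.get)
    n = len(cells)
    result = {}
    i = 0
    while i < n:
        j = i
        # Extend j over the run of cells sharing the same value.
        while j + 1 < n and values[cells[j + 1]] == values[cells[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[cells[k]] = mean_rank / n
        i = j + 1
    return result


def corrected_nta(nta_by_library):
    """Apply the rank percentile separately within each library."""
    return {lib: rank_percentile(nta) for lib, nta in nta_by_library.items()}
```

Ranking within each library removes the library-specific offset at the cost of discarding absolute nTA differences between libraries, so this form of correction suits comparisons of cells within and across libraries of similar composition.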
Telomere accessibility is linked to chromatin condensation, but not to telomere length
We expected telomere read abundance to be driven by telomere length, as longer telomeres can theoretically be divided into more fragments. To test this hypothesis, we used the common assumption that telomere length decreases with age. By extracting nTA from bulk ATAC-seq of PBMCs from 54 men and 66 women6, we could however conclude that nTA is not significantly correlated with age, and thus not with length (Figure 2a). This was true also after filtering out datasets that, according to the mapped read length distribution, had differences in library size selection.
(a) Bulk ATAC-seq nTA across PBMCs from donors of different age; no significant correlation is seen, and if any, it is not in the expected direction if driven by telomere length. (b) The bulk ATAC-seq nTA from cells sorted by cell cycle stage7. (c) Bulk ATAC-seq nTA in human and mouse CD4 T helper type 2 cells during the first 72h of activation3. The nTA drops as expected if it is driven by other enhancers opening as part of decondensation (green arrows are conceptual only). Transmission electron microscopy images showing the rapid chromatin decondensation have been generated previously8. (d) Analysis of sequencing data shows clear Tn5 sequence preference. Reads have been aligned based on GC-richness. Telomeres are not perfect repeats, degenerating the pattern over distance.
To clarify this lack of correlation, we extracted and interpreted telomere reads from a large number of bulk ATAC-seq datasets. Since telomeres are typically shortened during cell replication, a dataset in which cells had been sorted by cell cycle stage using flow cytometry was of particular interest7. Our analysis shows that nTA was reduced in the S-phase (Figure 2b). We therefore hypothesize a model in which the telomere accessibility is constant, but overall chromosome accessibility is not. In such a model, nTA should increase during chromatin condensation. This hypothesis is further supported by time course ATAC-seq data of mouse and human CD4 T cells undergoing in vitro activation (Figure 2c). Naive T cells, until activated, remain in a condensed dormant state which can even be seen using transmission electron microscopy (TEM)8. During activation, and entry into the cell cycle, nTA decreases, as predicted by the model.
Collectively, our analysis demonstrates that chromatin condensation is a driver of nTA, whereas telomere length is not in any significant manner. This is contrary to our initial expectation, as the length varies approximately 5-15 kb between humans and is also cell type dependent9, thus providing up to a 3-fold difference. A potential explanation would be if only a fixed-length part of the telomere is accessible to tagmentation, such as near the T- or D-loops (Figure 1c), or in a specific part of the subtelomeric region (some subtelomeric patches are accessible to m6A-MTase labeling10). An analysis of the telomere fragment base frequency shows that the reads consist of well-aligned [CCCTAA]n repeats (Figure 2d). The strong pattern shows that Tn5 has a strong preference to tagment before the CCC site. Since Tn5 normally prefers a G at the first base11, also for naked DNA, we cannot tell whether the shelterin proteins TRF1/TRF2, which target TTAGGG, affect the Tn5 sequence preference. The precise action of Tn5 on the telomere thus cannot be concluded given current data, and we thus simply accept the empirical trend of higher nTA in more condensed chromatin.
Telomere accessibility is linked to stemness and differentiation
The cell cycle is a natural driver of chromatin condensation, but not the only one. To get an overview of telomere accessibility, we reanalyzed publicly available multiome datasets of PBMCs (25,000 cells). We focused on T cells (Figure 3a) and monocytes (Figure 3b), annotated based on previous atlases (see Methods). To find drivers of nTA, we calculated the correlations between nTA, gene expression and motif activities (Supplemental File 1).
(a) UMAP of NK and T cells from multiome PBMC data. nTA is correlated with the effectorness/cytotoxicity axis. (b) UMAP of monocytes, where nTA is up in CD14loCD16+ non-classical monocytes (CD16=FCGR3A). The cell cycle inhibitor CDKN1C is up in this subset, which is not reflected in the standard RNA-seq based cell cycle inference method.
CD4 and CD8 T cells form distinct clusters but share the same most nTA-anticorrelated genes: LEF1 (CD8: −45%, CD4: −19%) and BACH2 (CD8: −50%, CD4: −18%). Similarly, the activity of the TCF1/LEF1 motifs is also the most anticorrelated (CD8: −49%, CD4: −19%). LEF1 binds to the minor groove of DNA, bending it, and is known to regulate large scale chromatin organization in CD8 T cells12. CD8-Cre Tcf1/Lef1 KO has shown the need for LEF1 in maintaining CD8 identity; its loss leads to expression of CD4-related genes and aberrant cytotoxicity13. Similar to LEF1, BACH2 also drives the program for stem-like CD8 T cells14. The lower nTA suggests that stem-like CD8 T cells have open chromatin. In contrast, CCL5 and several granzymes (GZMA, GZMK, GZMH) are among the most positively nTA-correlating genes. NK cells, which are also cytotoxic, and separate from CD4 and CD8, have the highest nTA. The chromatin of NK15 cells can be compared to T cells8 using TEM; resting NK cells have a highly compact chromatin, similar to naive T cells.
All monocytes are generally characterized by bilobed, horseshoe-shaped nuclei, but there are several subsets with different functions16. In this dataset, non-classical CD14loCD16+ monocytes have a much higher nTA. The second most positively nTA-correlated gene is CDKN1C (24%), which inhibits proliferation during the G1 phase17. This is at odds with the RNA-based annotation that shows a reduction in G1 phase. Non-classical monocytes tend to patrol capillaries16 and we speculate that the increased chromatin compactification may help their navigation; this would be similar to neutrophils, which also have multilobed chromatin that lets them enter small gaps in the endothelium and extracellular matrix.
To summarize, nTA pinpoints several interesting cases of chromatin compactification. For the immune system, it can map to an axis from stemness to a resting state, but also predicts nuclei morphological differences that may be linked to the function of the respective cell type.
Telomere accessibility across tissues and cell types pinpoints cell cycle related transcription factors
To obtain a global overview of TA in healthy adult humans, we investigated a recent atlas of chromatin accessibility in the human genome, generated using sci-ATAC-seq18. The nTA of the publicly available subset (500k cells) is shown per cell type and tissue in Figure 4a. Naive T cells have a high nTA, reflecting the previous observations on their condensed state. To find common drivers of nTA, we set up two linear models vs TF motif activity. The results are similar, both with and without regressing out cell type specific effects (Figure 4b-c, values in Supplemental File 2). At the global level, the activity of the motifs TP63, GRHL1 and FOSL2 correlates positively with nTA, and thus with chromatin condensation. Among the TP members, having similar motifs, the tumor suppressor and cell cycle inhibitor p53 is the most famous. Knock-out of GRHL1 also leads to cell cycle arrest in the G2/M phase19, the most condensed phase. Conversely, AR, MEF2D and NR3C1 drive chromatin decondensation. The androgen receptor (AR) is important for breast and prostate cancer proliferation20. The pleiotropic MEFs are activated at the G0/G1 transition, and promote S-phase entry21. Overall, the motifs can be related to cell cycle regulation and are consistent with the interpretation of nTAhi as a marker for chromatin condensation. However, we expect that more information can be extracted, e.g., GRHL1 has also been linked to aging, although in an insulin related manner22. Per-cell nTAs are available on GitHub for further exploration.
(a) Normalized telomere accessibility across human cells and tissues18. (b) Linear model showing motifs of enhancers that are correlated with nTA. (c) Linear model also removing cell type specific differences. (d) Motif sequences of some of the top drivers of nTA.
Telomere accessibility elucidates B cell somatic hypermutation
We reanalyzed a previous RNA-only single-cell B cell dataset23 and noticed a strong influence of the cell cycle. Inspired by this, we generated a new 10x multiome dataset of human tonsillar B cells from 5 individuals, aiming to better understand somatic hypermutation (SHM) and class switch recombination (CSR) in the germinal center (GC). About 7,200 B cells were kept after annotation and in silico filtration, covering follicular and germinal center B cells (Figure 5a-b). Overall we find good agreement between clustering based on ATAC-seq vs RNA-seq, with some exceptions; B cells with interferon response (IFI44L+, IRF-family motifs also enriched) are scattered in the ATAC-seq UMAP. The naive B cells are loosely split into DNMT3Ahi/lo cells in RNA and ATAC, but the differences are small. A large group of MLLT3hiAKAP6hi naive B cells is further away from the GC. These possibly reflect a housekeeping state: MLLT3 is part of the polymerase super elongation complex, while AKAP6 has been linked to non-centrosomal microtubule-organizing centers (MTOCs) during differentiation24. These might thus be the most naive B cells in our dataset. Naive cells, which are all Ccnd3hi, funnel through a Ccnd2hi state (“Follicular to GC”) which forms a bridge toward the GC. This bridge is up in activity for the NFKB2 and JUNB families. Coming back from the GC, there are 3 memory cell clusters in order of proximity to the GC: FCRL4hi, TOXhiFOXP1lo, and TOXloFOXP1hi. Despite no difference in cell cycle composition, these clusters are all nTAhi, with FCRL4hi being the most condensed; this may possibly be linked to the fact that FCRL4+ B cells are considered to be exhausted, or have a hyporeactive phenotype25.
(a) UMAP of the different B cell subtypes. (b) Expression levels of some marker genes for the different clusters/cell types. (c) UMAP of the GC, with AICDA expression and telomere accessibility. (d) Marker genes of the GC.
We separated the GC to obtain better resolution (Figure 5c-d). Leiden clustering splits it primarily by the cell cycle phase, and at higher resolution by Ig class. The LZ-DZ gradient aligns with CD83-CXCR4. However, based on marker genes, the several rounds of B cell cycling possibly cause the Leiden clustering to not recapitulate the GC maturation process26 well. Most cycling cells are in the DZ, but MAML3/DNER (Figure 6a) delineate a large cycling Notch-signalling population (DNER+ possibly late-stage), previously associated with the LZ27. These are enriched in class-switched IGHA* and IGHG*. Complementary to MAML3+ is a population of cycling unswitched IGHM+ cells, frequently also expressing CD44. These are all CUX1lo, contrary to the idea of CUX1 expression being cell-cycle dependent28. However, CUX1 also speeds up proliferation28, which can be related to the known acceleration of B cell proliferation29.
(a) Volcano plot of correlations between TF gene expression and corresponding motif activity. (b) Motif activities of TFs across clusters. (c) Gene expression patterns, not always well captured by per-cluster summaries.
The cluster of hypermutating AICDA+ cells is CD22loFOXP1loBACH2hiKLF12lo. However, nTA is higher in AICDA+ G2M cells, continuing into a separate cluster (from “GC DZ G2M” to “GC DZ AICDA*”, Figure 5c). Microscopy has shown that the mutagenase AICDA enters the B cell nucleus during mitosis, and that it acts during early G1 phase30. This suggests that AICDA expression does not need to overlap with its activity, and that nTA might be a more relevant measure of activity, in particular as chromatin condensation may shield other genes from mutagenesis, not just precisely during mitosis. The AICDA level is also known to increase with every cell division, which in turn is known to be correlated with BCR affinity, here possibly represented as an extended nTA/AICDA gradient toward the LZ clusters31. A pre-apoptotic subcluster is also present, being BMP7hi (which is pro-apoptotic32) and ZNF385Bhi (predicted to be involved in p53 binding).
The condensation is correlated with several genes relevant to proliferation and cell cycle (Figure 6a). The gene most strongly correlated with nTA is however the actin polymerization regulator KANK1 (22%), for which KO can lead to cell proliferation33. nTA also correlates with CDK13, which has been shown to increase Pol II processivity34. GWAS has associated CDK13 with the amount of IgD+CD38dim and IgD+CD24- B cells35, or lymphocyte count in general36,37. It is therefore possible that CDK13 upregulates DNA damage response genes, as previously shown for CDK1238, and thereby helps control the dose of SHM.
Because SHM and CSR depend on base excision repair (BER) and mismatch DNA repair (MMR), we specifically looked at the genes involved. These do not generally overlap with AICDA, with several genes limited to, or near, the S-phase (BER: POLQ, UNG, MSH2. MMR: LIG1, MSH2, MCM8, POLD3, PCNA). This suggests that BER/MMR happens separately from AICDA mutagenesis, after decondensation. AICDA also targets the telomere regions (WRCY motif), and these are lost without repair by UNG39. However, as the telomeres get spliced until UNG removes the uracil bases, which happens in the S-phase, we can rule out that free telomere fragments contribute to nTA during the mutagenesis. Overall our analysis shows that telomere accessibility can be given an interpretation that helps pinpoint AICDA activity during condensation of the B-cell chromatin.
Discussion
In this study we have shown that telomeric reads can be useful to assay chromatin condensation. This process is in turn intimately linked to the cell cycle and of broad relevance to single-cell data analysis. There is thus value in combined single-cell RNA and ATAC-seq beyond just transcription factor activity analysis. We have presented a tool, Telomemore, that enables easy extraction of this data from 10x datasets. Furthermore, we provide this data from several published datasets, covering over a million cells. As single-cell ATAC-seq data alone can be difficult to interpret, we have also generated new multiome data where the cell cycle is a dominant component. This data will likely prove useful for further methods development in the single-cell domain.
An alternative to measuring telomere accessibility may have been to extract RNA-seq reads of TERRA40 (Telomeric repeat-containing RNA). However, k-mer based searches in several 10x single-cell RNA-seq datasets typically yielded fewer than 5 potential reads/dataset. Thus the RT-capture efficiency of the current 10x RNAseq chemistry is insufficient, and ATAC-seq is necessary to capture telomeric information.
The production of this manuscript highlighted the need for the FAIR principles (Findability, Accessibility, Interoperability, and Reuse). Much single-cell data has only been made available as processed count tables with the implicit assumption that it contains all interesting information. However, others have made use of mtDNA to perform lineage tracing41, and we provide further uses of the raw data. We thus urge others to always release their sequencing data as raw FASTQ, and ideally also deposit it in the Human Cell Atlas to aid reprocessing. Finally, much effort of this study went into data access agreements, and this study would not have happened unless we had easy access to primary data. Thus we call for a discussion on what data should be considered “sensitive”, as needlessly hiding raw data slows down research and is against the interest of those benefiting from drugs derived from analysis of human single-cell data.
Online Methods
Analysis of bulk ATAC-seq
CCCTAA-motif scanning was performed using a custom python script on the first reads in the file. For the PBMC vs aging dataset6, we further investigated if the quality of the library impacted the telomere readout. Fragment size distributions were inspected using ATACFragQC v 0.4.5, showing differences probably caused by different size selection procedures. We removed outlier samples based on nucleosome signal, that is, the ratio of fragments between 147 bp and 294 bp (mononucleosome) to fragments < 147 bp (nucleosome-free). No qualitative difference was noticed after clean-up, nor was there a correlation between nTA and nucleosome signal.
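The nucleosome signal defined above can be written out as a short Python sketch. The inclusive handling of the 147 bp and 294 bp boundaries is an assumption, and so is the outlier rule: the text does not state how outlier samples were identified, so the median absolute deviation (MAD) cutoff and its factor `k` are purely illustrative.

```python
from statistics import median


def nucleosome_signal(fragment_lengths, nucleosome_bp=147):
    """Ratio of mononucleosome fragments (147-294 bp) to nucleosome-free fragments (< 147 bp).

    Returns None when there are no nucleosome-free fragments.
    """
    mono = sum(1 for n in fragment_lengths if nucleosome_bp <= n <= 2 * nucleosome_bp)
    free = sum(1 for n in fragment_lengths if n < nucleosome_bp)
    if free == 0:
        return None
    return mono / free


def flag_outliers(signal_by_sample, k=3.0):
    """Flag samples whose signal deviates from the median by more than k * MAD.

    Hypothetical criterion for illustration; not the rule used in the study.
    """
    values = list(signal_by_sample.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return set()
    return {s for s, v in signal_by_sample.items() if abs(v - center) > k * mad}
```

Fragments longer than 294 bp are ignored by this definition, so the signal compares only the mononucleosome and nucleosome-free fractions of each library.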
The pattern analysis was done using several libraries from the bulk PBMC vs aging dataset6. Since the libraries are paired end, R1 and R2 were selectively swapped based on G vs C-frequency to effectively align them. A custom R script was used to estimate frequencies at each position.
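The read orientation and per-position frequency estimate can be sketched in Python as follows. The original analysis used a custom R script that is not reproduced here; the swap criterion below (more G than C in R1) and the function names are assumptions chosen only to illustrate the idea.

```python
from collections import Counter


def orient_pair(r1, r2):
    """Return the pair as (C-rich read, G-rich read).

    Assumed criterion: if R1 has more G than C, the mates are swapped so that
    the CCCTAA-like strand always comes first.
    """
    if r1.count("G") > r1.count("C"):
        return r2, r1
    return r1, r2


def position_frequencies(reads, alphabet="ACGT"):
    """Per-position base frequencies over a list of reads aligned at their 5' end."""
    length = max((len(r) for r in reads), default=0)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for pos, base in enumerate(read):
            if base in alphabet:  # skip N and other ambiguity codes
                counts[pos][base] += 1
    freqs = []
    for counter in counts:
        total = sum(counter.values())
        freqs.append({b: (counter[b] / total if total else 0.0) for b in alphabet})
    return freqs
```

Because every read starts at a Tn5 cut site, a sharp periodic pattern in these frequencies indicates a preferred cut position relative to the repeat unit, while a pattern that fades with distance is expected when the repeats are imperfect.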
Retrieval of existing single-cell data
Suitable datasets for reanalysis were found from the 10x Genomics publication database. Datasets on NCBI SRA were inspected using the online preview to check availability of reads. In some cases the cell barcodes were missing, making these datasets impossible to use. In some cases the originally submitted data can be obtained through cloud retrieval but we prioritized other datasets whenever possible. In SRA, cell barcode reads are inconsistently denoted as “biological” or “technical”. To ensure that we obtained the cell barcodes, we retrieved the data using the command “fasterq-dump SRRxxx -e 10 -v --include-technical --split-files”.
Furthermore we processed and broadly analyzed the following datasets: Delacher202142 (GSE156112), Jain2021_pbmc43 (E-MTAB-11225, E-MTAB-11226), Jain2021_thymus43 (E-MTAB-9828, E-MTAB-9840), Lyu202144 (GSE183684), Morabito202145 (GSE174367), Sarropoulos202146 (E-MTAB-9765), Satpathy201947 (GSE129785), Taavitsainen202148 (GSE168667), Wimmers202149 (GSE165904), Ziffra202150 (GSE163018) and Zhang202118 (GSE184462).
The Satpathy2019 GEO dataset GSE129785 was not compatible with ArchR. Also, as cell barcodes were missing in the SRA upload, SRA-to-AWS cloud delivery was used. The BAM files were converted back to FASTQ using https://github.com/10XGenomics/bamtofastq/, and then realigned using CellRanger. Similarly, we solved issues with Ziffra2021 and Delacher2021 using the cloud delivery service.
Mapping of ATAC-seq fragments to T2T genome
ATAC-seq reads from our multiome B-cell dataset were mapped to the telomere-to-telomere (T2T) consortium reference5 using STAR 2.7.10a. The genome was built using --limitOutSAMoneReadBytes 10000000 --outFilterMultimapNmax 1000 to aid with centromeric reads, which we expect to be multimapping. We also qualitatively checked for centromeric reads by searching for sequences that they may contain.
Common analysis of telomeres and single cell data
Unless otherwise stated, all 10x single-cell ATAC-seq data was aligned using cellranger-atac-2.0.0. Multiome RNA+ATAC-seq datasets were aligned with cellranger-arc 2.0.0. Cell types were annotated according to markers from each respective source article, or annotation files as described for each dataset below.
Telomemore then scanned the aligned BAM-files for reads having more than 3 telomere CCCTAA-motifs. This number was empirically chosen based on a histogram of motif counts across reads, but can also be motivated mathematically. The probability of detecting N telomere motifs in a random M bp sequence is approximately:
$$P[N \text{ telomere motifs in one } M \text{ bp read}] \approx P[\text{one motif}]^{N} \cdot \text{NumberOfPossibleMotifLocations} \cdot P[\text{no motifs in remaining part of } M \text{ bp read}] \approx 0.25^{6N} \cdot \prod_{i=1}^{N} (M-4-i)$$
Thus if more than 3 telomere motifs are required for a 50 bp read to be classified as telomeric, this results in a false positive rate of less than $2 \times 10^{-8}$, or about 1 in 50 million reads.
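As a check of the arithmetic, the approximation can be evaluated directly. The Python sketch below is not the Telomemore code (which is a separate tool); the function names are hypothetical, and the classifier counts only the forward CCCTAA motif, as in the text.

```python
from math import prod

MOTIF = "CCCTAA"


def count_motifs(read, motif=MOTIF):
    """Number of occurrences of the telomere motif in a read."""
    return read.upper().count(motif)


def is_telomeric(read, min_motifs=4):
    """Classify a read as telomeric if it has more than 3 motifs (i.e. at least 4)."""
    return count_motifs(read) >= min_motifs


def false_positive_rate(n_motifs, read_length, motif_length=6):
    """Approximate chance that a random read contains n_motifs motifs.

    Implements 0.25^(6N) * prod_{i=1..N} (M - 4 - i), treating the probability
    of no further motifs in the rest of the read as 1.
    """
    p_motifs = 0.25 ** (motif_length * n_motifs)
    placements = prod(read_length - 4 - i for i in range(1, n_motifs + 1))
    return p_motifs * placements
```

For N = 4 and M = 50 this gives 0.25^24 × (45 × 44 × 43 × 42) ≈ 1.3 × 10^-8, consistent with the stated bound of 2 × 10^-8.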
A study has shown that the nucleosome signal around CTCF sites specifically correlates with mitosis51. We attempted to generate such a measure by extracting peaks having CTCF motifs in them, and then calculating the nucleosome signal. We however failed to find a consistent correlation with the cell cycle using this approach. We also correlated the “whole-cell” nucleosome signal to nTA, but obtained both positive, neutral and negative correlations depending on the dataset. We thus conclude that nTA might be linked to the nucleosome signal, but in a complex manner that we do not understand. In particular, there is no noticeable nucleosome signal variation across the GC B cells.
Specific analysis of 10x multiome PBMC data
We obtained the 10x multiome PBMC demo datasets from www.10xgenomics.com (“Fresh Frozen Lymph Node with B-cell Lymphoma (14k sorted nuclei)”, “10k Human PBMCs, Multiome v1.0, Chromium Controller”, “10k Human PBMCs, Multiome v1.0, Chromium X”). Cell types were predicted using SingleR52, based on the DICE53, HPCA54 and Monaco55 reference datasets. Leiden clustering was applied and clusters named according to reference datasets. The nTA-gene correlations are based on the clusters referenced in the text.
Specific analysis of Zhang2021 chromatin accessibility atlas, and chromvar analysis
This dataset further depends on reuse of a pancreas dataset, GSE16047256, which we included. Privacy protected samples (human heart samples on dbGaP: phs001961, human islet samples on dbGaP: phs002204) were not included. Because the sciATACseq cell barcodes reside in the name of the sequencing read, and NCBI SRA strips the read names in their upload, we had to use GEO cloud delivery. Due to the size of the data, we did not perform alignment and de novo cluster assignment. Instead, existing cell type annotations were downloaded from https://data.mendeley.com/datasets/yv4fzv6cnm/1 (1B_Cell_metadata.tsv.gz).
The correlation between gene expression and telomere content was obtained from the supplementary file https://cdn.elifesciences.org/articles/66198/elife-66198-supp2-v2.xlsx57.
To find motifs linked to telomere accessibility, a linear model was set up using Limma58 with models as shown in the main figure. Motif scores were calculated using ChromVar2 through ArchR. CISBP was used as the motif database (http://cisbp.ccbr.utoronto.ca/)59.
Generation of tonsillar B cell multiome data
The research was carried out according to The Code of Ethics of the World Medical Association (Declaration of Helsinki). An ethical permit was obtained from the Swedish Ethical Review Authority (No: 2016/53-31) and all samples were collected after receiving informed consent from the patient or the patient’s guardian. Briefly, tonsillar cell suspensions were prepared by homogenizing the tissue in RPMI-1640 medium and passing it through a 70 μm cell strainer. Red blood cells were lysed using BD PharmLyse lysis buffer according to the manufacturer’s instructions. All cell suspensions were frozen in fetal bovine serum (FBS) (Gibco) with 10% DMSO and stored in liquid N2.
B cells from tonsils of 5 individuals were enriched using negative selection magnetic beads (EasySep Human B Cell Isolation kit, Stemcell technologies, #17954). They were then pooled to avoid batch effects, with an average viability per donor of 92% (S.D. 1.5%).
Enriched B cells were washed in ice-cold ATAC-seq resuspension buffer (RSB, 10 mM Tris pH 7.4, 10 mM NaCl, 3 mM MgCl2), spun down, and resuspended in 100 μL ATAC-seq lysis buffer (RSB plus 0.1% NP-40 and 0.1% Tween-20 (Thermo Fisher)). Lysis was allowed to proceed on ice for 5 min, then 900 μL RSB was added before spinning down again and resuspending in 50 μL 1X Nuclei Resuspension Buffer (10x Genomics). To assess nuclei purity and integrity after lysis, nuclei were stained with Trypan Blue (15250061, Thermo Fisher Scientific), and DAPI (D1306, Thermo Fisher Scientific), according to manufacturer recommendation. If necessary, cell concentrations were adjusted to equal ratios (per donor) prior to starting single-cell GEM emulsion droplet generation with the ATAC-seq NextGEM kit (10x Genomics). Briefly, nuclei were incubated in a transposition mix. Transposed nuclei were then loaded into a Chromium Next GEM Chip J. 9,000 nuclei were loaded per lane, with a target recovery of 5,500 (doublet rate 4%-4.8%). After GEM emulsion generation, reverse transcription, cDNA amplification and multiome sc library generation were performed according to manufacturer specifications (Chromium Next GEM Single Cell Multiome ATAC + Gene Expression, CG000338 Rev A, 10x Genomics, September 2020).
Analysis of tonsillar B cell multiome data
Library reads were aligned and aggregated using CellRanger ARC 2.0.0. The classification of TCRs was done using TRUST460. The SNPs of the cells were extracted using cellSNP61, and assigned to donors using Vireo62. No batch effects between donors were noticed. Single-cell RNA-seq analysis was done using Seurat43, and single-cell ATAC-seq analysis using Signac63. MACS2 was used for peak calling64. The Vireo doublet score was used to filter out droplets. Cell types were predicted using SingleR52, based on the HPCA54 and Monaco55 reference datasets. Furthermore, labels were transferred using Seurat from a previous single-cell RNA-seq-only tonsillar B cell dataset65, and compared. Major axes (GC/non-GC and naive vs switched) aligned but precise clusters did not align satisfactorily. The dataset clusters however had similar topology if annotated based on the same marker genes. We did not find discrete clusters corresponding to “activated” or “FCRL3hi”. Unwanted cells (T cells, dendritic cells) were removed and ignored in this study. Dimensional reductions were done using UMAP66. In particular, CXCR4hi has been used as a marker for the dark zone (DZ), where proliferation happens. However, CXCR4 expression increases during the cell cycle, and the dark-light zone axis is a result of cell migration in which cells compete for the ligand CXCL1267.
All highly variable genes and motif activities were correlated with nTA using Spearman’s method.
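For readers who want to reproduce this kind of ranking on the provided per-cell nTA values, Spearman's correlation is the Pearson correlation of the ranks. The Python sketch below uses only the standard library; the function names and the dictionary-of-lists input layout are assumptions, and in practice a library routine such as the one in SciPy would normally be used instead.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlate_features_with_nta(features, nta):
    """Return (feature, rho) pairs sorted from most negative to most positive.

    `features` maps a gene or motif name to its per-cell values, in the same
    cell order as `nta`.
    """
    pairs = [(name, spearman(vals, nta)) for name, vals in features.items()]
    return sorted(pairs, key=lambda item: item[1])
```

A rank-based measure is a natural choice here because nTA is itself sparse and skewed at the single-cell level, so monotonic trends matter more than linear ones.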
Because it has been reported that GC B cells, compared to naive and memory, are higher in TERT and have on average longer telomeres68, we also investigated this alternative explanation for AICDAhi cells being nTAhi. TERT is however upregulated in rather separate S-phase cells without effect on nTA. Telomeres have also been found to be 1.4kb longer in naive T cells over memory T cells69; however, assuming an analogy for B cells, we cannot see such a trend on nTA in our data.
We speculated if we could see differences in accessibility over the Ig region, across different GC B cell clusters - especially due to free-floating spliced Ig gDNA. Pileups over the region did however not show any significant differences.
Supplemental files
Supplemental File #1. Correlations of genes and motifs in the multiome PBMC dataset to nTA
Supplemental File #2. Correlations of motifs in the Zhang2021 atlas dataset to nTA
Data availability
The tonsillar multiome B cell dataset sequencing data has been uploaded to ArrayExpress (E-MTAB-12632). The Telomemore tool is available at Github, https://github.com/henriksson-lab/telomemore.java. Furthermore, scripts to reanalyze the various datasets, and precomputed abundances, are available at https://github.com/henriksson-lab/telomemore_supplemental. The human tonsil B cell data can be viewed interactively at http://data.henlab.org/.
Author contributions
W.R. conceived the name and implemented the first version of Telomemore. A.D. and R.G. contributed material and helped with flow cytometry. I.S.M. performed the single-cell data generation. J.T., M.F. and J.H. supervised. J.H. conceived the study, the algorithm, and wrote the final implementation. All authors have read and agreed to the published version of the manuscript.
Funding
The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at UPPMAX partially funded by the Swedish Research Council through grant agreement no. 2018-05973, under Project SNIC 2021/22-697, SNIC 2021/6-328 and SNIC 2022-5-18. J.H. is supported by Vetenskapsrådet grant number #2021-06602.
Conflicts of Interest
J.T. is employed at Sartorius. I.S.M is employed at Umeå university but partially funded by Sartorius. Other authors declare no conflict of interest.
Acknowledgements
We thank Matthew Weirauch for excellent help with the CIS-BP database; Magnus Hultdin, Sofie Degerman and Pär Larsson (Umeå university) for discussions about telomeres; and Xi Chen (SUSTech) for discussions about ATAC-seq. The analysis was inspired by a collaboration with Oliver Billker and Ágnes Regos on malaria ATAC-seq. Filippe Seiz De Filippi did a first attempt at annotating the B cells.