Tn5 transposase-based epigenomic profiling methods are prone to open chromatin bias

Epigenetic studies of rare biological samples like mammalian oocytes and preimplantation embryos require low input or even single cell epigenomic profiling methods. To reduce sample loss and avoid inefficient immunoprecipitation, several chromatin immuno-cleavage-based methods using Tn5 transposase fused with Protein A/G have been developed to profile histone modifications and transcription factor bindings using small number of cells. The Tn5 transposase-based epigenomic profiling methods are featured with simple library construction steps in the same tube, by taking advantage of Tn5 transposase’s capability of simultaneous DNA fragmentation and adaptor ligation. However, the Tn5 transposase prefers to cut open chromatin regions. Our comparative analysis shows that Tn5 transposase-based profiling methods are prone to open chromatin bias. The high false positive signals due to biased cleavage in open chromatin could cause misinterpretation of signal distributions and dynamics. Rigorous validation is needed when employing and interpreting results from Tn5 transposase-based epigenomic profiling methods.


Introduction
Due to the sample loss and inefficient immunoprecipitation of traditional chromatin immunoprecipitation (ChIP)-based methods, low-input epigenomic profiling methods are needed for studying rare samples such as mammalian oocytes and preimplantation embryos 1 .Several low-input chromatin immunoprecipitation followed by sequencing (ChIP-seq) methods including ULI-NChIP 2 , scChIP-seq 3 and STAR ChIP-seq 4 have been developed.To overcome inefficient immunoprecipitation, immunoprecipitation-free methods, such as CUT&RUN 5 and scChIC-seq 6 that use chromatin immuno-cleavage (ChIC) 7 strategy, have been developed.These low-input methods have been widely used in studying epigenome reprogramming during early embryonic development which have revealed distinct dynamics of different epigenetic markers 4,[8][9][10][11][12] .
Recently, the Tn5 transposase-based 13 library construction is getting popular because it can fragment DNA while simultaneously adding library adaptors thus simplifying experimental procedures and reducing sample loss 14,15 .Tn5 transposase-based epigenomic profiling methods utilize Protein A (or Protein G) fused with Tn5 transposase (pA-Tn5) to cleave DNA at the targets of primary antibody, allowing all procedures to be completed in the same tube without immunoprecipitation step, which largely avoided sample loss.Several Tn5-based methods have been developed to capture histone modifications or transcription factor (TF) binding using small number of cells or even single cell, including CUT&Tag 16 , CoBATCH 17 , ACT-seq 18 , itChIPseq 19 , ChIL-seq 20 and Stacc-seq 21 .However, the Tn5 transposase is known to prefer accessible DNA regions 22 .It has been noted that some of the Tn5-based methods are confounded by DNA accessibility 23 , but no systematic comparative analysis has been done to determine to what extent the results of these methods are affected by chromatin accessibility.Here we present systematic comparative analyses which reveal that, for some of the methods, overall ~30-50% false positive peaks can be contributed by open chromatin artefacts.Such high level of false positive peaks could affect data interpretation leading to false conclusions, which raises concerns on choosing, developing and interpreting results from Tn5-based epigenomic profiling methods.

Tn5-based epigenomic profiling methods have varied level of biases toward open chromatins
CoBATCH 17 , CUT&Tag 16 , ACT-seq 18 and Stacc-seq 21 are very similar methods, which are based on the in situ immuno-cleavage strategy (Fig. 1a).CoBATCH and CUT&Tag add primary antibodies first, then add the pA-Tn5, while ACT-seq and Stacc-seq pre-incubate primary antibodies with pA-Tn5.However, besides cleavage at the target sites, the free pA-Tn5 has the potential to tagment open chromatins.To wash out free pA-Tn5, these methods employ different washing conditions.The CUT&Tag uses high salt (300 mM NaCl) washing to suppress background tagmentation, CoBATCH and ACT-seq utilize milder washing conditions, while the washing step is optional in Stacc-seq.The itChIP-seq, on the other hand, is an immunoprecipitation-based method that utilizes Tn5 to tagment DNA first before antibodies are added to pull-down the target fragments.Thus, theoretically the itChIP-seq method should not be confounded by open chromatins as antibody-based selectivity is applied after tagmentation.
To perform systematic comparative analysis of the Tn5-based epigenomic profiling methods, we collected publicly available H3K27me3 data (Supplementary Table 1) of mouse embryonic stem cells (mESCs) generated by CoBATCH, CUT&Tag, Stacc-seq and itChIP-seq.Since the protocols for ACT-seq and Stacc-seq are almost identical, and the original ACT-seq study did not include H3K27me3, we analyzed the Stacc-seq data with conclusions applicable to ACT-seq.The bulk ChIP-seq of H3K27me3 in mESCs was used as a reference for comparing the H3K27me3 peaks derived from these Tn5-based methods.The two different bulk ChIP-seq datasets 24,25 were highly similar (Supplementary Fig. 1a, b) and the peaks were considered as true positive peaks in mESCs.To determine whether each Tn5-based method was confounded by open chromatin, we asked whether peaks that were not overlapped with bulk ChIP-seq peaks were instead overlapped with open chromatin peaks derived from ATAC-seq in mESCs.The open chromatin revealed by ATAC-seq 26 were highly similar to that revealed by DNase-seq 25 in mESCs (Supplementary Fig. 1c, d).As a control for low-input epigenomic profiling method without using Tn5 transposase, we included the H3K27me3 dataset in mESCs generated by ULI-NChIP 8 .
A genome browser view around the Hoxb locus comparing the signals of Tn5-based methods with those of bulk ChIP-seq, ULI-NChIP, and open chromatin (ATAC-seq and DNase-seq) (Fig. 1b) revealed: 1) CoBATCH, CUT&Tag, and Stacc-seq detected H3K27me3 peaks not present in ChIP-seq or ULI-NChIP but overlapping with ATAC-seq and DNase-seq peaks (shaded); 2) For peaks overlapping with ChIP-seq, the peak patterns were more similar to ATAC-seq and DNaseseq rather than the ChIP-seq; 3) itChIP-seq showed the most similar pattern to that of the ChIPseq in this region, which was coincident with the fact that the immunoprecipitation-based itChIPseq procedure is different from the other pA-Tn5 immuno-cleavage-based methods (Fig. 1a).
The above observation raised the possibility that at least some of the Tn5-based methods may be biased toward open chromatin to generate false positive peaks.To explore this possibility, we analyzed the overall signal distribution of these methods by first focusing on the transcription start sites (TSS) of all coding genes.An analysis of the ChIP-seq datasets indicated that the H3K27me3 signals were enriched in the TSSs of a subset of genes consisted of mainly the Polycomb-group (PcG) targets, but with the majority of the genes, mostly of non-PcG targets, lack the H3K27me3 signals around their TSSs (Fig. 1c).However, the CoBATCH and Stacc-seq methods detected H3K27me3 enrichment at almost all the TSS regions including the non-PcG targets, which were more similar to the open chromatin patterns detected by ATAC-seq (Fig. 1c).
The CUT&Tag method detected a weak signal enrichment at the TSSs without H3K27me3 ChIPseq signals.The itChIP-seq and ULI-NChIP methods detected a pattern more similar to that of ChIP-seq although their signals were generally weaker (Fig. 1c).itChIP-seq is not an immunoprecipitation-free method, thus its signals need to be normalized by input control, similar to ChIP-seq and ULI-NChIP.For the immunoprecipitation-free methods, input DNA or IgG control is usually not needed for signal normalization.Nevertheless, we tested whether the confounding of open chromatin signals could be eliminated by using input/IgG control.Using the publicly available input/IgG controls for CoBATCH and Stacc-seq (no input/IgG control for CUT&Tag), we recalculated the H2K27me3 enrichment and found that normalizing with input/IgG did not improve the CoBATCH results (Fig. 1c).On the other hand, this normalization did enhance the signals of Stacc-seq overlapping with ChIP-seq, while reduced the signals not overlapping with ChIP-seq (Fig. 1c).However, the IgG control normalized Stacc-seq H3K27me3 profile was still more similar to the ATAC-seq profile than that of the H3K27me3 ChIP-seq at the non-PcG targets (Fig. 1c).

Open chromatin is the source of false positive peaks detected by Tn5-based methods
Next we performed quantitative analysis to determine the level that each of the Tn5-based method is confounded by open chromatin.To this end, we used the same criteria (p-value < 1e-4 and q-value < 0.01) in peak calling for each method.Peaks that overlapped with ChIP-seq peaks were considered as true positives.Peaks that did not overlap with ChIP-seq peaks were considered as potential false positive signals, and were further analyzed to determine whether they could be mapped to open chromatin (Fig. 2).We found 5,189 out of the 9,125 CoBATCH peaks did not overlap with ChIP-seq peaks, but were mapped to open chromatin with strong correlation to ATAC-seq signals (Fig. 2a).Indeed, 82.3% (4,270 out of 5,189) of the CoBATCH peaks that did not overlap with ChIP-seq peaks were overlapped with ATAC-seq peaks (Fig. 2b).
Peaks derived from CoBATCH normalized by IgG control showed even more false positives and ATAC-seq signals also enriched in these false positive peaks (Supplementary Fig. 2a).A similar analysis of the CUT&Tag dataset revealed 1,387 out of the 6,805 peaks were not overlapped with ChIP-seq peaks (Fig. 2c).Of these non-overlapping peaks, 24.2% (335 out of 1,387) were overlapped with ATAC-seq peaks (Fig. 2d).For Stacc-seq, 1,687 out of the 6,190 peaks were not overlapped with ChIP-seq peaks, but showed strong open chromatin signals (Fig. 2e).Of these non-overlapping peaks, 75.6% (1,275 out of 1,687) were overlapped with ATAC-seq peaks (Fig. 2f).Peaks derived from Stacc-seq normalized by IgG control still showed open chromatin enrichment for the false positive peaks (Supplementary Fig. 2b).On the other hand, the itChIPseq peaks that showed no overlap with ChIP-seq peaks also did not overlap with ATAC-seq peaks (Fig. 2g, h, Supplementary Fig. 2c).For ULI-NChIP, although about half of the peak regions showed no ChIP-seq signals, no ATAC-seq signals were detected in these nonoverlapping regions (Fig. 2i, j

High false positive rate due to open chromatin affected global distribution of peaks
To determine the relative reliability of the different Tn5-based epigenomic profiling methods, we next calculated the false positive rate (FPR) caused by the Tn5 bias toward open chromatins.The FPR is calculated by the number of peaks not overlapping with ChIP-seq but overlapping with ATAC-seq peaks, divided with the total number of peaks (Fig. 3a, Supplementary Fig. 3).The FPR for CoBATCH was as high as 46.8-54.3%, in contrast the FPR for CUT&Tag was 4.9-5.8%.
For Stacc-seq, its FPR was 20.6-35.9%.The itChIP-seq and ULI-NChIP were almost not affected by open chromatin artefacts.Since the FPRs calculated here only considered the nonoverlapping peaks, and because the overlapping peaks could also be generated from open chromatin, instead of real H3K27me3 peaks as exemplified in Fig. 1b, the FPRs presented here represented the lower limit.To calculate the FPR without a fixed p-value or q-value cutoff for the peaks, we assessed the FPRs for top peaks ranked by p-values for each method.Results shown in  1), which had the most publicly available datasets for different methods besides H3K4me3.For multiple replicates with the same or different number of cells for each method, we used the one with the best signal-to-noise ratio as the representative result for each method.
Peak comparison.Peaks were compared using bedtools 30 intersect (version 2.29.2).Peaks with at least half-length intersecting with ChIP-seq / ATAC-seq / DNase-seq peaks were considered as overlapping (bedtools intersect parameters for getting overlapping peaks: -u -f 0.5; parameters for getting non-overlapping peaks: -v -f 0.5).The signal enrichment heatmaps were plotted using deeptools 31 (version 3.5.0)computeMatrix and plotHeatmap.The TSSs of coding genes in the mouse genome were from GENCODE 32 mouse gene set M24.The genome browser snapshot was generated with R package karyoploteR 33 (version 1.18.0).The Pearson correlation and clustering analysis were performed using deeptools multiBigwigSummary and plotCorrelation with 5kb bin size and outliers removed.
into two groups that with or without H3K27me3 ChIP-seq signals (Fig. 1d).The CoBATCH and Stacc-seq detected signals exhibit a clear enrichment at the open chromatin regions without ChIP-seq signals, and input/IgG control normalization did not change the situation.The CUT&Tag method detected weak signals, while itChIP-seq and ULI-NChIP detected no signals at the open chromatin regions without ChIP-seq signals (Fig. 1d).These results indicate that some of the Tn5-based methods, particularly the CoBATCH and Stacc-seq, are biased toward open chromatin peaks that do not have H3K27me3 ChIP-seq signals.
). Collectively, these results indicate that the great majority of the non-overlapping peaks detected by CoBATCH or Stacc-seq, and some of the non-overlapping peaks detected by CUT&Tag are mapped to open chromatin regions, and they could be false positive artefacts.

Fig. 3b indicated
Fig. 3b indicated that most of the top peaks in CoBATCH represented open chromatin signals.Interestingly, while replicate 1 of the Stacc-seq showed lower FPR for the top peaks but the FPR gradually increased with more peaks, replicate 2 of Stacc-seq showed a relatively consistent FPR (Fig. 3b).Consistent with the high FPR, clustering analysis indicated that CoBATCH and Staccseq globally resembled open chromatin signals more closely than the H3K27me3 signals (Fig. 3c).These results indicate that CoBATCH and Stacc-seq have high false positive rates and thus great care should be taken in interpreting the data generated by these two methods.

Fig. 3 |
Fig. 3 | False positive rates of Tn5-based methods due to open chromatin a, Overall false positive rate (FPR) due to open chromatin (measured by ATAC-seq) artefacts for each method.The number of cells (#cell) used for each library was indicated below each bar.b, False positive rate due to open chromatin (measured by ATAC-seq) artefacts for the top peaks in each method.c, Clustering of global H3K27me3 signals of each method with ATAC-seq and H3K27me3 ChIPseq based on the Pearson correlation between any two methods (The Pearson correlation coefficients were shown in each box; bin size: 5kb).

2 |
Evaluation of peaks called from different methods with input / IgG normalization a, Significant peaks (q-value<0.01)called from CoBATCH with IgG control normalization were compared with open chromatin signals measured by ATAC-seq (n1: CoBATCH peaks overlapping with ChIP-seq peaks; n2: CoBATCH peaks not overlapping with ChIP-seq peaks; C: center of CoBATCH peaks; FC: fold-change over IgG control).b, Significant peaks (q-value<0.01)called from Stacc-seq with IgG control normalization were compared with open chromatin signals measured by ATAC-seq (n1: Stacc-seq peaks overlapping with ChIP-seq peaks; n2: Stacc-seq peaks not overlapping with ChIP-seq peaks; C: center of Stacc-seq peaks; FC: fold-change over IgG control).c, Significant peaks (q-value<0.01)called from itChIP-seq (100 cells) with input control normalization were compared with open chromatin signals measured by ATAC-seq (n1: itChIPseq peaks overlapping with ChIP-seq peaks; n2: itChIP-seq peaks not overlapping with ChIP-seq peaks; C: center of itChIP-seq peaks; FC: fold-change over input control).

3 |
Overall false positive rate (FPR) due to open chromatin (measured by DNase-seq) artefacts for each method.The number of cells (#cell) used for each library was indicated below each bar.
by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity.It is made