Deletion mapping of regulatory elements for GATA3 reveals a distal T helper 2 cell enhancer involved in allergic diseases

The GATA3 gene is essential for T cell differentiation and is surrounded by risk variants for immune traits. Interpretation of these variants is challenging because the regulatory landscape of GATA3 is complex with dozens of potential enhancers spread across a large topological associating domain (TAD) and gene expression quantitative trait locus (eQTL) studies provide limited evidence for variant function. Here, we perform a tiling deletion screen in Jurkat T cells to identify 23 candidate regulatory elements. Using small deletions in primary T helper 2 (Th2) cells, we validate the function of five of these elements, two of which contain risk variants for asthma and allergic diseases. We fine-map genome-wide association study (GWAS) signals in a distal regulatory element, 1 Mb downstream, to identify 14 candidate causal variants. Small deletions spanning candidate rs725861 decrease GATA3 expression in Th2 cells suggesting a causal mechanism for this variant in allergic diseases. Our study demonstrates the power of integrating GWAS signals with deletion mapping and identifies critical regulatory sequences for GATA3.


Main text
T cells orchestrate adaptive immune responses by differentiating into distinct subsets of effector and regulatory T cells. The GATA3 transcription factor (TF) is central to this process and participates in the differentiation of virtually all T cell subsets. For example, high expression of GATA3 drives T helper 2 (Th2) cell differentiation 1 , maintains the identity of regulatory T (Treg) cells 2,3 , and disrupts differentiation of T helper 1 and T helper 9 cells 4,5 . GATA3 has a central role in adaptive immunity and genome-wide association studies (GWAS) have uncovered many nearby genetic variants that are significantly associated with immune-related traits including rheumatoid arthritis 6,7 , multiple sclerosis 8 , type 1 diabetes 9 , asthma and allergic diseases 10,11 . The risk variants for these traits are non-coding and may affect GATA3 expression, however their interpretation is challenging because we lack a deep understanding of GATA3's cis-regulatory landscape. The high density of GWAS hits near GATA3 and its cell-type specific regulation, with different expression profiles in different subsets of T cells, make it an excellent model system for interpreting trait-associated human genetic variation. Here, we integrate functional genomic data with CRISPR-mediated genome deletions to identify regulatory elements for GATA3 and to illuminate the function of genetic variants associated with allergic diseases.
To survey the transcriptional regulation of GATA3, we profiled its expression using a published dataset of sorted immune cells 12 . GATA3 expression is absent in B cells and monocytes, is moderate in naïve CD4 + T cells, and is highest in memory Treg cells and Th2 cells, consistent with its established role in Th2 cell differentiation and maintenance 1,13 (Fig. 1a, b). This motivates our decision to study the regulation of GATA3 specifically in T cells.
To gain a high-level view of the trait-associated genetic variation surrounding GATA3, we examined lead single nucleotide polymorphisms (SNPs) from the GWAS catalog 14 and classified them into three categories based on immune traits that are potentially mediated by T cells: autoimmune and allergic diseases, leukemia and lymphoma, and blood cell traits (Fig. 1c). The lead SNPs for these trait categories are distributed differently across the 2 Mb region surrounding GATA3. Lead SNPs for allergic and autoimmune diseases are located predominantly at GATA3 and downstream with a major cluster 1 Mb away. Most lead SNPs associated with leukemia and lymphoma are at the GATA3 gene. Finally, lead SNPs for blood cell traits are scattered across the entire region. The varied locations of lead SNPs for different traits suggest that different regulatory sequences may be involved in the different trait categories.
Given that active regulatory sequences may physically contact their target genes and are typically in open chromatin, we examined published data from Hi-C performed in activated CD4 + T cells 15 and from the assay for transposase accessible chromatin (ATAC-seq) performed in sorted populations of immune cells 16 . These data reveal that GATA3 is located near the boundary of a large (~1.3 Mb) topologicalassociating domain (TAD) that extends downstream of the gene (Fig. 1d-e). Dozens of accessible chromatin regions are scattered throughout this TAD and could be considered candidate regulatory elements for GATA3. However, this analysis falls short of functionally testing whether these elements regulate GATA3 expression.
To test if lead SNPs for several immune traits are associated with the expression of GATA3, we examined data from gene expression quantitative trait locus (eQTL) studies 12,17,18 . However, the eQTL data provide only weak evidence that GWAS hits affect GATA3 expression in T cells (Supplementary Table 1). One possible explanation is that power to detect eQTLs is limited by the modest effects of common variants on gene expression and by the relatively small sample sizes of eQTL studies in relevant cell types 19 .
We reasoned that small genomic deletions could determine which sequences regulate GATA3 expression and could overcome some of the limitations of eQTL studies. Specifically, deletions can directly test the effect of sequences on GATA3 expression (overcoming ambiguity caused by linkage disequilibrium) and are more likely to have large effects on gene expression than common variants used in eQTL studies. We therefore performed a paired-guide tiling deletion CRISPR/Cas9 screen 20,21 to discover regulatory sequences that control GATA3 expression in T cells. With sufficient deletion efficiency, paired-guide screens can cover much larger genome regions than single guide screens or base-editor screens, which have previously been used to screen selected regulatory elements [22][23][24][25] . Moreover, unlike regulatory mapping with CRISPR interference or CRISPR activation [26][27][28][29][30][31] , deletion-based mapping can detect functional sequences that may be insensitive to epigenetic silencing or activation.
To screen for regulatory sequences, we designed 14,769 pairs of single guide RNAs (sgRNAs) to tile across a 2 Mb genome region centered on GATA3. The guide pairs target sites separated by a median distance of 1043 bp and are intended to introduce small genome deletions that span the two target sites.
The median step size of the deletions is 96 bp, such that each base in the screened region is covered by a median of 8 intended deletions ( Supplementary Fig. 1). We note however, that due to variation in guide efficiency and non-homologous end joining (NHEJ), paired guides introduce spanning deletions only some fraction of the time. Instead, they often introduce small insertions/deletions (indels) at each of the target sites 32,33 . To address this, we estimated the deletion efficiency of our system by performing pairedguide mediated deletions of several 1-2 kb sequences in the 2 Mb survey region and measuring the deletion efficiency by quantitative PCR (qPCR) (Supplementary Fig. 2). We estimate the spanning deletion efficiency to be 20-25% and account for this in our analysis described below.
We performed our paired guide screen in Jurkat T cells. While Jurkat is a T cell leukemia cell line, it is a useful model system because the chromatin landscape surrounding GATA3 mirrors that of primary T cells ( Fig. 1d, Supplementary Fig. 3) and it has been previously used to discover disease-relevant T cell regulatory elements 29 . To perform the screen, we cloned oligos encoding the sgRNAs and spCas9, generated a lentiviral library, and transduced the library into Jurkat T cells (Fig. 2a). We selected for cells with viral genome integration and flow sorted them into pools based on GATA3 protein expression ( Supplementary Fig. 4). Finally, we performed deep sequencing of the sgRNA pairs in each expression pool (Fig. 2a). We conducted four biological replicates of the screen, sorting into three expression level pools in the first two replicates and seven expression level pools in the other two replicates (Fig. 2b).
To examine the data generated by the screen, we estimated the proportion of sgRNA pair counts in each pool and compared the sgRNA pairs targeting GATA3 exons (foreground) to non-targeting control sgRNAs (NTCs) included in the screening library. As expected, sgRNA pairs targeting GATA3 exons are depleted in the high expression pools and enriched in the low expression pools (Fig. 2c). In contrast, sgRNAs targeting sequences outside of the gene (background), are only slightly enriched in the low expression pool compared to NTCs ( Supplementary Fig. 5), which is consistent with a small fraction of them targeting regulatory sequences for the gene. These results indicate that the replicates provide consistent results and that the sgRNA counts can be used to detect sequences that affect GATA3 expression (Fig. 2c).
To discover regulatory sequences that affect GATA3 expression, we jointly analyzed the screen data across all of the pools and replicates using RELICS 34 . RELICS is designed to discover functional sequences (FSs) from tiling CRISPR screens and includes features for modeling programmed deletions.
RELICS can also leverage data from multiple pools to detect FSs that cause smaller changes in gene expression. RELICS has been extensively validated on experimental data and outperforms other tiling CRISPR screen analysis methods 34 . When running RELICS, we used an area of effect model that assumes a spanning deletion efficiency of 25% to account for the variable mutation events generated by NHEJ ( Fig.   2d and Supplementary Fig. 2).
In total, RELICS predicts 23 FSs that affect GATA3 expression in Jurkat T cells, under a log likelihood ratio threshold of 6 (P=5e-4 by likelihood ratio test) (Fig. 2e-f). These FSs are distributed asymmetrically, with all but one located within the TAD that contains GATA3. Most FSs are located within 0.5 Mb of GATA3, with some much farther away, and a substantial fraction of them overlap with H3K27ac peaks in Jurkat cells (Fig. 2f).
A limitation of our paired-guide deletion screen is that it was performed in the Jurkat cell line which is less physiologically relevant than primary cells. We therefore view the FSs predicted by RELICS as candidate regulatory sequences and proceeded to validate a subset of the candidates in primary T cells.
We selected Th2 cells for validation experiments because GATA3 expression is highest in Th2 cells (Fig.   1b), and most candidate regulatory sequences (marked by accessible chromatin or H3K27ac) that are present in other T cell subsets are also present in Th2 cells (Figs. 1d, and 3a). We isolated human naïve CD4 + T cells, differentiated them into Th2 cells 35 , and electroporated them with pairs of spCas9 ribonucleoproteins (RNPs) 36,37 (Fig. 3b). An advantage of small deletions is that they can improve the resolution of RELICS' predictions, which are typically ~1.2kb due to the variable deletion products and efficiencies. As controls, we targeted a "safe harbor" (SH) sequence downstream of GATA3 with no predicted regulatory function for GATA3 (negative control) and targeted a GATA3 exon with a single sgRNA (positive control) (Fig. 3c). We verified the products generated by each deletion experiment using Sanger sequencing and tracking of indels by decomposition (TIDE) (Supplementary Figs. 6 and 7) 38 .
We first targeted the top-ranked prediction, FS1, for validation in Th2 cells. FS1 is located within the third intron of GATA3 and overlaps a strong H3K27ac peak. We designed a pair of sgRNAs to delete a 74 bp sequence from FS1, which contains an ATAC-seq peak in Th2 cells (Fig 3e). Deletion of this sequence reduces GATA3 expression substantially ( Fig. 3c and Supplementary Figs. 7-8) indicating that it acts as an enhancer for GATA3.
FS23 is orthologous to a mouse genome sequence that was previously identified as a strong enhancer for GATA3 in mouse T cells 39 . In addition, this sequence overlaps with ATAC-seq peaks in all T cell subsets and has strong H3K27ac signals in Jurkat cells, naïve CD4 + T cells, and Th2 cells. To test the function of this 'mouse enhancer' sequence, we performed spCas9 RNP experiments to delete a 1011 bp of the sequence. We found that GATA3 expression is reduced compared to SH control experiments, confirming that this sequence acts as an enhancer in human T cells (Fig. 3c).
Finally, we deleted 80 bp of FS10 and 123 bp of FS22. Both deletions decrease GATA3 expression ( Fig.   3c and Supplementary Fig. 8). FS10 is marked by H3K27ac and acts as another classical enhancer in Th2 cells (Fig 3e). FS22 lacks H3K27ac signals, but it does overlap with an ATAC-seq peak (Fig 3e), suggesting that it may function differently from a canonical enhancer.
To determine whether regulatory elements identified in our screen can help interpret GWAS variants, we intersected the FSs with GWAS hits and observed that FS1 and FS14 are located within two distinct clusters of risk variants associated with autoimmune and allergic diseases (Fig. 3a). We examined the region surrounding FS14, which is almost 1 Mb downstream of GATA3, and which has a high density of lead SNPs (Fig. 3a). This region is contained within a broad 44 kb H2K27ac domain, which is present in Th2 cells but not in other T cell subsets, suggesting that it may be a distal Th2-specific enhancer (Fig. 3a). 40 and high GATA3 expression is required for differentiation and maintenance of Th2 cells 1 , we decided to (i) analyze recent GWASs for asthma and allergic diseases (allergic rhinitis or eczema) 10 , and (ii) test the function of sequences containing risk variants in in vitro differentiated Th2 cells.

Because Th2 cells have an established role in allergic diseases
A single GWAS hit for asthma is situated ~1 Mb downstream of GATA3 (Fig. 4a) and a pair of independent GWAS hits for allergic diseases are located at ~400 kb and ~1 Mb downstream of GATA3.
We name these hits risk region 1 and risk region 2 ( Fig. 4a,b) and focus on risk region 1, because it is associated with both traits and overlaps the large Th2-specific enhancer-like sequence described above.
We performed fine-mapping to identify 95% credible sets (CSs) containing candidate causal variants using SuSiE 41 . For asthma, we identified two CSs: CS1, which contains three candidate causal SNPs, and CS2, which contains 11 candidate SNPs (Fig. 4c). For allergic diseases, we identified a single CS containing the same three candidate SNPs as CS1 (Fig. 4d, 5a).
We examined the worldwide allele frequencies of the SNPs with the highest posterior inclusion probabilities (PIP) in CS1 (rs12413578) and CS2 (rs725861). For rs12413578, the risk allele is the major allele, and the protective allele has the highest allele frequencies in populations with European ancestry and the Indian subcontinent (Fig. 5b). In contrast, the risk allele for rs725861 is the minor allele and has the highest frequencies in Western Africa and the Indian subcontinent (Fig. 5b).
To test the function of candidate SNPs in the two CSs, we designed pairs of guides to delete small sequences surrounding 6 of the 14 candidates. Only some of these pairs of guides yielded high-efficiency deletions, but fortunately these deletions encompassed the SNPs with the highest PIP in each CS (Fig 5c). 5d-e and Supplementary Fig. 9). This indicates that this distal sequence functions as an enhancer in Th2 cells and that risk variants for allergic diseases and asthma are likely to affect GATA3 expression in this cell type.
In response to pathogens, naïve CD4 + T cells become activated and differentiate into effector and regulatory T cell types to mount appropriate immune responses. However, incorrect immune responses lead to autoimmune and allergic diseases. Our results show that expression of GATA3, a key regulator of T cell differentiation, is coordinated by a large pool of downstream regulatory sequences and demonstrate the power of high-throughput deletions coupled with computational analysis and validation experiments in primary cells. Sequences detected by this approach can be remarkably useful for the interpretation of trait-associated genetic variation, and we provide evidence that risk variants for allergic diseases affect the function of a Th2-specific GATA3 enhancer.

Antibodies
All antibodies used in the study for fluorescence activated cell sorting, flow cytometry, and cell culture are listed in Supplementary Table 2.

Oligos
All oligos used in this study are listed in Supplementary Table 3. Using this sequence, we extracted a list of all possible candidate guides of length 19-21bp that were adjacent to a spCas9 protospacer-adjacent motif (PAM) (NGG, for forward strand guides and CCN for reverse strand guides). We reduced the set of candidate guides to those with a G in first position, because a leading G provides the most efficient expression from a U6 promoter.
Once a list of candidate guide sequences was established, we computed specificity scores for each guide so that off-target effects could be reduced. We aligned guides to the genome with BWA 42 , allowing up to three mismatches to the reference sequence. Using the genome alignments, we calculated a specificity score for each candidate guide using the method described by Hsu et al. 2013 43 . We then discarded all guides with a specificity score less than 20, which retained enough guides to select reasonably-spaced guide pairs, while reducing off-target effects.
To choose pairs of guides for tiling deletions we defined a minimum step size (S; distance from first target site from previous pair of guides) and deletion size (D; distance between two target sites). When designing each deletion, we chose the pair of guides from our filtered guide list such that the step size and deletion size thresholds were exceeded by the smallest possible amount. We set S to 65 bp and D to 1000 bp, which gave us a total of 14,555 targeting guide pairs with a median size of 1043 bp for D and median step size of 96 bp for S, such that each base in the screened region would be covered by an average of 8 programmed deletions (Supplementary Fig. 1). In addition, we designed 214 non-targeting negative control guide pairs, which were random sequences that did not align to any sequence in the reference human genome.

CRISPR/Cas9 tiling deletion screen
Our screening method was based on the published CREST-seq method 20  side scatter (SSC), then aggregates were excluded using pulse width for FSC and SSC before gating on PE labeling (for Cas9 protein) and lastly, APC signal intensity (GATA3 protein level). Cells were sorted into low, medium, high expression pools of GATA3 for the first 2 replicate screens, and into low, medium-low, medium, medium-high, high-low, high, high-high expression pools of GATA3 for replicate

Tabulating guide pair counts
We discarded all guides that overlapped with deletions, insertions, translocations and inversions identified by whole-genome sequencing of the Jurkat cell line 44 . In total we excluded 1302 guide pairs by these criteria leaving 13,253 targeting guide pairs for analysis.
To assign sequencing reads to guide pairs and count the number of guide occurrences in each pool, we mapped the reads to the guide library using BWA-MEM 45 . We first created a custom 'reference genome' library that contained the complete set of designed guide pairs. Each guide pair in the reference library included flanking 5' and 3' sequences from the viral vector. After aligning guides to the custom reference, we retained only those guide pairs with at least 60 matches (as defined by the CIGAR string). Aligned reads where the guide pairs were observed to be recombined 46 were discarded.

RELICS analysis of guide pair counts
We used RELICS (https://github.com/patfiaux/RELICS) 34 to jointly analyze the guide pair counts across all pools and replicates (24 pools total). RELICS leverages a set of known functional sequences, labeled FS0, to estimate sorting parameters, which describe the probability that cells containing different guides will be sorted into each pool. We labeled the sequences corresponding to GATA3 exons as FS0. Based on the definition of FS0, RELICS divides all guide pairs into two sets: those predicted to generate deletions that overlap FS0 (foreground), and all other targeting guide pairs (background). Using RELICS, we obtained maximum likelihood estimates of sorting parameters for foreground, background and non-targeting control guides for each pool in each replicate. We plot the log2 ratio of foreground/non-targeting control sorting probabilities in  RELICS requires that the user specify a maximum number of FSs, and we set this value to 60.
RELICS models the dispersion of observed guide counts as a function of the total count of a guide across pools using fitted splines, with a user-specified number of degrees of freedom. We set the the degrees of freedom to 2, which yielded a good spline fit for all replicates (Supplementary Fig. 10).
For paired guide deletion screens, RELICS allows for the possibility that small insertions/deletions (indels) at each target site may occur instead of the intended deletion between the target sites. In addition, RELICS allows the indels to occur with a probability that varies with distance from the target sites. We specified that the large deletions occur with relatively low efficiency (probability 25%), and an area of effect for indels that followed a normal distribution with standard deviation of 8.5 bp. These parameters were set based on published experimental data regarding the mutational events that result from the use of spCas9 and paired-guides 32,47,48 .
Peak calling for ChIP-seq and ATAC datasets For Jurkat cells, ChIP-seq data for H3K27ac was obtained from Hnisz et al. 2016 49 . To process the sequencing data, the adaptors of the reads were trimmed and the quality of the reads were accessed using trim_galore (v 0.6.6), a wrapper for cutadapt (v 1.18) 50 and fastqc (0.11.9) (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Reads were aligned using BWA (v 0.7.17) 42 using with the bwa mem command and default arguments. The alignments were sorted and alignments with MAPQ score less than 30 were removed using samtools (v 1.12) 51 . Duplicate reads were removed using picard (v 2.25.4) (https://broadinstitute.github.io/picard/). For in vitro differentiated Th2 and iTreg cells, ChIP-seq data for H3K27ac were obtained from Soskic et al. 2019 52 . We downloaded the alignment data in CRAM format, which we converted to BAM. We then merged BAM files from three donors based on cell type, activation state, and stimulation time before data processing.
CRISPR/Cas9-mediated deletion with synthetic sgRNAs Guides for individual deletion experiments with spCas9 RNPs were designed using FlashFry 55 , CRISPR-SE 56 , and GuideScan 57 . Single guide RNAs (sgRNAs) with predicted high specificity scores and low offtargeting were selected to generate small deletions (<130bp) at target regions. sgRNAs were synthesized by IDT with modifications on the 5' and 3'ends to enhance efficiency 37 . For example, for an sgRNA oligo with target sequence GCACTAAATCATTCACTGGG, the final synthesized sgRNA is mG*mC*mA*rCrUrArArArUrCrArUrUrCrArCrUrGrGrGrGrUrUrUrArArGrArGrCrUrArUrGrCrUrGrG rArArArCrArGrCrArUrArGrCrArArGrUrUrUrArArArUrArArGrGrCrUrArGrUrCrCrGrUrUrArUrCrAr ArCrUrUrGrArArArArArGrUrGrGrCrArCrCrGrArGrUrCrGrGrUmG*mC*mU*rU ("r" for a RNA base, "m" for a 2'OMe modified RNA base, "*" for a phosphorothioate bond.). The sequences of the guides targeting safe harbor regions, GATA3 exon, FS1, FS10, FS22, mouse enhancer which contains FS23, rs12413578 and rs725861 are listed in the Supplemental Table 3. Synthetic sgRNAs were reconstituted in nuclease-free water and stored at -80°C with small aliquots.
To assemble the dual spCas9 RNP complexes, 90 pmole of each sgRNA was mixed with 5 μg

GWAS eQTL intersection analysis
Summary statistics for genome-wide significant (P < 5e-8) lead SNPs within 1 MB of GATA3 were downloaded from the GWAS catalog 14 . We selected a subset of studies to represent a spectrum of immune disease traits (multiple sclerosis 8 , rheumatoid arthritis 59 , type 1 diabetes 9 , allergic disease and asthma 10 ), choosing the GWAS with the largest sample size when there were multiple hits for the same trait. We then intersected the lead SNPs from these studies with uniformly processed immune cell eQTL data from BLUEPRINT 60 and DICE 12 downloaded from the eQTL catalog 18 . We report P-values for 'gene expression' eQTL associations for each GWAS lead SNP in Supplementary Table S1.