Variation in PU.1 binding and chromatin looping at neutrophil enhancers influences autoimmune disease susceptibility

Neutrophils play fundamental roles in innate inflammatory response, shape adaptive immunity1, and have been identified as a potentially causal cell type underpinning genetic associations with immune system traits and diseases2,3 The majority of these variants are non-coding and the underlying mechanisms are not fully understood. Here, we profiled the binding of one of the principal myeloid transcriptional regulators, PU.1, in primary neutrophils across nearly a hundred volunteers, and elucidate the coordinated genetic effects of PU.1 binding variation, local chromatin state, promoter-enhancer interactions and gene expression. We show that PU.1 binding and the associated chain of molecular changes underlie genetically-driven differences in cell count and autoimmune disease susceptibility. Our results advance interpretation for genetic loci associated with neutrophil biology and immune disease.


Results
Non-coding DNA sequence variation affects chromatin state and gene expression within human populations, and accounts for the majority of complex genetic traits and disease associations [4][5][6][7] . The commonly accepted model of genetic control of transcriptional activity postulates that genetic variation modifies the DNA recognition sequences of specific transcription factors (TFs), thus altering their ability to bind to DNA at a specific locus [8][9][10][11][12][13][14][15] . PU.1 (encoded by Spi1) is a key TF regulating myeloid development [16][17][18] , and its deficiency has profound effects on neutrophil maturation and function 19,20 . To study genetically determined variation in PU.1 recruitment to DNA, we used chromatin immunoprecipitation sequencing  Table 2). Lead PU.1 tfQTL SNPs showed a bimodal distribution of distances to their respective differential binding peaks (Figure 1b, Supplementary Table 3), with just over half of them (55%, 1,036/1,868) mapping proximally from the peak edge (<2.5kb; median distance 264bp), and the remaining SNPs (45%; 995/1,868) localising more distally (2.5Kb-1Mb, median distance 23Kb) 21,22 . As shown for other cell types 22 , tfQTL effect sizes were stronger for proximal compared to distal variants (t-test p=2.2x10 -16 , Figure 1c). We further validated a subset of the detected tfQTLs using allele-specific association analysis 23 (Online Methods), which confirmed a significant allelic imbalance for the majority of the tested peaks (98.8% and 95.5% for peaks associated with proximal and distal variants respectively; Figure 1d).
Binding of pioneer TFs to DNA alters local nucleosome positioning, thus allowing recruitment of activating co-factors 24 . However, DNA recognition sequence alone is not sufficient to establish occupancy, and secondary collaborating factors are required to maintain affinity 25 . C/EBPβ is upregulated throughout neutrophil terminal differentiation 18 and has been shown to co-occupy myeloid enhancers at thousands of PU.1 bound sites 26,27 . The constitutively expressed CTCF is known to play a role in gene regulation by anchoring chromatin interactions 28 , but is not known to functionally associate with PU.1. We assayed these two additional TFs in neutrophils from a subset of overlapping individuals (n=22 donors with QC-pass assays for C/EBPβ, n=30 for CTCF), and identified 18,862 C/EBPβ and 22,197 CTCF filtered peaks from the combined datasets. We performed QTL analysis as before, and prioritised 427 C/EBPβ and 769 CTCF putative tfQTLs reaching a nominal p-value threshold (p≤1x10 -5 ; Supplementary Table 2). We found that C/EBPβ tfQTLs effect sizes decreased with increasing distance from PU.1 tfQTLs (Figure 2a-b), reflecting cooperative binding of PU.1 and C/EBPβ at myeloid enhancers 26,29 . Interestingly, CTCF tfQTLs displaying a shared genetic effect with PU.1 predominantly involved CTCF-bound regions located distally to the PU.1 tfQTL lead SNP, suggesting that PU.1 QTL genetic effects may be in part mediated by the 3D chromosomal architecture (Figure 2c). Transcription factor occupancy has been shown to act predominantly through cis regulatory SNPs, where coordination of cis-acting variants has been shown to decay with increasing physical distance of SNPs from bound regions 30 . To assess the potential sharing of our tfQTLs across cell types, we additionally generated PU.1 binding maps in primary monocytes (CD14 + CD16 − ) isolated from ten BLUEPRINT donors 7 , five of which overlap with the neutrophil PU.1 dataset. Of the neutrophil PU.1 peaks implicating a tfQTL, 93% were also observed in monocytes (Supplementary Figure 2a). The low number of donors tested did not allow us to carry out a tfQTL analysis in monocytes. To assess coordination of genetic effects at PU.1 binding sites across cell types, we therefore assessed the strength of binding at monocyte peaks for individuals stratified by PU.1 tfQTL lead variant genotype. We found the monocytes displayed consistent direction and strength of binding at proximal SNPs (linear regression p=3x10 -9 ) compared to neutrophils (p=2x10 -13 ), compatible with shared genetic effects between the two cell types. However, the same was not true for distal SNPs (neutrophils p=4x10 -7 , monocytes p=0.793; Figure 2d), which may be driven by more complex and cell-type specific long-distance chromatin contacts.
To explore coordination of genetic influences on PU.1 binding and local chromatin state, we initially took advantage of the previously published histone associated QTL (hQTL) data for the enhancer-associated histone marks H3K4me1 and H3K27ac in neutrophils 7 . In total, 808 H3K4me1 and 946 H3K27ac lead hQTL SNPs overlapped (r 2 ≥0.8) PU.1 tfQTLs. We next generated binding profiles of the active promoter-associated histone mark H3K4me3 and Polycomb-associated repressive mark H3K27me3 in neutrophils (n=110 and n=109 donors, respectively) identifying 621 and 367 shared tfQTL/hQTLs, respectively (Supplementary Table   2). Using the pi1 statistic 31 , we found evidence of sharing between PU.1 tfQTLs with hQTLs in both neutrophils and monocytes (pi1 H3K27ac =0.73-0.76, and pi1 H3K4me1 =0.76-0.80). Sharing between neutrophil PU.1 tfQTLs and hQTLs detected in CD4 naïve T cells was lower We next assessed the distance between the PU.1 and histone mark peaks for each shared tfQTL-hQTL genetic association. As previously observed 21 , there was a pronounced bimodal distribution of distances between PU.1 binding peaks and the locations of H3K27ac and H3K4me3 marks (Figure 3a), with around a half of PU.1 peaks localizing to less than 1kb away from the respective H3K27ac and H3K4me3 peaks, and others mapping 10-100kb from them. Given that H3K4me3 is associated with active promoters, this observation highlights the potential long-range regulatory effects of PU.1 binding to distal DNA elements on promoter activity, which are commonly mediated by three-dimensional DNA looping interactions 33 . To investigate the role of PU.1 long distance regulation, we generated Capture Hi-C (PCHi-C) profiling in neutrophils and monocytes isolated from three donors each and integrated these data with previously published PCHi-C data for these cell types in three more individuals 34 (Supplementary Figure 4a-b). We detected ~190,000 Promoter Interacting Regions (PIRs) in total across neutrophils and monocytes (CHiCAGO score > 5) 35    To determine the putative target genes underpinning PU.1-mediated disease associations, we integrated PCHi-C and eQTL data 7 in neutrophils. Overall, 27 high confidence target genes at QTL loci colocalised with GWAS summary statistics (Supplementary Table 6). Interestingly, 35% of the shared tfQTL / GWAS SNPs could be attributed to a proximal tfQTL at a PU.1 binding site. This finding suggests that many of PU.1 tfQTL themselves are under distal genetic control potentially mediated through enhancerenhancer interactions [53][54][55] . One such example is the rs791357_C variant associated with decreased neutrophil and monocyte cell counts. PCHi-C data shows that this region is highly connected to the CPEB4 gene in both neutrophils and monocytes ( Figure 5b). CPEB4 is a cytoplasmic polyadenylation element which binds to recognition sequence in PolyA tail of mRNAs and can activate or inhibit translation 56 . CPEB4 is involved in controlling terminal differentiation in erythroid cells 57 and the proliferation of some cancers 58 . The SNP rs791357 is a proximal tfQTL for PU.1 (p=9.05x10 -21 ) and C/EBPβ (p=1.963x10 -9 ), an hQTL for H3K4me3 (p=1.98x10 -17 ) and H3K27ac (p=1.41x10 -33 ) and an eQTL for CPEB4 (p=1. In conclusion, our analysis suggests that genetically-determined variation in PU.1 binding in neutrophils modulates gene expression, acting via changes in the local chromatin state and, at least in some cases, in the patterns of promoter-enhancer interactions. We show that these effects underpin the genetic associations for a number of important human blood cell traits and diseases, confirming the role of PU.1 in neutrophil biology and implicating this cell type as a potentially causal for a number of autoimmune traits.

Competing Interests statement
The authors declare no competing interests.

Peripheral adult blood collection
ChIP-seq data generated in this study used donor samples which were collected as part of the previously described study 7 . Blood was obtained from donors who were members of the NIHR Cambridge BioResource (http://www.cambridgebioresource.org.uk/) with informed consent (REC 12/EE/0040) at the NHS Blood and Transplant, Cambridge. Donors were on average 55 years old (range 20-75 years old), with 46% of donors being male. A unit of whole blood (475 ml) was collected in 3.2% Sodium Citrate. An aliquot of this sample was collected in EDTA for genomic DNA purification. A full blood count (FBC) for all donors was obtained from an EDTA blood sample, collected in parallel with the whole blood unit, using a Sysmex Haematological analyser. The level of C-reactive protein (CRP), an inflammatory marker, was also measured in the sera of all individuals. All donors used for the collection had FBC and CRP parameters within the normal healthy range. Blood was processed within 4 hours of collection.

Isolation of cell subsets
Samples were as those as described in 7 . To obtain pure samples of 'classical' monocytes (CD14+ CD16-) and neutrophils (CD66b+ CD16+) we implemented a multi-step purification strategy. Whole blood was diluted 1:1 in a buffer of Dulbecco's Phosphate Buffered Saline (PBS, Sigma) containing 13mM sodium citrate tribasic dehydrate (Sigma) and 0.2% human serum albumin (HSA, PAA) and separated using an isotonic Percoll gradient of 1.078 g/ml (Fisher Scientific). Peripheral blood mononuclear cells (PBMCs) were collected and washed twice with buffer, diluted to 25 million cells/ml and separated into two layers, a monocyte rich layer and a lymphocyte rich layer, using a Percoll gradient of 1.066g/ml. Cells from each layer were washed in PBS (13mM sodium citrate and 0.2% HSA) and subsets purified using an antibody/magnetic bead strategy. To purify monocytes, CD16+ cells were depleted from the monocyte rich layer using CD16 microbeads (Miltenyi) according to the manufacturer's instructions. Cells were washed in PBS (13mM sodium citrate and 0.2% HSA) and CD14+ cells were positively selected using CD14 microbeads (Miltenyi). To purify neutrophils, the dense layer of cells from the 1.078 g/ml Percoll separation was lysed twice using an ammonium chloride buffer to remove erythrocytes. The resulting cells (including neutrophils and eosinophils) were washed and neutrophils positively selected using CD16 microbeads and CD66b (BIRMA 17C, IBGRL-NHS) for neutrophils. Purity was on average 95% for monocytes and 98% for neutrophils.

ChIP-sequencing
Purified cells were fixed with 1% formaldehyde (Sigma) at a concentration of approximately 10 million cells/ml. Fixed cell preparations were washed and stored re-suspended in PBS at 4°C prior to lysis and sonication. Sonication protocols were performed in a Diagenode PicoRuptor for 8 cycles of 30 seconds on, 30 seconds off in a 4 o C water cooler. Samples were checked for sonication efficiency using the criteria of 150-500bp, by Agilent DNA bioanalyzer.
ChIP-seq was carried out as previously described 59   Significant peaks were selected to be at 1% FDR or less.

Data Quality
We removed ChIP samples that had a relative strand correlation (RSC) < 0.8 and normalised strand correlation (NSC) < 1.05 61 . We defined high confidence data those from ChIP with RSC > 0.8 and NSC > 1.05. Otherwise, we used genome browser tracks to confirm visually a good ChIP and include it in the final data set. Supplementary Figure 1 and Supplementary
For PU.1, H3K4me3 and H3K27me3 we set the minimum number of samples for a peak to be included in consensus to 3, for C/EBPβ, CTCF and monocyte samples minimum was set to 2.
Sex chromosomes were not included in the QTL analysis. The reference peak set was filtered further for read counts as described below. Next, we generated quantification signal of ChIPseq for each donor. Here we only considered read counts under the peaks, as the regions outside peaks are more likely to be noise or background signal than true enrichment. For each donor, we generated a vector of log2 reads per million (log2RPM) per peak in the reference peak set by counting the number of overlapping reads under the peaks (BEDOPS bedmap -count) and normalised the counts with the total number of reads in the library. We further filtered the reference peak set to only consider peaks with log2RPM > 0 in at least 50% of the donors in a given cell type, corrected for ten PEER factors and applied quantile normalisation across donors. For QTL calling with H3K27me3, two sets of summary statistics are provided on two separate signal matrices. In the first set H3K4me3 peak annotations were used in conjunction with H3K27me3 signal to enrich for poised promoter QTLs. In the second set broad called H3K27me3 peaks were divided into 2500bp windows.

Identification of PU.1 and C/EBPβ differential binding sites
We used DiffBind version 1.12.0 with default EdgeR (3.8.3) option to identify peaks which were differentially bound between neutrophils and monocytes. We used the six best quality samples and their peak sets for this analysis: We selected peaks present in at least three individuals and that had a minimum three-fold difference in binding signal as cut off. Heatmap visualisation of differentially bound regions Deeptools 2 64 .

Transcription factor enrichments
For determining enrichment of ChIP-seq regions of interest within PIRs we used regioneR PIRs that were not shared with neutrophil and monocyte.

Differentially expressed genes and gene expression counts
Gene expression counts and list of differentially expressed genes were available from Ecker et al. 3

QTL mapping
Cis-acting QTL mapping was done using the LIMIX package 66 , available from github (https://github.com/PMBio/limix). We considered genetic variants mapping to within 1 Mb (on each side) of each tested feature (peak), and tested their association using linear regression.
Models were fit on quantile-normalized PEER residuals, also including a random effect term accounting for polygenic signal and sample relatedness (as in the variance component models above we used the realized relatedness matrix to capture sample relatedness). From the linear regression we obtained the effect size and p-value for each tested association. To correct for multiple hypothesis testing, we performed a two-step procedure 67 : first, we corrected for multiple testing across variants for each molecular outcome using Bonferroni correction and, second, we adjusted the obtained p-values for multiple-testing across phenotypes within each layer using a the Q-value procedure 31 , considered QTLs at a significance threshold of 5% FDR.

Promoter Capture HiC (PCHi-C)
Cells were isolated as described 34 . One donor was used for preparing each PCHi-C library. In total, twelve PCHi-C libraries were prepared, six using monocytes and six using neutrophils. HICUP and CHiCAGO Sequencing reads were processed and mapped with HiCUP and PCHi-C interaction was called using CHiCAGO with default parameters 35,69 .

Genotyping check of ChIP-Seq and PCHi-C bams
Identity matching for each sample and for each analysis was performed by extracting genotypes from RNA-seq and ChIP-seq and comparing them to SNPs from the WGS data.
The first stage of verifying the sample identity concordance between the RNA-seq/ChIP-seq and WGS data involved pre-processing the BAM files for one autosomal chromosome (chr1) to remove PCR duplicates and reads with mapping quality score <10. The variants were then called from the resulting BAM file using mpileup from the SAMtools package 70 . The variants with QUAL <20, DP <5 and GQ <5 were filtered out. Then, we compared genotypes of the filtered variants with genotypes generated from WGS and imputation. The genotypes generated were considered to be from the same sample if the concordance rate was greater than 90%.

Allele specific analysis of transcription factor binding
For allele specific analysis, we used the phased WGS VCF that was also utilised for QTL mapping but here we removed indels and only considered biallelic single nucleotide variants.
We then mapped deduplicated ChIP-seq reads on each allele of each SNVs using GATK ASEReadCounter with default parameters, base quality ≥2 and mapping quality ≥15. We then filtered for heterozygous SNVs only with ≥ 10 read counts per site and nonzero counts in both alleles. We required 2 donors meeting these read counts criteria at each site. To carry out association analysis, we used Rasqual 23 with total read counts per sample as offset parameter. Note that Rasqual uses a model that corrects for reference mapping bias and genotyping errors. To correct for non-genetic confounders, we applied PCA with and without permutation on normalised read counts in log2RPM across all sites and picked the first N components whose explained variances are greater than those from permutation as covariates for Rasqual. Finally, we only considered SNVs found within peaks to determine direct allele specific effect on TF binding of PU.1 and CTCF in neutrophils.

Enrichment analysis of tfQTLs and hQTLs in PIRs
Each of these heterozygous SNVs was annotated based on whether they were located in a PIR and whether they were significant tfQTLs (PU.  2,3,4).

Enrichment of genome wide association SNPs within ChIP-seq marked regions
To test for significant enrichment of trait associated SNPs within regions of interest, we applied GWAS analysis of regulatory or functional information enrichment with LD (GARFIELD) 48 .
H3K27ac and H3K4me1 occupied regions in neutrophils were obtained from 7 . Neutrophil annotations for PU.1, C/EBPβ, H3K4me3 and H3K27me3 were generated as described above. With the exception that H3K27me3, regions were not chunked into 2.5Kb bins.
Monocyte annotation are described in Supplementary Figure 3a for PU.1 and C/EBPβ.

Colocalisation between diseases and molecular trait
To overlap our QTL results to GWAS catalogue, we calculated the LD information based on our WGS data using plink v1.9 73 . For all the QTLs that either directly mapped to the GWAS variants or in LD (r 2 ≥0.8), we considered that the QTL variant overlapped with a GWAS signal.
For the cases where we further selected six autoimmune diseases, we took forward the overlapping disease variants with P-value ≤5x10 -8 in six selected studies are celiac disease 46 . The associations of IBD, CD and UC in the European cohorts were used for this study.
We also used Type 2 diabetes 47 as a negative control. We used a Bayesian colocalization      Figure 3: Identification of PU.1 and C/EBPβ differentially bound binding sites. a. Pairwise differential binding analysis between Monocytes and Neutrophils for PU.1 and C/EBPβ was performed to identify cell type enriched binding events for the two factors. Read density heatmap +/-2.5kb from centre of binding site. b. Intersection of transcription factor binding sites from A. with chromatin state maps for neutrophils derived using ChromHMM 74 . Pie charts; blue segment is the proportion of TF binding category that fall within regions classed as active in neutrophils. c. Intersection of transcription factor binding sites from A. with chromatin state maps for monocytes. Pie charts; green segment is the proportion of TF binding category that fall within regions classed as active in monocytes. Figure 4. Summary of PCHi-C data in two cell types. a. Principal component analysis plot showing first 2 principal components across twelve PCHi-C data sets. b. Heat map of Pearson correlations for intersecting (1bp overlap) PIRs (CHiCAGO >5) from twelve data sets. c. Bar plot of the number of promoter enhancer connections called with a CHiCAGO score >5 for twelve data sets. d. Using the PCHi-C data from seven individuals, we selected 310,233 heterozygous sites in neutrophils (NEU) and 288,385 sites in monocytes (MON) with allele-specific (AS) bias >1.5 or <0.67, though we removed sites with extreme Allele Specific (AS) bias (<0.01 or >100). 89% of sites with AS bias in NEU and 92% in MON had a consistent AS ratio if found in more than one individual. The percentage of sites with consistent AS bias detected in two individuals drops to 16% and 15%, and it drops further to just 3% or to <1% when shared by three or four individuals. The plot shows the mean allele-specific (AS) ratio or bias and the 95% confidence interval by individual in neutrophils and monocytes. The mean AS ratio increases when AS ratios are shared across samples in a consistent manner. In almost all individuals AS ratios are higher in monocytes than in neutrophils. e. Left heat map; Percentage of sharing between cell types of SNVs that are located in PIRs. Sharing is shown between monocytes (left) and neutrophils (right) in five matched samples (monocytes mean = 14.4%, neutrophils mean = 18.2%) and one mismatched sample (Mon2 and Neu5), (monocytes mean = 6.0%, neutrophils mean = 6.6%). Right heat map; Percentage of sharing between cell types of SNVs that are significant PU.1 QTLs (p<1x10 -5 ). Sharing is shown between monocytes (left) and neutrophils (right) in five matched samples (monocytes mean = 37.0%, neutrophils mean = 33.9%) and one mismatched sample (Mon2 and Neu5), (monocytes mean = 17.0%, neutrophils mean = 15.9%). f. Allele-specific SNVs, identified through PCHi-C, were selected if they were observed in at least two samples or cell types, and if their REF/ALT ratio was consistent, i.e. either > 1 or < 1, across all samples. Enrichment of QTLs in PIRs was then calculated for 14,000 SNPs that fulfilled these criteria and that were supported by an increasing number of samples that showed evidence for falling into a PIR and being a significant QTL (N=1,2,3,4). Enrichment increased when increasing the number of supporting samples.

Supplementary Figure 5. Frequency of motif disruption for transcription factor families at colocalised loci.
Bar plot with the frequency of motif families with a predicted transcription factor disruption (CATO score >0.1). The top 6 clusters harbour 79% of the 231 tfQTLs (i.e. PU.1 lead tfQTLs and proxies with LD>0.8).