Abstract
At the heart of gene regulation are Transcription Factors (TFs), proteins which bind to DNA in a sequence-specific manner and drive the activation or repression of genes. Here, we present a statistical thermodynamics framework (ChIPanalyser) which models and predicts binding of TFs. We focused on investigating the binding mechanisms of three TFs that are known architectural proteins (CTCF, BEAF-32 and su(Hw)) in three Drosophila cell lines (BG3, Kc167 and S2). While CTCF preferentially binds only to a subset of high affinity sites located in open chromatin, BEAF-32 binds to most of its high affinity binding sites available in open chromatin. In contrast, su(Hw) binds both to open chromatin and to regions displaying intermediate levels of accessibility. Most importantly, differences in TF binding profiles between cell lines for these TFs are mainly driven by differences in DNA accessibility and not by differences in TF concentrations between cell lines. Finally, we investigated binding of Hox TFs in Drosophila and found that Ubx prefers open chromatin, while Abd-B and Dfd are capable of binding in partially closed chromatin. Overall, our results show that TFs display different binding mechanisms and that our model is able to recapitulate this diverse repertoire of mechanisms.
Introduction
Decades of research have shown that gene expression is at the heart of many, if not all, cellular processes. From development to cellular homoeostasis, the activation or repression of gene expression enables cells, and by extension organisms, to function properly. Transcription Factors (TFs) are one of the key components of the regulation of gene expression. TFs are a class of proteins that bind to DNA in a sequence-specific manner [1, 2]. The most commonly used experimental method to determine specific regions of DNA where TFs bind is chromatin immunoprecipitation followed by sequencing (ChIP-seq) [3, 4]. This technique has become the gold standard to determine the binding profiles of TFs to the genome, but, despite its huge impact on understanding gene regulation, it does not provide a mechanistic model of what drives the binding of TFs to those regions or even how genes are regulated. While we still lack a complete predictive model for gene expression, over the years, many factors have been identified as contributing to context-dependent TF binding.
The most fundamental aspect to consider with respect to TF binding specificity is the DNA sequence itself. Most TFs exhibit a preferred binding motif [5, 6, 7, 8]. The most common way to describe this motif is in the form of a Position Weight Matrix (PWM), a measure of binding energy between TFs and DNA weighted by the genomic base pair frequency [5, 9]. Although PWMs have shown their ability to describe TF preferred binding sequences, they have limitations when it comes to predicting TF binding loci or sites. In particular, TFs can have tens of thousands of potential binding sites within each genome, yet they bind to only a few hundred or a few thousand of them [10, 11, 12, 13].
Unsurprisingly, there is a large body of early research describing binding of TFs to DNA as a problem similar to ligand-receptor binding [14, 15, 16]. At its core, this approach relies on the local concentration of TF (ligand) and receptor availability (binding sites) but also on association and dissociation constants between TFs and DNA [17]. Previous studies have shown that some TF binding events are TF concentration dependent [18, 19, 20, 21, 22], where varying the concentration of the TF will drive the expression of different sets of genes. A good example is the case of Homeobox genes, more commonly known as Hox genes. In Drosophila, embryo patterning is believed to be a consequence of varying Hox TF concentration along the anterior-posterior axis [23] and recent efforts have shown that Hox TF concentration is sufficient to predict cellular identity along the anterior-posterior axis with high accuracy [24]. However, there are many more spurious sites where TFs could bind than functional binding sites. This raises the question: how does a TF recognise the right binding motif out of many decoys?
One way to reduce the number of available sites is to consider DNA accessibility. Are these sites even available for binding in the first place? This assumes that TFs would bind only to sites that are accessible and cannot locate sites in dense chromatin [25, 26, 27, 28, 29]. Nevertheless, there is a certain class of TFs that would ignore accessibility restrictions and these TFs are known as pioneer TFs. More specifically, pioneer TFs can bind sites in closed dense chromatin and subsequently open the chromatin [30, 31, 32].
We previously showed that statistical thermodynamics can be used to model TF binding to DNA with high accuracy [22]. Considering only the binding energy between TFs and DNA (estimated by the PWM and a scaling factor modulating the binding energy), the number of molecules bound to the DNA and DNA accessibility, we modelled binding of five TFs in the Drosophila embryo. Our results confirmed that, for some TFs, this model is sufficient to explain the majority of observed binding events in ChIP data and we were able to infer the number of bound molecules and the specificity for five TFs in the Drosophila embryo (bcd, cad, gt, hb and Kr).
In this manuscript, we build upon our previous model and present ChIPanalyser, a versatile, fast, efficient and user-friendly R/Bioconductor package [33, 34]. We used this model to describe the behaviour of several Drosophila TFs: CTCF, BEAF-32, su(Hw), Ubx, Abd-B and Dfd. Our results provide a mechanistic interpretation of the TF binding behaviour and propose a new classification of TFs based on fine details of their binding mechanism. In particular, we found that DNA accessibility is the main driver that explains binding of CTCF, BEAF-32 and su(Hw) in three Drosophila cell lines (BG3, Kc167 and S2) and that moderate relative changes in the concentrations of these TFs lead to only negligible changes in their binding profiles. Finally, we also show that TF binding specificity can be achieved by their capacity to bind to regions with different levels of DNA accessibility. In particular, we showed that Ubx, Abd-B and Dfd binding to DNA could be explained by their different capacity to bind dense chromatin, with Ubx binding only in highly accessible chromatin and Dfd and Abd-B binding in denser chromatin.
MATERIALS AND METHODS
Model Description
ChIPanalyser is an R package available on Bioconductor [33, 34]. The package is an implementation of the statistical thermodynamics model proposed in [22]. Briefly, the model requires a PWM (Position Weight Matrix) or PFM (Position Frequency Matrix) of the TF of interest, DNA accessibility data to model binding site accessibility and two additional parameters: λ (a PWM scaling factor) and N (the number of bound molecules) [22]. The probability of a position j on the DNA being occupied is given in [22]. λ and N are difficult to estimate from experimental data and, thus, we used ChIP-seq data and selected the values of these parameters that maximise (or minimise) the goodness of fit metrics.
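To make the roles of λ and N concrete, the following Python sketch evaluates one occupancy expression of the kind used in statistical thermodynamics models of TF binding. The functional form, the function name occupancy and the ploidy argument are assumptions made for illustration; the expression actually used by ChIPanalyser is the one given in [22] and may differ in detail.

```python
import math


def occupancy(scores, access, n_bound, lam, ploidy=2):
    """Occupancy probability per position under an ASSUMED model form:

        P_j = N a_j exp(w_j / lam) / (N a_j exp(w_j / lam) + L n <a_i exp(w_i / lam)>)

    scores  : PWM score w_j for each position
    access  : accessibility a_j for each position (0 = closed, 1 = open)
    n_bound : N, the number of bound molecules
    lam     : lambda, the PWM scaling factor
    ploidy  : n, the number of genome copies (assumption of this sketch)
    """
    length = len(scores)
    # Boltzmann-like weight of each site, switched off where DNA is inaccessible.
    weights = [a * math.exp(w / lam) for w, a in zip(scores, access)]
    # Competition term: every other site in the region competes for the TF.
    background = length * ploidy * (sum(weights) / length)
    if background == 0:
        # No accessible site at all, so nothing can be occupied.
        return [0.0] * length
    return [n_bound * wt / (n_bound * wt + background) for wt in weights]
```

In this sketch a larger N raises the occupancy of every accessible site, whereas a smaller λ sharpens the contrast between high and low scoring sites.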
Datasets
To carry out the analysis described in this manuscript, we selected data originating from various sources.
DNA Sequence
Reference sequences of Drosophila melanogaster were extracted from the BSgenome R packages [35]. We used both the dm3 [36] and dm6 [37] versions of the Drosophila genome. In particular, we used dm6 only for the Hox TF analysis and dm3 for the rest of the analysis. It should be noted that the choice between dm3 and dm6 was made for consistency with the source data. All modENCODE data sets (see section ChIP-seq) use the dm3 build of the genome, while ATAC-seq and associated Hox ChIP-seq data were aligned to the dm6 version of the genome.
PWM and PFM
Binding Motif matrices were downloaded from online repositories such as JASPAR [38] or extracted from the MotifDb R package [39], which collects and compiles PFMs and Position Probability Matrices (PPM) from various online repositories (see Figure S1 in Supplementary Material).
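Because the model accepts either a PFM or a PWM, it can help to see how one is derived from the other. The sketch below converts counts into log-odds weights against a background base frequency and scans a sequence with the result. The function names, the pseudocount scheme and the uniform default background are illustrative assumptions, not the conversion performed by ChIPanalyser or MotifDb.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, background=None, pseudocount=1.0):
    """Turn a position frequency matrix into log2-odds weights.

    pfm        : dict mapping each base to a list of counts, one per motif position
    background : dict of genomic base frequencies (uniform if omitted; an assumption)
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    width = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for pos in range(width):
        total = sum(pfm[b][pos] for b in BASES) + pseudocount
        for b in BASES:
            # The pseudocount is shared out by background frequency to avoid log(0).
            prob = (pfm[b][pos] + pseudocount * background[b]) / total
            pwm[b].append(math.log2(prob / background[b]))
    return pwm


def score_site(pwm, site):
    """Sum the per-position weights for one candidate site."""
    return sum(pwm[base][pos] for pos, base in enumerate(site))


def scan(pwm, sequence):
    """Score every window of motif width along one strand of the sequence."""
    width = len(pwm["A"])
    return [score_site(pwm, sequence[i:i + width])
            for i in range(len(sequence) - width + 1)]
```

For a toy two-base motif that is always "AC", scan(pwm, "AACGT") returns four window scores, the highest of which falls on the window "AC".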
ChIP-seq
Both ChIP-seq enrichment signal and ChIP-seq peaks were downloaded (pre-processed) from modENCODE [40] for three Drosophila cell lines: Kc167, S2 and BG3. When required, supplementary data sets were downloaded from GEO. GEO datasets were aligned to the genome (dm3) using Bowtie 2 (--non-deterministic). Note that for the Hox TF analysis we used dm6. Peaks and pile-up signal were called using MACS2 with a 0.01 FDR (-q 0.01). As described above, the choice between dm3 and dm6 was made for consistency: modENCODE data sets were aligned to the dm3 version of the Drosophila genome. Datasets used for this analysis are described in Table S1 in Supplementary Material.
DNA accessibility
DNase I hypersensitivity data were generated by modENCODE for the three cell lines used in this analysis [40, 41]. We extended the DNase Hypersensitivity Sites (DHS) by 500 bp (see Figure S2 in Supplementary Material). DNA was considered either accessible (DHS) or inaccessible. ATAC-seq data for Kc167 cells were taken from [42]. We selected a series of ATAC-seq signal thresholds to use as cut-off points separating accessible from inaccessible DNA. These thresholds were based on signal quantiles from 0.05 to 0.95 in steps of 0.05. We also considered the 0.99, 0.999 and 0.9999 quantile thresholds. We will refer to this method as Quantized Density Accessibility (QDA).
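The QDA thresholds described above can be sketched as follows. This is a minimal illustration, assuming a per-position ATAC-seq signal held in a plain list; the helper names, the linear-interpolation quantile and the choice of treating positions at or above the cut-off as accessible are assumptions, not the package implementation.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def qda_levels():
    """Quantiles 0.05 to 0.95 in steps of 0.05, plus 0.99, 0.999 and 0.9999."""
    return [round(0.05 * k, 2) for k in range(1, 20)] + [0.99, 0.999, 0.9999]


def qda_masks(signal):
    """Map each QDA level to a boolean accessibility mask over the signal."""
    ordered = sorted(signal)
    masks = {}
    for q in qda_levels():
        cutoff = quantile(ordered, q)
        # Assumption: a position is accessible when its signal reaches the cut-off.
        masks[q] = [value >= cutoff for value in signal]
    return masks
```

Higher QDA levels mark progressively fewer positions as accessible, which is the behaviour relied upon in the Hox TF analysis.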
RNA-seq
To rescale TF abundance between cell lines, we used RNA-seq data from [43], who preprocessed the original modENCODE datasets [44]. RNA-seq relative abundance was used to rescale the number of bound molecules estimated in one cell line to another cell line.
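As an illustration of this rescaling step, the sketch below assumes that N scales linearly with the ratio of RNA-seq expression between the target and source cell lines. Both the proportionality assumption and the function name are ours, and the expression values in the usage note are hypothetical.

```python
def rescale_bound_molecules(n_source, expr_source, expr_target):
    """Rescale N from a source cell line to a target cell line.

    Assumes N is proportional to the relative RNA-seq abundance of the TF.
    """
    if expr_source <= 0:
        raise ValueError("source expression must be positive")
    return n_source * expr_target / expr_source
```

For example, with N = 2e5 in the source cell line and hypothetical expression values of 10.0 (source) and 16.5 (target), the rescaled value is 3.3e5.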
Package Description
The workflow of ChIPanalyser is described in Figure 1. Briefly, the optimal set of parameters (λ and N) is inferred from ChIP-seq data by maximising (or minimising) the goodness of fit metric. Using these values, ChIPanalyser will predict ChIP-seq like profiles for different genomic regions and compare the prediction with the actual ChIP-seq data.
ChIPanalyser uses a set of genomic regions to carry out the optimisation of the model. Generally, these regions would be user provided. For our analysis, we investigated regions that contain both ChIP-seq peaks and accessible DNA. To do so, we binned the genome into bins of 20 Kb and used the processingChIPseq function provided by ChIPanalyser. This function returns a normalised ChIP enrichment score for the top n regions with respect to ChIP-seq data (both number of peaks and peak enrichment) and DNA accessibility data. It should be noted that the number of regions selected is user specified.
During this step of the analysis, we also included a noise filtering method. The current model does not consider ChIP depletion; therefore, all negative scores are replaced by 0. With that in mind, ChIPanalyser provides four methods of filtering noise: Zero, Mean, Median and Sigmoid. Zero removes only depletion scores (equivalent to “no noise filtering”). Mean and Median replace all scores below the mean and the median, respectively, after filtering out depletion scores. Finally, Sigmoid applies logistic weighting to every score. The logistic midpoint is set at the 95th quantile of ChIP scores, the lower bound is 0 and the upper bound is 2. Consequently, each score is multiplied by a weight: if the score is above the 95th quantile, it is weighted by a value between 1 and 2; if the score is below the 95th quantile, it is weighted by a factor ranging from 0 to 1. All analysis in this manuscript was carried out using the Sigmoid noise filtering method.
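The four filters can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the package code: scores below the mean or median are assumed to be replaced by zero, the logistic steepness is a free parameter that the text does not specify, and the function names are hypothetical.

```python
import math
import statistics


def _quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def filter_noise(scores, method="sigmoid", steepness=1.0):
    """Apply one of the Zero, Mean, Median or Sigmoid filters to ChIP scores."""
    # Depletion (negative) scores are always set to zero first.
    clipped = [max(s, 0.0) for s in scores]
    if method == "zero":
        return clipped
    if method in ("mean", "median"):
        cutoff = (statistics.mean(clipped) if method == "mean"
                  else statistics.median(clipped))
        # Assumption: scores below the cut-off are replaced by zero.
        return [s if s >= cutoff else 0.0 for s in clipped]
    if method == "sigmoid":
        mid = _quantile(sorted(clipped), 0.95)
        out = []
        for s in clipped:
            # Logistic weight in (0, 2), equal to 1 at the 95th quantile.
            exponent = min(-steepness * (s - mid), 700.0)  # guard against overflow
            out.append(s * 2.0 / (1.0 + math.exp(exponent)))
        return out
    raise ValueError("unknown method: " + method)
```

With the Sigmoid filter, strong peaks are amplified by at most a factor of two while low scores are shrunk towards zero, which matches the weighting ranges described above.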
Once the loci of interest had been selected, we computed the optimal set of parameters using the computeOptimal function of ChIPanalyser. The optimal set of parameters is inferred by maximising (or minimising) the average goodness of fit metric over all selected regions. ChIPanalyser offers 12 different metrics: correlation coefficients (Pearson, Spearman and Kendall), Mean Squared Error (MSE), Kolmogorov-Smirnov Distance, precision, recall, accuracy, F-score, Matthews correlation coefficient (MCC) and Area Under Curve Receiver Operator Characteristic (AUC ROC or just AUC) (see Table 1). We also developed a novel method that describes the ratio of shared geometric area between curves and difference in area between curves.
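The optimisation amounts to a grid search over λ and N in which the chosen metric is averaged over all regions. The sketch below illustrates the idea with two of the metrics (MSE and a rank-based AUC ROC). The function names, the predict callback and the use of binary peak labels for AUC are assumptions for illustration; they do not reproduce the computeOptimal interface.

```python
def mse(predicted, observed):
    """Mean squared error between a predicted and an observed profile."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def auc_roc(predicted, labels):
    """Probability that a random positive position outranks a random negative one.

    labels are assumed to be binary (e.g. inside or outside a ChIP-seq peak).
    The pairwise loop is quadratic and only meant for small examples.
    """
    pos = [p for p, lab in zip(predicted, labels) if lab]
    neg = [p for p, lab in zip(predicted, labels) if not lab]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def optimise(regions, predict, lambdas, n_values, metric, maximise):
    """Return (average score, lambda, N) for the best parameter combination.

    regions : list of (inputs, observed) pairs, one per genomic region
    predict : callable (inputs, lam, n) -> predicted profile (hypothetical hook)
    """
    best = None
    for lam in lambdas:
        for n in n_values:
            scores = [metric(predict(inputs, lam, n), observed)
                      for inputs, observed in regions]
            avg = sum(scores) / len(scores)
            better = best is None or (avg > best[0] if maximise else avg < best[0])
            if better:
                best = (avg, lam, n)
    return best
```

Similarity metrics such as AUC ROC would be called with maximise=True, whereas dissimilarity metrics such as MSE would be called with maximise=False.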
The optimal parameters can be visualised in the form of a heat map describing the score associated with each combination of λ and N. Heat maps are produced using the plotOptimalHeatMaps function. Finally, using the optimal set of parameters, ChIPanalyser will produce ChIP-seq like profiles. Profiles can be visualised using the plotOccupancyProfiles function provided by the package.
RESULTS
Goodness of fit metrics are context dependent
Previously, we showed how statistical thermodynamics can be used to mechanistically explain the binding of TFs in Drosophila [22]. The optimal set of parameters (see Materials and Methods) was inferred by maximising correlation and minimising Mean Squared Error (MSE) between the predicted profile and experimental ChIP-seq data. Nevertheless, we observed that, in some cases, the predicted profiles and ChIP-seq profiles displayed a low correlation coefficient despite the profiles looking similar. Conversely, high correlation coefficients were also associated with poor overlap between predicted and actual ChIP profiles (e.g. see Figure S3A and S3B in Supplementary Material). In addition, selecting the optimal parameters was hindered by little variation in correlation between parameter combinations. As a consequence, the selection of these parameters was exclusively driven by MSE (see Figure S3C in Supplementary Material). We hypothesised that these discrepancies could be due to either background noise in ChIP-seq data or biases in the goodness of fit metrics that we used (Pearson coefficient of correlation and MSE).
To reduce the potential influence of background noise, we tested four noise removal methods: Zero (removes only depletion scores), Mean (replaces all scores below the mean), Median (replaces all scores below the median) and Sigmoid (applies logistic weighting to every score); see Materials and Methods. To test the performance of these methods, we used three CTCF datasets: (i) a ChIP-chip dataset with very little background noise (modENCODE 2639), (ii) a ChIP-seq dataset with high background noise (modENCODE 3674) and (iii) a combination of all ChIP-seq datasets in S2 cells (by adding enrichment signals together at a base pair level); see Table S1 in Supplementary Material. We ran the model on the top ten regions and searched for the optimal set of parameters (λ and N) that optimised each goodness of fit metric (see Materials and Methods). We normalised the signal in order to ensure equal contribution of each data set (see Table S1 in Supplementary Material). All four noise filtering methods had little to no effect on ChIP data. The Sigmoid method showed a slight signal reduction in smaller peaks (especially for noisy datasets), which then translated into a slight improvement of the mean Area Under Curve Receiver Operator Characteristic (AUC ROC) score between ChIP signal and our predictions (see Figure S4 in Supplementary Material).
In addition to Pearson correlation and MSE, we tested several goodness of fit metrics to verify the influence of the metrics on our model. In particular, we compared correlation (Pearson, Spearman and Kendall), MSE, Kolmogorov-Smirnov Distance, precision, recall, accuracy, F-score, Matthews correlation coefficient (MCC) and AUC ROC (see Table 1). In addition, we also developed a novel method that describes the ratio of geometric shared area between curves and difference in area between curves (see Materials and Methods). We used the same three CTCF datasets as described above and observed the emergence of two classes within these metrics: (i) similarity metrics that describe how similar the two curves are (correlation coefficients, precision, MCC, accuracy, F-score and AUC ROC) and (ii) dissimilarity metrics that measure how different the two curves are (MSE, geometric ratio, recall and Kolmogorov-Smirnov distance). Our results showed that, depending on the metric used, the optimal set of parameters would vary significantly, and that the two classes (similarity and dissimilarity metrics) displayed different values for the optimal parameters (see Figure 2A-C). However, in certain instances the optimal set of parameters selected by dissimilarity metrics would overlap slightly with parameters selected by similarity metrics (see Figure 2D-F).
Goodness of fit metrics influence the way the model selects the optimal parameters, but how does this translate to the individual predicted ChIP profile level? We further investigated this behaviour at the individual loci using the same three CTCF datasets. Figure 2G-I shows that similarity metrics (black shades) tend to be less prone to false positive peaks but miss the actual ChIP signal level within the peak (the height of the peak). On the other hand, dissimilarity metrics (light blue shades) generate far more false positives but accurately recover the height of the peaks.
Overall, the best performing metrics were AUC ROC, MSE and geometric ratio. AUC ROC occasionally missed peaks completely but seemed to recover peak height fairly accurately, while geometric ratio and MSE rarely missed peaks but also tended to predict a higher number of false positive peaks. For much of the following analysis, we used AUC ROC and MSE, since they are more widely used estimators and performed best.
DNA accessibility plays a key role in the binding of TFs
Steric hindrance can influence the binding of some TFs to DNA, meaning that a TF molecule would only bind stretches of DNA if they are accessible. Any given genomic region can be considered either accessible or inaccessible and that is sufficient to explain the binding profiles of most TFs [22]. Here, we selected accessible DNA based on DNase Hypersensitivity Sites (DHS) in three Drosophila cell lines (Kc167, S2 and BG3) and, as a point of comparison, we also considered all DNA to be accessible (No Access). We focussed our analysis on three TFs: CTCF, BEAF-32 and su(Hw) (a breakdown of each data set can be found in Table S1 in Supplementary Material). The influence of accessibility was measured by computing the median AUC score over all regions for the best performing set of parameters. In this instance, we used AUC scores as we were interested in recovering peak location more than peak height. Figure 3 shows that, for CTCF and BEAF-32, the binding predictions were improved when considering DNA accessibility. Nevertheless, su(Hw) displayed a different behaviour, as the mean AUC decreased when DNA accessibility was considered for most ChIP-seq datasets (Figure 3B). Note, however, that there are two replicates where this is not the case: Kc167 su(Hw) and BG3 3718 su(Hw).
While DNA accessibility seems to improve the predictions, we also observed that the number of bound molecules (N) and the scaling factor (λ) show a reduced influence when DNA accessibility is considered (Figure 3). In particular, we observed less variation in AUC between different sets of parameters when DNA accessibility is included, i.e., larger circles indicate that the number of bound molecules and λ have a more important role in TF binding, while smaller circles indicate that they have a less important role. The trend holds for CTCF and BEAF-32, but is weaker for su(Hw).
As described, the genome was split into tiles of 20 Kb and then passed to ChIPanalyser. To account for potential differences in the capacity of the model to predict binding in regions with strong or weak ChIP signal, we selected the top 20, 50, 100, 150, 200, 300 and 500 regions in terms of ChIP signal that also contained accessible DNA. We then looked at how the AUC score changes when regions with weaker binding are included in the analysis or when DNA accessibility is considered. For each number of regions selected and for each data set, we subtracted the AUC score when no accessibility was considered from the AUC score with DHS accessibility. Our results indicate that CTCF, BEAF-32 and su(Hw) are all influenced by DNA accessibility, albeit in different ways. First, we observed that both CTCF and BEAF-32 displayed significantly higher AUC scores when DNA accessibility was included, supporting the previous findings (Figure 4A-B and Figure S5 in Supplementary Material). In addition, AUC scores for CTCF decreased as the number of regions selected for analysis increased (Figure 4A and Figure S5 in Supplementary Material), while BEAF-32 AUC scores were not affected by the increase in the number of regions (Figures 4B and E and Figure S5 in Supplementary Material). This means that predictions for BEAF-32 are much better when DNA accessibility is considered, but they do not seem to be influenced by the number of regions selected. BEAF-32 would bind anywhere along the genome as long as it has an accessible site. CTCF also displays better AUC scores when accessibility is considered, but, in contrast to BEAF-32, analysing more regions (including those with weaker binding) negatively affects the performance of the predictions for CTCF. This implies that CTCF binds in accessible DNA but preferentially binds to genome hotspots. We call BEAF-32 a global binder and CTCF a hotspot TF.
Furthermore, Figure 4C and F shows that there is a small but statistically significant (p < 0.05) reduction in AUC score for su(Hw) when DNA accessibility is included, which indicates that su(Hw) would bind in less accessible DNA. Our su(Hw) predictions worsened as the number of regions increased, but only when DNA accessibility was considered (see Figure S5 in Supplementary Material). The opposite became true when DNA accessibility was not considered (see Figure S5 in Supplementary Material). While su(Hw) did not generally perform well when DNA accessibility is considered, the performance of our model in predicting su(Hw) binding is also tied to the number of regions selected and our results show that the preferred binding sites of su(Hw) are found in inaccessible DNA. Increasing the number of regions only increases the probability of including these high affinity sites in the analysis.
This analysis was performed by optimising AUC scores, but we also ran a similar analysis optimising MSE and our findings are also supported in that case (Figure S6 in Supplementary Material).
Number of bound molecules and TF specificity influences TF binding
The number of bound molecules and the scaling factor of the PWM scores have an impact on the binding profiles of TFs [22]. Building upon this idea, we sought to identify the optimal set of parameters by minimising the MSE between our predicted ChIP profile and experimentally produced ChIP-seq profiles. In this part of the analysis, we used MSE as the question at hand requires the predicted curve to follow the experimental curve not only in location but in relative enrichment as well.
The first step was to show that the optimal parameters selected were consistent between different biological replicates. If this were not the case, we would not be able to ascertain the validity of our inferred parameters. The optimal set of parameters can be visualised as a heat map showing the goodness of fit score for a set of bound molecules and scaling factors. Despite strong variations between experimental data, we show that the predicted optimal set of parameters remained similar between biological replicates and that differences tend to arise from differences between cell lines (see Figure 5). This suggests that, despite biological and technical variation between replicates performed by different labs using different protocols, our model robustly infers a similar number of bound molecules and scaling factor for a given TF. Interestingly, the same robustness carries over to other goodness of fit metrics (see Figures S7 and S8 in Supplementary Material).
Furthermore, we found that CTCF seems to be more abundant in BG3 cells than in Kc167 cells, with S2 cells displaying intermediate levels (see Figure 5). In contrast, BEAF-32 and su(Hw) seem to have similar numbers of bound molecules in the three cell lines. The optimal parameter estimates for the best performing number of regions can be found in Table S2 and Table S3 in Supplementary Material for both AUC and MSE.
To investigate the influence of these parameters, we assumed that a high variation of goodness of fit score across parameter combinations would suggest a strong influence of these parameters on TF binding. If goodness of fit scores varied little between parameter combinations, we concluded that they do not strongly influence our predicted profiles. We analysed the standard deviation of MSE between different sets of parameters and we found that some TFs are not strongly influenced by the number of bound molecules or the scaling factor (described by circle size in Figure 3). CTCF and BEAF-32 showed a decrease in sensitivity to the number of bound molecules and the scaling factor when accessibility is considered (Figure 3A and C). This means that DNA accessibility would be the strongest driver towards predicting TF binding. Restricting the number of available binding motifs would be more influential than TF copy number and the ability of a TF to discriminate between high and low affinity sites, and we can see this in the decrease of parameter sensitivity with DHS. When all DNA is considered accessible, these parameters have a stronger influence on modulating the predicted curves. It should be noted that this behaviour was also observed when using metrics other than MSE (e.g. AUC in Figure 3).
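The sensitivity measure used here can be summarised as the spread of goodness of fit scores over the parameter grid. The sketch below assumes a grid of scores indexed by λ (rows) and N (columns) and uses the population standard deviation; the function name and the choice of population rather than sample standard deviation are assumptions for illustration.

```python
import statistics


def parameter_sensitivity(score_grid):
    """Standard deviation of goodness of fit scores over all (lambda, N) pairs.

    A value near zero means the parameters barely change the fit; a large
    value means the fit depends strongly on the parameters.
    """
    flat = [score for row in score_grid for score in row]
    return statistics.pstdev(flat)
```

A flat grid such as [[1.0, 1.0], [1.0, 1.0]] gives 0.0, which in this reading corresponds to a small circle in Figure 3.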
ChIPanalyser predicts TF binding in different cell lines by considering relative mRNA abundance
We wanted to further investigate the predictive capabilities of our model and also demonstrate its mechanistic soundness for CTCF, BEAF-32 and su(Hw) in the three cell lines. For that, we estimated the optimal set of parameters in one cell line and aimed to predict TF binding in a different cell line, taking into account changes in DNA accessibility using DHS data and changes in the number of bound molecules using relative changes in RNA abundance. For example, we estimated the optimal set of parameters for CTCF in Kc167 cells that would minimise MSE as λ = 2.5 and N = 2 × 10^5 over the top 20 regions (see Materials and Methods). By rescaling N based on relative RNA-seq levels of CTCF in the two cell lines, we could approximate the number of CTCF molecules bound to DNA in BG3 cells (N ≈ 3.3 × 10^5). This, together with BG3-specific DNA accessibility data, can predict the ChIP-seq profile in BG3 cells with high accuracy (see Figure 6A and B). RNA rescaling seems to recover both the number of peaks and their location with high accuracy. Moreover, the rescaling of the number of bound molecules did not lead to any difference in terms of MSE variation between estimated and rescaled parameters (Figure 6G). The same analysis was performed for BEAF-32 (Figure 6C, D and H), where we estimated parameters in BG3 cells (λ = 2 and N = 2 × 10^5) and rescaled the number of molecules in S2 cells (N ≈ 1.2 × 10^5). Once again, the model correctly predicts ChIP profiles in both location and relative enrichment. Finally, for su(Hw) (Figure 6E, F and I) we estimated parameters in Kc167 cells (λ = 3 and N = 5 × 10^4) and rescaled the number of molecules in S2 cells (N ≈ 3 × 10^4). Again, the predictions of the model are accurate.
Our results show that changes between cells in DNA accessibility and number of molecules are enough to explain the changes in TF binding profiles. Nevertheless, we still do not know which of the two is the more important factor or whether both have similar contributions. To address this, we repeated the analysis assuming either no change or a thousand-fold reduction in the number of bound molecules in the predicted profile. Figure 6 shows that using the same TF abundance as in the original cell line did not change the quality of the predictions at all. In fact, we observed a significant reduction in the predicted profile only when reducing the number of bound molecules by a factor of 1000. These results show that differences between cell lines in the binding profiles of TFs would mainly come from differences in DNA accessibility and not from relatively small changes in TF abundance. The only way that TF abundance could impact the binding profile (and, consequently, lead to changes in gene regulation) is when the expression of the TF is strongly downregulated. The fact that variations in the number of bound molecules between cell lines have small effects on the binding profiles is not so surprising. Figures 3 and 5 show that there is a wide range of values for the number of bound molecules that optimise the goodness of fit metrics (see also Figures S7 and S8 in Supplementary Material).
Hox genes show differential binding preferences with respect to DNA accessibility
Hox proteins are key players during development. Recently it has been suggested that Hox proteins show different binding preferences with respect to DNA accessibility [42]. Most notably, Ubx and Abd-A would bind predominantly in open chromatin, while other Hox TFs (Lab, Pg, Dfd, Scr and Abd-B) would prefer closed chromatin. We selected three Hox TFs (Ubx, Dfd and Abd-B) and ran our model using different levels of DNA accessibility. DNA accessibility levels were selected based on the quantile distribution of ATAC-seq scores (see Materials and Methods). Briefly, this means that higher QDA scores lead to fewer regions being marked as accessible.
We selected regions containing both peaks and accessible regions and then, for each QDA accessibility, selected the optimal set of parameters by maximising the AUC score between our prediction and ChIP-seq data. Our results show that Ubx exhibits a preference towards open chromatin. In Figure 7A, the maximum AUC score for Ubx increases with increasing QDA score. This can be explained by the fact that increasing the threshold of ATAC-seq score ensures that open chromatin regions are truly open and not in an intermediate state. Dfd and Abd-B, on the other hand, were not strongly influenced by QDA accessibility. This means that these TFs can bind in inaccessible DNA. According to our model, Ubx performed best with 0.99 QDA (99th quantile of ATAC-seq scores - AUC 0.862), while Abd-B and Dfd performed best with 0.95 QDA and 0.8 QDA respectively (see Figure 7B).
The model recovers the position of peaks accurately especially for Ubx (see Figure 7C-E). While for Dfd and Abd-B most of the peaks are detected, their height is not always an accurate representation of the strength of the ChIP-seq signal. Hox TFs are known to display cooperative interactions and there are reports that both Dfd and Abd-B have a higher number of sites in the bound peaks, suggesting they bind cooperatively to open the chromatin [42]. Our model does not include cooperative interactions and this could explain the reduced performance for Dfd and Abd-B.
DISCUSSION
Our analysis shows that ChIPanalyser and its underlying model predict binding profiles of TFs (ChIP) with high accuracy and can also shed light on the binding mechanism of TFs. We show how ChIPanalyser not only predicts the location of peaks but can also correctly predict the enrichment of a TF at a given location. The optimal set of parameters seems to be grounded in the biology itself. Here we untangled the interplay between DNA accessibility and the number of bound molecules and how that impacts the binding of TFs to the DNA.
TFs use different binding mechanisms
In this analysis, we focused our attention on three DNA binding proteins: CTCF, BEAF-32 and su(Hw). All three TFs are known architectural proteins in Drosophila but also play roles in transcription regulation and insulation [45, 46, 47, 48, 49, 50]. Moreover, it was shown that these three TFs showed distinct binding behaviours and were classified into three subclasses with respect to chromatin architecture [51, 52]. In our analysis we show that CTCF, BEAF-32 and su(Hw) all exhibit different behaviours with respect to DNA binding.
CTCF has been shown to play a role in loop formation and to participate in the maintenance of Topologically Associated Domain (TAD) boundaries [50]. However, only a subset of CTCF sites is involved in these structures, and many CTCF sites do not conform to this rule [53, 54]. In our analysis, CTCF displayed strong sensitivity to DNA accessibility but reduced sensitivity to the number of bound molecules and the scaling factor when DNA accessibility was considered (see Figures 3 and 4). Our findings suggest that CTCF binds to hotspots along the genome, and this could be explained by the observation that the strongest peaks (based on our selection method; see Materials and Methods) are in fact highly conserved binding sites. As the number of sites increases, the conservation of binding sites decreases, as does the goodness of fit. Thus, CTCF binding to highly conserved sites can be explained by our model, but something else is responsible for the reduced binding at less conserved sites (i.e. cell specific CTCF binding) [55].
BEAF-32 is a Drosophila specific genetic insulator [56] that shows preferential binding towards TAD boundaries, but is also involved in transcription itself. More specifically, BEAF-32 was identified as a cis-regulatory element separating close head-to-head genes with different transcription regulation modes [57]. In Drosophila, there is a high density of these genes throughout the genome and BEAF-32 tends to bind close to the TSS [58]. This is further confirmed by studies showing that BEAF-32 has uniform binding along the entire genome [51]. TSSs are generally considered open chromatin and, if BEAF-32 binds in close proximity to TSSs, it comes as no surprise that BEAF-32 would show a high sensitivity towards DNA accessibility. Our results confirm that BEAF-32 shows a strong sensitivity to DNA accessibility and, to a lesser extent, to local abundance.
Furthermore, we show that su(Hw) binds in both open and closed chromatin and also displays a high sensitivity towards the number of bound molecules and the scaling factor. There is a significant body of work showing the role su(Hw) plays in chromatin insulation and remodelling [59, 60, 61, 62]. It has been suggested that its role as an insulator is only possible when paired with other DNA binding proteins such as Cp190 and mdg4. su(Hw) is also a primary actor in the interaction between the genome and the nuclear lamina (at regions known as Lamina Associated Domains, LADs) [63, 64]. Both chromatin insulation and LADs would induce closed chromatin in order to maintain chromosomal structure, and this would explain why su(Hw) can bind in both open and closed chromatin. In this context, ChIP-seq peaks might not overlap well with DNase hypersensitivity data, which would be even more the case for our highly stringent method of selecting accessible sites from DHS sites (see Materials and Methods).
It has been shown that su(Hw) binding sites tend to cluster together (with a varying number of sites) and that these sites are constitutively bound by su(Hw) [65, 66]. Interestingly, it seems that only isolated high affinity sites had a role in transcriptional regulation, while the clustered sites were more involved in chromatin architecture. If clustered binding sites are constitutively bound and the density of these clusters varies along the genome, this would suggest that the number of bound molecules, and how well they discriminate between low and high affinity sites, is a strong driver of su(Hw) binding. We show that, if DNA accessibility is not considered, su(Hw) is sensitive to these two factors (see Figure 3 and Figures S5 and S6 in the Supplementary Material), thus suggesting a mechanistic explanation of this behaviour.
DNA accessibility is the main driver of binding to DNA for some TFs
Our results show that DNA accessibility and the number of bound molecules control the binding profiles of TFs (Figures 3 and 4). When we estimated the binding parameters (λ and N) in one cell line and then predicted TF binding profiles in a different cell line based on changes in DNA accessibility and in the number of TF molecules (using changes in RNA-seq), we found good agreement between our predictions and the actual ChIP-seq dataset (see Figure 6). Nevertheless, the changes in the number of TF molecules between the two cell lines did not seem to make any difference to the predicted profiles (compare the blue and dashed red lines in Figure 6 B, D and F). This indicates that biologically relevant fluctuations in TF numbers between different cell lines would have little effect on the differences in binding profiles of TFs, and that those differences are mainly driven by changes in DNA accessibility. Furthermore, only when reducing the TF concentration by a factor of 1000 did we observe a noticeable decrease in the predicted ChIP profile, which suggests that only strong knockdowns or overexpression would affect binding of TFs and, consequently, lead to changes in the expression of target genes. It should be noted that the TFs we analysed here (CTCF, BEAF-32 and su(Hw)) are highly expressed architectural and insulator proteins and, thus, they would be saturating their binding sites.
Why would relatively moderate changes in the concentration of a TF have such a limited effect on its binding? One potential explanation is that TFs control the expression of essential genes that should be tightly regulated despite the fluctuations in the number of molecules that affect the cell [67, 68]. This would be a buffering mechanism against fluctuations in protein numbers in the cell.
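The saturation argument can be illustrated with a deliberately simplified occupancy function. The form below is an assumption made for this sketch, not the exact expression used by ChIPanalyser: occupancy is taken as a statistical weight (accessibility × N × exp(score/λ)) divided by that weight plus a constant background competition term. The function name `occupancy`, the `background` parameter and all numeric values are hypothetical and are not taken from the analysis.

```python
import numpy as np

def occupancy(pwm_score, n_molecules, lam, accessibility=1.0, background=1e5):
    """Simplified occupancy of one site (illustrative form, see assumptions above).

    pwm_score     : motif score of the site (higher means stronger site)
    n_molecules   : number of TF molecules bound to the DNA (N)
    lam           : scaling factor (lambda) applied to the motif score
    accessibility : 0 for closed chromatin, 1 for fully open chromatin
    background    : hypothetical constant standing in for competition from
                    all other sites in the genome
    """
    weight = accessibility * n_molecules * np.exp(pwm_score / lam)
    return float(weight / (weight + background))

# A strong site in open chromatin with an abundant TF (hypothetical numbers).
reference = occupancy(pwm_score=10.0, n_molecules=10_000, lam=1.0)
halved = occupancy(pwm_score=10.0, n_molecules=5_000, lam=1.0)
knockdown = occupancy(pwm_score=10.0, n_molecules=10, lam=1.0)
closed = occupancy(pwm_score=10.0, n_molecules=10_000, lam=1.0, accessibility=0.0)

print(f"reference N:        {reference:.4f}")
print(f"N halved:           {halved:.4f}")
print(f"N reduced 1000x:    {knockdown:.4f}")
print(f"closed chromatin:   {closed:.4f}")
```

Under these assumptions, halving N barely changes the occupancy of a saturated site, a 1000-fold reduction lowers it visibly, and closing the chromatin removes binding altogether. This is the qualitative pattern discussed above.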
Finally, we also investigated the capacity of our model to differentiate between TFs that can bind only in open chromatin and TFs that can also bind in less open chromatin. For that, we looked at three Hox TFs displaying different preferences for DNA accessibility. Our results showed that, while Ubx displays a strong sensitivity to open chromatin and binds in the top 1% most accessible sites, the binding of Dfd and Abd-B is less influenced by DNA accessibility (with Abd-B and Dfd binding in the top 5% and 20% most accessible regions, respectively); see Figure 7. Interestingly, our statistical thermodynamics model predicts the binding profile of Ubx better (AUC of 0.862) than those of Abd-B and Dfd (AUC of 0.79 and 0.82, respectively).
Hox TFs are known for having a similar motif, but displaying differences in their binding profiles [69, 70, 71]. It was hypothesised that binding cooperativity, coupled with protein sequence changes, could explain the difference in binding profiles [72, 73, 74]. Here, we showed that DNA accessibility could also be responsible for the difference in binding profiles of Hox TFs (see Figure 7). Interestingly, our results support a model where Hox TFs would be able to bind to regions of DNA showing different levels of accessibility, and DNA accessibility would be sufficient to explain these differences in the binding profiles of Hox TFs. Nevertheless, we also observed a poorer quality in modelling the binding profiles of TFs that can bind in dense chromatin (e.g., Abd-B or Dfd), which suggests that cooperative binding would be required to explain their binding. Because our model does not include cooperativity, the predictions for these TFs would not be as accurate as in the case of TFs that preferentially bind to open chromatin.
Background noise and experimental artefacts remain a challenge in TF binding predictions
We found that many ChIP datasets suffer from significant background noise that impedes our ability to accurately assess the goodness of fit of the model. This ability is a cornerstone of our understanding of the biological implications arising from our findings. Despite our approaches to reduce background noise, it seems that ChIP-seq data will always suffer from non-specific DNA pull-down [75]. More complex methods of signal filtering are available, and applying these methods could potentially lead to a significant reduction in the noise of the ChIP-seq signal.
Another possibility is that the noise in the ChIP signal could be the result of non-specific binding of TFs to DNA followed by a one-dimensional random walk along the genome [76, 77, 78, 79, 80]. For the purpose of our analysis, we selected only sites of DNase hypersensitivity and considered these regions as strictly open. However, regions that were marked as closed chromatin between clusters of open regions might in fact be either partially open or dynamically opened, thus leaving time and space for 1D molecular walks along the genome [81]. Distinguishing real TF binding from experimental artefacts remains extremely challenging.
We showed that the choice of a goodness of fit method is context dependent. Interestingly, similarity methods (such as correlation, F-score or AUC) tended to call peak locations correctly but greatly underestimated the enrichment at the peak (see Figure 2). This behaviour results from the fact that these methods are highly penalised by false positive hits. They show a wide range of optimal values for the number of bound molecules, but they tended to prefer low values for the scaling factor. This scaling factor can be described as how well a TF discriminates between binding sites; i.e., how much a TF will prefer a strong binding site over a weaker one. High values for the scaling factor translate to a poorer ability of the TF to discriminate between high and low affinity sites, which leads both to a higher number of false positive peaks and to the model picking up smaller peaks. Smaller peaks could be caused by lower affinity binding or suboptimal binding sites along the genome, as described by [82], but these binding sites would not be picked up by the similarity methods. The number of bound molecules, on the other hand, tends to affect the height of the peak (relative local enrichment). Similarity methods would avoid inflating these sites, as this would penalise their goodness of fit score more severely than it would for dissimilarity methods. Dissimilarity methods (such as MSE or geometric ratio) showed a much higher number of bound molecules and a high value for the scaling factor (see Figure 2).
It is interesting to see that each method is penalised by different aspects of the model. For these reasons, we believe that choosing the right method will depend on the question at hand. Similarity methods could be used to determine peak location, but, if the TF local enrichment is of interest, a dissimilarity metric would be more appropriate.
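The contrast between the two families of metrics can be reproduced on a toy profile. The sketch below is an illustration only: Pearson correlation stands in for the similarity methods and MSE for the dissimilarity methods, and the profiles and helper names (`pearson`, `mse`) are hypothetical rather than taken from ChIPanalyser.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two profiles (a similarity metric)."""
    return float(np.corrcoef(a, b)[0, 1])

def mse(a, b):
    """Mean squared error between two profiles (a dissimilarity metric)."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

# Toy ChIP-seq profile with a strong peak and a weaker peak.
chip = np.array([0.0, 0.0, 8.0, 0.0, 0.0, 4.0, 0.0, 0.0])

# Prediction A: peaks in the right place, but enrichment underestimated fourfold.
underestimated = 0.25 * chip

# Prediction B: correct enrichment at the true peaks, plus one false positive peak.
false_positive = chip.copy()
false_positive[3] = 4.0

for name, pred in [("underestimated", underestimated), ("false positive", false_positive)]:
    print(f"{name:15s} correlation={pearson(chip, pred):.3f}  MSE={mse(chip, pred):.3f}")
```

Correlation scores the underestimated profile as a perfect match and penalises the false positive, whereas MSE ranks the two predictions the other way round. This is consistent with using a similarity metric when peak location matters and a dissimilarity metric when local enrichment is of interest.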
FUNDING
This work was supported by the University of Essex and by the Wellcome Trust grant [202012/Z/16/Z].
Conflict of interest statement
None declared.
ACKNOWLEDGEMENTS
We thank Dr Rob White for sharing the Hox ChIP-seq and ATAC-seq data and for comments on this manuscript. We also thank Professor Sarah Bray and the Zabet lab for useful discussion and comments on the project and the manuscript. We would also like to thank Dr Gorrie-Stone for his comments and suggestions during the development of ChIPanalyser.
The analysis was performed on the HPC at the University of Essex and we would like to thank Stuart Newman for his support in using the cluster.
References
- [1].
- [2].
- [3].
- [4].
- [5].
- [6].
- [7].
- [8].
- [9].
- [10].
- [11].
- [12].
- [13].
- [14].
- [15].
- [16].
- [17].
- [18].
- [19].
- [20].
- [21].
- [22].
- [23].
- [24].
- [25].
- [26].
- [27].
- [28].
- [29].
- [30].
- [31].
- [32].
- [33].
- [34].
- [35].
- [36].
- [37].
- [38].
- [39].
- [40].
- [41].
- [42].
- [43].
- [44].
- [45].
- [46].
- [47].
- [48].
- [49].
- [50].
- [51].
- [52].
- [53].
- [54].
- [55].
- [56].
- [57].
- [58].
- [59].
- [60].
- [61].
- [62].
- [63].
- [64].
- [65].
- [66].
- [67].
- [68].
- [69].
- [70].
- [71].
- [72].
- [73].
- [74].
- [75].
- [76].
- [77].
- [78].
- [79].
- [80].
- [81].
- [82].