## ABSTRACT

Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) is a widely used approach to study protein-DNA interactions. To analyze ChIP-Seq data, practitioners are required to combine tools based on different statistical assumptions and dedicated to specific applications such as calling protein occupancy peaks or testing for differential occupancies. Here, we present GenoGAM (Genome-wide Generalized Additive Model), which brings the well-established and flexible generalized additive models framework to genomic applications using a data parallelism strategy. We model ChIP-Seq read count frequencies as products of smooth functions along chromosomes. Smoothing parameters are estimated from the data eliminating ad-hoc binning and windowing needed by current approaches. We derived a peak caller based on GenoGAM with performance matching state-of-the-art methods. Moreover, GenoGAM provides significance testing for differential occupancy with controlled type I error rate and increased sensitivity over existing methods. By analyzing a set of DNA methylation data, we further demonstrate the potential of GenoGAM as a generic analysis tool for genome-wide assays.

## INTRODUCTION

Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) is the reference method used for genome-wide quantification of protein-DNA interactions^{1}. It is used to study a wide range of fundamental genome biology processes covering transcription, replication, and maintenance. ChIP-Seq consists of cross-linking DNA with chromatin, followed by DNA fragmentation and immunoprecipitation of the protein of interest along with its bound DNA fragments. The DNA fragments are then released, amplified, and sequenced. ChIP-Seq has been applied for studying DNA-bound proteins of various functions and therefore with various patterns of distribution along the genome. These include transcription factors that are bound at discrete binding sites^{2,3}, histone modifications^{3,4} which are found at nucleosomes, or the transcription^{3} and replication machinery which are even more broadly distributed. Often, the quantities of interest are the occupancies relative to technical controls, such as the input (a sample that was not subject to the immunoprecipitation step), between genetic backgrounds, treatments, or combinations thereof.

Although ChIP-Seq is a very generic methodology to study protein-DNA interactions, statistical analysis methods have been so far dedicated to specific applications. Early work has focused on transcription factors with discrete binding sites, typically DNA motifs at promoters or transcriptional enhancers^{2,5}. ChIP-Seq read coverage then shows peaks localized at the binding sites. The aim of these statistical methods is to identify these peaks and their statistical significance, typically by controlling the false discovery rate. For example, MACS^{5} is a widely used^{6,7} peak caller that assumes a Poisson distribution for the count data and computes peak significance based on a combination of global and local rate. ZINBA^{8} combines a negative binomial mixture model for background and enriched regions with a zero-inflated component for regions with excessive zero counts. The specific calling of narrow and wide peaks was made possible by JAMM^{9}, which makes use of replicates, and is based on a mixture model of enriched and non-enriched regions.

Testing for differential overall occupancies at regions of interest across conditions is done by testing for differences in number of reads overlapping the region ^{10}. Complementary to testing for overall occupancies, MMdiff^{11} allows testing for differences in shapes in given regions. Lun et al.^{12} provide a framework to test differential occupancies between conditions across windows in given regions while properly controlling for false discovery rate. This approach allows both testing for differences in overall occupancies and in shapes.

Hence, practitioners rely on different statistical frameworks for peak calling and for testing differential occupancy. However, flexible handling of replicates and additional control factors is not always possible. Moreover, current methods rely on binning and sliding-window techniques, for which the window size is chosen subjectively rather than in a data-driven way. Another limitation is that the more general task of statistical inference of a genome-wide bias-corrected occupancy track is not addressed.

Here, we introduce GenoGAM (Genome-wide Generalized Additive Model), which provides a statistical framework to simultaneously address the above issues. Our model describes genome-wide occupancy by smooth functions, which facilitate downstream applications such as peak calling or differential binding analysis. GenoGAM normalizes for sequencing depth and can handle factorial experimental designs, including biological replicates and multiple controls. The amount of smoothing is estimated in an automatic, data-driven manner and thus avoids introducing subjectivity from the analyst. When analyzing differential binding in a factorial design, we obtain well-calibrated per-base-pair p-values. Application to datasets of human and yeast shows that GenoGAM is as performant as dedicated methods for peak calling and much more sensitive than state-of-the-art differential occupancy methods. By providing an approximation to a conventional generalized additive model (GAM^{13}) that allows a data parallelism implementation, GenoGAM scales linearly with the number of data points and is thus computationally amenable to whole-genome applications. Our method provides a framework that is applicable not only to ChIP-Seq data, but also to other next-generation sequencing data such as DNA methylation data (Figure 1a).

## RESULTS AND DISCUSSION

### A generalized additive model for ChIP-Seq data

We consider an experiment consisting of a set of ChIP-Seq samples. A data point is defined by a pair of a ChIP-Seq sample and a genomic position. We denote by *x*_{i} the genomic position of the *i*-th data point, by *j*_{i} its ChIP-Seq sample and by *y*_{i} ≥ 0 the number of fragments in sample *j*_{i} centered at position *x*_{i}. For single-end libraries, the fragment center is estimated by shifting the read end position by a constant (Methods). When reducing ChIP-Seq data to fragment centers rather than full base coverage, each fragment is counted only once. This reduces artificial correlation between adjacent nucleotides. We model the counts *y*_{i} using the following generalized additive model:

$$y_i \sim \mathrm{NB}(\mu_i, \theta) \tag{1}$$

$$\log(\mu_i) = o_i + \sum_{k=1}^{K} f_k(x_i)\, z_{j_i,k} \tag{2}$$

The counts *y*_{i} are assumed to follow a negative binomial distribution with means *μ*_{i} (equation 1) and a dispersion parameter *θ* that relates the variance to the mean such that Var(*y*_{i}) = *μ*_{i} + *μ*_{i}^{2}/*θ*. Consequently, the model accounts for overdispersion^{10}. The logarithm of the mean *μ*_{i} is the sum of an offset *o*_{i} and one or more smooth functions *ƒ*_{k} (equation 2). The offsets *o*_{i} are predefined data-point specific constants that account for sequencing depth variations (Methods). More elaborate usage could include position- and sample-specific copy number variations, or GC-biases. The indicator variable *z*_{*j*_{i},*k*} equals 1 if the smooth function *ƒ*_{k} contributes to the mean counts of sample *j*_{i} and 0 otherwise. As demonstrated in the Methods section, this formulation allows modeling IP versus input experiments as well as factorial experimental designs.

We modeled IP versus input experiments using GenoGAM with two smooth functions: *ƒ*_{input} that contributes to both input and IP samples, and *ƒ*_{protein} that only contributes to IP samples. More specifically, *ƒ*_{input} models local ChIP-Seq biases common to input and IP, whereas *ƒ*_{protein} models the protein log-occupancy up to one genome-wide scaling factor. Figure 1b shows the application of this model to one ChIP-Seq library for the *S. cerevisiae* general transcription factor TFIIB and its input control (Methods).
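
As a minimal sketch of how the indicator variables select smooth functions per sample, the following toy example evaluates the log-mean for one input and one IP sample. All numeric values are made up for illustration, and using one offset per sample (rather than per data point) is a simplifying assumption.

```python
import numpy as np

# Hypothetical smooth-function values at three genomic positions (log scale).
f_input = np.array([0.2, 0.5, 0.1])     # bias shared by input and IP
f_protein = np.array([0.0, 1.5, 0.3])   # log-occupancy, contributes to IP only

# Indicator matrix z[j, k]: rows are samples, columns are (input, protein).
z = np.array([[1, 0],    # input sample: only f_input contributes
              [1, 1]])   # IP sample: f_input and f_protein contribute

# One offset per sample (assumption); zero means no depth correction.
offsets = np.zeros(2)

smooths = np.vstack([f_input, f_protein])   # shape (K, positions)
log_mu = offsets[:, None] + z @ smooths     # shape (samples, positions)
mu = np.exp(log_mu)                         # expected counts
```

In this toy setting the ratio `mu[1] / mu[0]` equals `exp(f_protein)`, which is why *ƒ*_{protein} can be read as a log-occupancy relative to input.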

In GenoGAM, the smooth functions are represented by cubic spline curves, which are written as linear combinations of a set of regularly spaced B-spline basis functions *b*_{r}, i.e. *ƒ*_{k}(*x*) = Σ_{r} *β*_{r}^{(k)} *b*_{r}(*x*). We chose second order B-splines as basis functions, which are bell-shaped cubic polynomials over a finite support^{14}. To avoid overfitting, additional smoothing of the functions *ƒ*_{k} is carried out by penalization of the second order differences of the spline coefficients, which approximately penalizes second order derivatives of *ƒ*_{k}, an approach called P-splines (penalized B-splines^{15}). The optimization criterion for P-splines combines the negative binomial log-likelihood (depending on the response vector *y* and the vector *β* containing the coefficients of all smooth functions) with a penalty function that is weighted by the smoothing parameter λ:

$$\hat{\beta} = \underset{\beta}{\arg\max}\; \left\{ l_{\mathrm{NB}}(\beta; y, \theta) - \lambda\, \beta^{T} S \beta \right\} \tag{3}$$

where S is a symmetric positive semidefinite matrix that encodes the squared second order differences of the coefficients *β*^{15}. This regularization allows dense placements of the basis functions (between 20 and 50 bp), while relying on the smoothing parameter λ to protect against overfitting. Large values of λ yield smoother functions. A single smoothing parameter common to all smooth functions proved to be sufficient for our applications. For given λ and *θ*, model fitting was performed using penalized iteratively re-weighted least squares (Methods).
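
The second-order difference penalty can be illustrated with a short sketch. It builds the penalty matrix as S = DᵀD, where D is the second-order difference operator, and compares the penalty for a linear and an oscillating coefficient vector. The function name and the toy coefficient vectors are illustrative assumptions, not part of GenoGAM.

```python
import numpy as np

def second_order_penalty(n_coef):
    """Return S = D^T D, with D the second-order difference operator."""
    # Row i of D computes beta[i] - 2 * beta[i + 1] + beta[i + 2].
    D = np.diff(np.eye(n_coef), n=2, axis=0)
    return D.T @ D

S = second_order_penalty(6)

beta_line = np.arange(6, dtype=float)                    # coefficients on a straight line
beta_wiggly = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])   # oscillating coefficients

penalty_line = beta_line @ S @ beta_line        # linear trends are not penalized
penalty_wiggly = beta_wiggly @ S @ beta_wiggly  # oscillations are penalized
```

Because constant and linear coefficient sequences have zero second differences, S has two zero eigenvalues; it is positive semidefinite rather than positive definite.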

The penalized likelihood can also be interpreted in a Bayesian fashion^{16}, where a multivariate Gaussian prior is placed on the coefficient vector *β*. Large-sample approximations then yield a multivariate Gaussian posterior distribution for *β*, and, by the linearity of *ƒ*_{k}(*x*) in *β*, Gaussian posteriors for the point estimates *ƒ*_{k}(*x*). This allows for the construction of pointwise confidence bands^{16}. An example of the fitted smooth functions and their confidence bands for the yeast transcription factor TFIIB is shown in Figure 1b.

### Data-driven determination of the smoothing and dispersion parameters

To determine the optimal value for λ and *θ*, generalized cross-validation, which is based on an analytical leave-one-out large-sample approximation^{16}, yielded very wiggly fits indicative of over-fitting. We thus developed an empirical cross-validation scheme. To reduce computational time, cross-validation was performed on a subset of all data. To this purpose, we selected a sufficiently large set of distinct regions that are long enough to not suffer from border effects common to spline fitting. Using 40 or more distinct regions containing at least 60 basis functions gave satisfactory empirical results (Supplementary Table 1). Also, it was important to select regions relevant for the desired application. For peak calling purposes, regions were selected that had the most significant fold change of IP versus input read counts (Methods). In each region, 10-fold cross-validation was performed, where a tenth of the data points were removed, the model was fitted on the remaining data points, and the log-likelihood of the left-out data points was computed. Parameter combinations were scored for the total out-of-sample log-likelihood over all regions. Short range correlations are strong in ChIP-Seq data and are not fully controlled by replicates or input experiments. To avoid overfitting due to short range correlations, each cross-validation fold did not consist of randomly selected single base pairs but of short intervals. The length of these intervals was about a tenth the average fragment size in absence of replicates and twice the average fragment sizes when replicates were available (Methods). Investigation on grid values of *θ* and λ showed that the out-of-sample log-likelihood was typically unimodal. We therefore used Nelder-Mead optimization ^{17} to jointly fit the two parameters in a computationally faster way than grid search.

### Fitting a GAM genome-wide

Since the computation time of a GAM grows polynomially with the number of basis functions, fitting one model to a whole chromosome is unfeasible. Instead, we propose to fit separate GAMs on sequential overlapping intervals (or tiles, Fig. 2a). As overlap length increases, agreement of the fit at the midpoint of the overlapping region increases. A genome-wide fit is obtained by joining together tile fits at overlap midpoints (Fig. 2a). This approximation yields computation times that are linear in the number of basis functions at no practical precision cost (Fig. 2b). Furthermore, it allows for parallelization, with speed-ups being linear in the number of cores (Fig. 2c). This approximation parallelizes the computation over the data, which will allow future implementation of GenoGAM in map-reduce frameworks such as Spark^{18}.

### GenoGAM provides a competitive peak caller

Because analytical derivatives of P-splines are available, identifying peaks of the protein occupancy *ƒ*_{protein} is straightforward by extracting local maxima, where *ƒ*′_{protein}(*x*) = 0 and *ƒ*″_{protein}(*x*) < 0 (Methods, Supplementary Fig. S1). To assess statistical significance of the peak heights, we introduced an empirical z-score that contrasts the estimate of the log-occupancy *μ* at the peak to a robust estimate of background log-occupancy level *μ*_{0}, taking both background level variance *σ*_{0}^{2} and uncertainty of peak height *σ*^{2} into account (see Methods for their estimation):

$$z = \frac{\mu - \mu_0}{\sqrt{\sigma^2 + \sigma_0^2}} \tag{4}$$

A practical approach to model the null distribution of peak scores is to assume that false positive peaks arise from symmetric fluctuations of the background and thus distribute similarly to local minima, or peaks found when inverting the role of input and IP^{5}. We therefore estimated the false discovery rate using the z-score distributions of the local minima (Methods).
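
The peak extraction and scoring steps can be sketched as follows. This is a simplified illustration: it uses finite differences on a fitted track instead of the analytical spline derivatives, and it assumes the z-score is the difference between peak and background log-occupancy divided by the square root of the summed variances. All numeric values are made up.

```python
import numpy as np

def local_maxima(f):
    """Indices where the first difference switches from positive to non-positive."""
    d = np.diff(f)
    return np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1

def peak_z_score(mu, sigma2, mu0, sigma0_2):
    """Contrast peak height with background, scaled by the combined uncertainty."""
    return (mu - mu0) / np.sqrt(sigma2 + sigma0_2)

# Hypothetical fitted log-occupancy track at base-pair resolution.
f = np.array([0.0, 0.1, 0.9, 2.0, 0.8, 0.1, 0.0, 0.3, 0.1])

peaks = local_maxima(f)
# Assumed peak-height variance 0.04, background mean 0.1, background variance 0.05.
z = peak_z_score(f[peaks], sigma2=0.04, mu0=0.1, sigma0_2=0.05)
```

Running the same functions on the negated track gives the valleys that serve as an empirical null for the peak scores.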

We first compared the performance of GenoGAM, MACS^{5}, JAMM^{9} and ZINBA^{8} in identifying binding sites of TFIIB. For about 20% of yeast promoters, recruitment of TFIIB is triggered by the well-characterized DNA element TATA-box, providing at these promoters a ground truth for a TFIIB occupancy peak^{19}. We mapped 1,105 TATA-boxes genome-wide by regular expression of a consensus motif (Methods) and considered 1 kb regions centered on TATA-boxes for benchmarking. In these regions, significant peaks (FDR < 0.1) from GenoGAM were substantially closer to TATA-boxes than those of alternative methods (median absolute distance 58 bp, third quartile 144 bp for GenoGAM versus 152 and 247 bp for MACS, 82 and 174 bp for JAMM, and 155 and 237 bp for ZINBA, respectively Fig. 3a).

Moreover, the proportion of peaks within 30 bp of a TATA-box center was twice as high as for any other method independently of the number of reported peaks (Fig. 3b), showing that the improvement was robust to the score threshold. We performed a similar benchmark (Methods) on the human chromosome 22 for 6 transcription factors of the ENCODE project^{6} selected to be representative of accuracies in predicting ChIP-Seq peak positions from sequence motifs^{20} (CEBPB, CTCF, MAX, USF1, PAX5, and YY1). On these data, GenoGAM performance was comparable to the other methods (95% bootstrap confidence intervals, Fig. 3c for significant peaks, and Supplementary Fig. S2 for distance distributions, Supplementary Fig. S3 for all cutoffs). Hence, although GenoGAM is a general framework for ChIP-Seq analysis, it nonetheless provides a peak caller that is at least as performant as dedicated tools.

We next investigated the reason for the drastic differences observed in the yeast TFIIB dataset between GenoGAM and the other methods. The TATA-box region of IDH2 illustrates the issue (Fig. 4a). The peaks reported by GenoGAM are positions with maximal a posteriori estimate of IP over input fold-changes. In contrast, MACS and JAMM report positions with maximal statistical significance^{5,9}. Because statistical significance increases with both effect size and sample size, this leads to peak calls biased toward positions with high total counts in IP and input (Fig. 4a). Across all 644 TATA-box regions at which both GenoGAM and MACS identify a peak, total counts within 50 bp of peak positions were higher for MACS, but count ratios were higher for GenoGAM (Fig. 4b), generalizing the observations made for IDH2. The yeast TFIIB dataset was sequenced at a much higher coverage than the ENCODE dataset (0.9 unique fragments per base on average versus less than 0.03 unique fragments per base on average), leading to stronger discrepancies between significance and robust fold-changes. As sequencing depth is expected to increase in the near future, we anticipate that robust fold-change estimates as provided by GenoGAM will be a more sustainable criterion than mere significance for calling peak positions.

### Higher sensitivity in testing for differential occupancy

To assess the performance of GenoGAM for calling differential occupancy, we re-analyzed histone H3 Lysine 4 trimethylation (H3K4me3) ChIP-Seq data of a study^{21} comparing wild type yeast versus a mutant with a truncated form of Set1, the H3 Lysine 4 methylase. H3K4me3 is a hallmark of promoters of actively transcribed genes. Thornton and colleagues^{21} have reported genome-wide redistribution in the truncated Set1 mutant of H3K4me3, which is depleted at the promoter and enriched in the gene body. We modeled this data with GenoGAM using one smooth function *ƒ*_{WT} for the wild type reference occupancy, and one further smooth function *ƒ*_{mutant/WT} for the differential effect. The offsets were computed to control for variations in sequencing depth between replicates and overall genome-wide H3K4me3 level (Methods). This yielded base-level log-ratio estimates and their 95% confidence bands genome-wide (Methods, Figure 5a for data and fit at the gene *YNL176C* consistent with the report of reduced binding at promoter regions).

As mentioned above, the confidence bands are Bayesian credible intervals. Previous studies based on simulated data showed that these confidence bands have close to nominal coverage probabilities and can, in practice, be used in place of frequentist confidence intervals^{22}. We estimated base-level p-values using the point-wise estimates and standard deviations (Methods). To empirically verify that the p-values were at least conservative, we created a negative control dataset by per-base-pair independent permutation of the counts between the four samples. The offsets were set to 0 and the smoothing and dispersion parameters were estimated again. This non-parametric permutation scheme makes fewer assumptions than previous simulation studies^{22}. Nonetheless, per-base-pair p-values in this negative control experiment were slightly overestimated, i.e. conservative (Figure 5b). These results show that GenoGAM can be used to identify individual positions of significant differential occupancies with controlled type I error. Here, correction for multiple testing can be done either using the Benjamini and Hochberg procedure^{23} or using procedures that exploit dependencies between adjacent positions^{24}.
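
As a hedged illustration of how a point-wise estimate and its standard deviation can be turned into a p-value, the sketch below uses a two-sided test under a Gaussian approximation. The exact test used by GenoGAM is described in the Methods and may differ; the function name and example values are assumptions.

```python
import math

def pointwise_p_value(estimate, std_error):
    """Two-sided p-value for H0: estimate == 0 under a Gaussian approximation."""
    z = abs(estimate) / std_error
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)) for the standard normal CDF Phi.
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical log-ratio estimates and standard deviations at three positions.
log_ratios = [0.05, -1.20, 0.60]
std_errors = [0.30, 0.25, 0.30]
p_values = [pointwise_p_value(e, s) for e, s in zip(log_ratios, std_errors)]
```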

Complementary to de novo identification, predefined regions, such as genes, can be tested for differential occupancies. To test for differences at any position in a region using GenoGAM, we propose to apply Hochberg’s procedure to correct the pointwise p-values for multiple testing, and to report the smallest of these corrected p-values (Methods). To validate this approach, we compared GenoGAM against the following three approaches: csaw^{12}, which also tests for differences at any position in the regions, DESeq^{10}, which tests for differences in the overall occupancies, and MMdiff^{11}, which tests for differences of distribution within the regions but not overall occupancy (Methods). All investigated methods empirically controlled type I error on the permuted dataset at the 5% nominal level (Supplementary Fig. S4). On the original dataset, the fewest significant genes (FDR < 0.1) were identified by DESeq (735) and MMdiff (5). The csaw algorithm gave up to 863 significant genes but the number of identified genes depended strongly on the choice of the window size (Figure 5c). Of all methods, GenoGAM was the most sensitive, reporting 4,409 significant genes. Up to 861 genes, these genes were a superset of the genes reported by csaw, indicating that GenoGAM captured the same signal but with a higher sensitivity (Figure 5d). The genes reported only by GenoGAM showed a differential occupancy pattern similar to, yet weaker than, that of the genes common to csaw and GenoGAM, with depletion in the promoter and enrichment in the gene body (Figure 5d), indicating that GenoGAM captured true biological signal.
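
The region-level test described above can be sketched with a plain implementation of Hochberg's step-up adjustment followed by taking the minimum. This is an illustrative reimplementation, not the GenoGAM code; function names and the example p-values are assumptions.

```python
import numpy as np

def hochberg_adjust(p_values):
    """Hochberg step-up adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)[::-1]        # largest p-value first
    factors = np.arange(1, m + 1)      # multiplier 1 for the largest, m for the smallest
    adjusted_sorted = np.minimum.accumulate(np.minimum(1.0, factors * p[order]))
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

def region_p_value(pointwise_p):
    """Smallest Hochberg-adjusted p-value over all positions of a region."""
    return float(hochberg_adjust(pointwise_p).min())

# Hypothetical point-wise p-values within one gene.
p = [0.01, 0.04, 0.03, 0.20]
adjusted = hochberg_adjust(p)
gene_p = region_p_value(p)
```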

### Application to DNA methylation data

Generalized additive models are based on the generalized linear modeling framework and thus allow any distribution of the exponential family for the response. Therefore, GenoGAM can be also used to model continuous responses, for instance using the Gaussian distribution, and proportions using the Binomial distribution. For ChIP-Seq data, a log-linear predictor-response relationship of the form (2) is justified by the fact that effects on the mean are typically multiplicative. However, other monotonic link functions could also be used. Moreover, quasi-likelihood approaches are supported, allowing for the specification of flexible mean-variance relationships^{25}.

To test the flexibility of GenoGAM, we conducted a proof-of-principle study on modeling bisulfite sequencing of bulk mouse embryonic stem cells grown in serum^{26}. Bisulfite sequencing quantifies methylation rate by converting cytosine residues to uracil, leaving 5-methylcytosine residues unaffected. At each cytosine, the data consisted of the number *n*_{i} of fragments overlapping the cytosine and the number *y*_{i} of these fragments for which the cytosine was not converted to uracil. The quantity of interest was the methylation rate, i.e. the expectation of the ratio *y*_{i}/*n*_{i}. In the original publication, single nucleotide position methylation rates were estimated using a sliding window approach with an ad-hoc choice of window size of 3 kb computed in steps of 600 bp. Figure 6 reproduces an original figure showing the fit in a 120 kb section of chromosome 6. We modeled this 120 kb section with GenoGAM using a quasi-binomial model, where the response was the number of successes *y*_{i} out of *n*_{i} trials, the log-odds was modeled as a smooth function of the genomic position, and the variance was equal to a dispersion parameter times the variance of the binomial distribution. Smoothing and dispersion parameters were determined by cross-validation (Methods). The GenoGAM fit was consistent with the original publication^{26}, but did not rely on manually set window sizes and provided confidence bands (Figure 6). As expected, wider confidence bands were obtained in regions of sparse data and tighter bands in regions with a lot of data (Figure 6).
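
Written out, the quasi-binomial model described in this paragraph takes the following form, where the symbols *p*_{i} (methylation rate at cytosine *i*) and *φ* (dispersion parameter) are notation introduced here for illustration:

$$
\begin{aligned}
\mathrm{E}[y_i] &= n_i\, p_i, \\
\log\!\left(\frac{p_i}{1 - p_i}\right) &= f(x_i), \\
\mathrm{Var}(y_i) &= \phi\, n_i\, p_i\,(1 - p_i).
\end{aligned}
$$

Setting *φ* = 1 recovers the ordinary binomial variance; values above 1 allow for overdispersion.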

## CONCLUSION

We have introduced a generic framework based on generalized additive models to model ChIP-Seq data. We have made this possible by providing a scalable algorithm that can fit GAMs to very long longitudinal data such as whole chromosomes at base-pair resolution. Scaling was made possible by parallelization over the data and allowing approximations rather than exact computation of the fit ^{27}.

Smoothing and dispersion parameters were obtained by cross-validation, i.e. they were fitted for the accuracy in predicting unseen data. This criterion turned out to provide useful values of smoothing and dispersion for inference, since we obtain signal peaks close to actual binding sites of transcription factors when these are known, at least as close as dedicated tools. Moreover, this criterion also led to reasonable uncertainty estimates since confidence bands of the fits were found to be only slightly conservative.

The utilization of genome-wide GAMs comes with a number of advantages: First, we flexibly model factorial designs, as well as replicates with different sequencing depths using size factors as offsets. Second, applying GAMs yields confidence bands as a measure of local uncertainty for the estimated rates. We showed how these can be the basis to compute point-wise and region-wise p-values. Third, GAMs output analytically differentiable smooth functions, allowing flexible downstream analysis. We showed how peak calling can be elegantly handled by making use of the first and second derivatives. Fourth, various link functions and distributions can be used, providing the possibility to model a wide range of genomic data beyond ChIP-Seq, as we illustrated with a first application on DNA methylation. In conclusion, we foresee GenoGAM as a generic method for the analysis of genome-wide assays.

## METHODS

### Preprocessing

Fragments were centered, reducing each fragment to a single data point. For single-end data, the fragment length *d* was estimated using the Bioconductor package chipseq and its coverage method, which defines *d* as the optimal shift for which the number of bases covered by any read is minimized. The fragment center was then taken as the start of the read shifted by *d*/2 downstream.
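
A minimal sketch of the centering step for single-end reads is given below. It assumes that the shift is half the estimated fragment length, that `read_start` is the 5' position of the read, and that "downstream" means increasing coordinates on the plus strand and decreasing coordinates on the minus strand. The function name and example reads are hypothetical.

```python
from collections import Counter

def fragment_center(read_start, strand, fragment_len):
    """Shift the 5' start of a read by half the fragment length, downstream."""
    shift = fragment_len // 2
    return read_start + shift if strand == "+" else read_start - shift

# Hypothetical single-end reads as (5' start, strand), fragment length 200 bp.
reads = [(1000, "+"), (1200, "-"), (1001, "+"), (1000, "+")]
centers = [fragment_center(start, strand, 200) for start, strand in reads]

# Each fragment contributes exactly one count, at its center position.
counts = Counter(centers)
```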

### Sequencing depth variations

Variation in sequencing depth was controlled using size factors computed by DESeq2^{28} (version 1.10.0 here and after). This method robustly estimates fold-changes in overall sequencing depth by comparing read counts of predefined regions. The selection criteria for these regions were application-specific.

For peak calling applications, the selected regions were the 1,000 tiles with the smallest p-values according to a DESeq2 test for enrichment of IP over input performed on total read counts per tile. This selected the tiles most likely to contain peaks. For differential binding applications, all tiles were considered.
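
To make the notion of a size factor concrete, the sketch below reimplements the median-of-ratios estimator on which DESeq-style size factors are based. It is an illustration in plain NumPy, not a call into DESeq2, and the count matrix is made up.

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors for a regions x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Regions with a zero count in any sample have no finite log geometric mean.
    keep = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[keep])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Per sample: median over regions of the log-ratio to the geometric mean.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Hypothetical read counts per region; sample 2 is sequenced twice as deeply.
counts = np.array([[100, 200],
                   [50, 100],
                   [30, 60],
                   [0, 5]])
s = size_factors(counts)
offsets = np.log(s)   # log-size factors can serve as per-sample offsets
```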

### Model fitting

#### Model fitting given λ and θ

Each chromosome was partitioned into equally-sized intervals called chunks. Tiles were defined as chunks extended on either side by equally-sized overhangs. The generalized additive model was fitted on each tile separately using the gam function of the R package mgcv. Point estimates at each base pair of the smooth functions and their standard errors were extracted with the predict function on the fitted object setting “type” parameter to “iterms”. The tile fits were then restricted to their chunk to define the chromosome-wide fit.
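
The chunk and tile bookkeeping can be sketched as follows. Coordinates are 0-based and half-open, the last chunk is allowed to be shorter than the others, and the per-tile "fit" is replaced by a stand-in that returns the covered positions. All of these are simplifying assumptions for illustration.

```python
def make_tiles(chrom_len, chunk_size, overhang):
    """Return (chunk_start, chunk_end, tile_start, tile_end) for each chunk."""
    tiles = []
    for start in range(0, chrom_len, chunk_size):
        end = min(start + chunk_size, chrom_len)
        tile_start = max(0, start - overhang)
        tile_end = min(chrom_len, end + overhang)
        tiles.append((start, end, tile_start, tile_end))
    return tiles

def join_tile_fits(tiles, tile_fits):
    """Restrict each tile fit to its chunk and concatenate the pieces."""
    joined = []
    for (c_start, c_end, t_start, _), fit in zip(tiles, tile_fits):
        joined.extend(fit[c_start - t_start:c_end - t_start])
    return joined

tiles = make_tiles(chrom_len=25, chunk_size=10, overhang=3)
# Stand-in for model fitting: each tile "fit" is just the positions it covers.
tile_fits = [list(range(t_start, t_end)) for _, _, t_start, t_end in tiles]
joined = join_tile_fits(tiles, tile_fits)
```

Because tiles are independent of each other, the fitting step inside the loop is the part that can be distributed over cores.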

#### Fitting of λ and θ

The parameters λ and θ were the same for all tiles and were estimated using 10-fold cross-validation on a subset of all tiles. The selection of relevant tiles for cross-validation was application-specific as outlined in the respective sections below. To avoid overfitting due to short range correlation, each cross-validation fold did not consist of randomly selected single base pairs but of short intervals. When replicate samples were available (that is, for all datasets except TFIIB), intervals could be longer because the model can predict left-out data from the respective replicates; the interval length was set to twice the estimated fragment length. In the absence of replicates (TFIIB dataset), the interval length was set to 20 bp (approximately a tenth of the fragment length). For a given pair of values for *λ* and *θ*, the score function was defined as the sum of out-of-sample log-likelihood over all cross-validation folds and all tiles, restricted to the data points within chunks so as not to depend on poor fitting in overhangs. The parameters *λ* and *θ* were estimated by gradient-free numerical optimization of the score function using the Nelder-Mead algorithm (R function *optim*).
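
The interval-based fold assignment can be sketched as below: whole intervals of consecutive positions, rather than single base pairs, are assigned to folds. The function name, the balanced random assignment, and the fixed seed are assumptions made for illustration.

```python
import numpy as np

def interval_folds(n_positions, interval_len, n_folds=10, seed=0):
    """Assign each position a fold label, constant within consecutive intervals."""
    n_intervals = -(-n_positions // interval_len)   # ceiling division
    rng = np.random.default_rng(seed)
    # Balanced fold labels, one per interval, in random order.
    labels = rng.permutation(np.arange(n_intervals) % n_folds)
    return np.repeat(labels, interval_len)[:n_positions]

# Hypothetical tile of 1,000 bp with 20 bp intervals (no-replicate setting).
folds = interval_folds(n_positions=1000, interval_len=20)

# Positions held out in the first cross-validation fold.
held_out = np.where(folds == 0)[0]
```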

### Yeast TFIIB dataset

#### ChIP-Seq library preparation, sequencing and read alignment

ChIP-Seq for TFIIB was performed essentially as described previously^{29} with a few modifications. Briefly, 600 ml BY4741 *S. cerevisiae* culture with C-terminally TAP-tagged TFIIB (Open Biosystems) was used. Immunoprecipitation was performed with 75 *μ*l of IgG Sepharose™ 6 Fast Flow beads (GE Healthcare) for 3 hours at 4°C on a turning wheel. 30 *μ*l of Input sample was taken before immunoprecipitation and stored at 4°C. IP and Input samples were reverse cross-linked for 2 hours with Proteinase K at 65°C and purified using the Qiagen MinElute Kit. Samples were digested with 2.5 *μ*l RNase A/T1 Mix (2 mg/ml RNase A, 5000 U/ml RNase T1; Fermentas) at 37°C for 1 h, purified and eluted in 50 *μ*l H2O. ChIP-Seq libraries were prepared using the NEBNext library preparation kit following the manufacturer’s instructions using the complete 50 *μ*l as input. 2 *μ*l of 1.7 *μ*M adapters containing a GGAT barcode and 2 *μ*l of a 0.25 *μ*M adapter containing a CACT barcode were used for ligation with Input and IP samples, respectively. The final library was amplified for 22 cycles using Phusion Polymerase and purified using Agencourt Magnetic beads. 36 bp single-end sequencing was performed on an Illumina GAIIX sequencer at the LAFUGA core facility of the Gene Center, Munich. Single-end 36 base reads and 4 base reads of barcodes were obtained and processed using the Galaxy platform^{30}. Reads were demultiplexed, quality-trimmed (Fastq Quality Filter), and mapped with Bowtie 0.12.7^{31} to the SacCer2 genome assembly (Bowtie options: -q -p 4 -S --sam-nohead --phred33-quals).

#### GenoGAM model

This dataset consisted of two samples: one input and one IP without replicates. Hence there was no need for an offset. We used the following GenoGAM model:

$$\log(\mu_i) = f_{\mathrm{input}}(x_i) + z_{j_i,\mathrm{protein}}\, f_{\mathrm{protein}}(x_i)$$

where *z*_{*j*_{i},protein} = 1 whenever *j*_{i} is the index of an IP sample and *z*_{*j*_{i},protein} = 0 whenever *j*_{i} is the index of an input sample. Further parameter details are given in Supplementary Table S1.

#### TATA box mapping

Promoter TATA boxes were defined as instances of the motif TATAWAWR^{19} at most 200 bp 5’ and 50 bp 3’ of one of the 7,272 transcript 5’-ends reported by Xu et al.^{32}.
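
Matching the consensus motif by regular expression can be sketched as follows, using the IUPAC codes W = A or T and R = A or G. The sketch scans the forward strand only and does not apply the distance filter relative to transcript 5'-ends; the function name and test sequence are made up.

```python
import re

# IUPAC nucleotide codes needed for the TATAWAWR consensus.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "R": "[AG]"}

def find_motif(sequence, motif="TATAWAWR"):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    core = "".join(IUPAC[base] for base in motif)
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + core + "))")
    return [match.start() for match in pattern.finditer(sequence)]

hits = find_motif("GGTATAAAAGCCTATATATGCC")
```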

### ENCODE transcription factors

#### Data processing

Alignment files (BAM files, aligned for the human genome assembly hg19) for ChIP-Seq data for the transcription factors CEBPB, CTCF, MAX, USF1, PAX5, and YY1 were obtained from the ENCODE website www.encodeproject.org. All these datasets contained two biological replicates for the protein samples and at least one input sample. However, the library sizes of the input samples were so small that including them resulted in higher uncertainty about the peaks, for our approach and for alternative approaches. We therefore conducted the analyses without correction for input.

#### GenoGAM model

For each transcription factor the dataset was modeled separately. Each one consisted of IPs with replicates. The following GenoGAM model was used:

$$\log(\mu_i) = \log(s_{j_i}) + f_{\mathrm{protein}}(x_i)$$

where the offsets log(*s*_{*j*_{i}}) are log-size factors computed to control for sequencing depth variation between the replicates (see section “Sequencing depth variations”). Further parameter details are given in Supplementary Table S1.

#### Transcription factor motif mapping

Motif occurrences in the genome were determined by FIMO^{33} using the default threshold 10^{−4} with position weight matrices (PWMs) from the JASPAR 2014 database^{34} with the following IDs: CEBPB: MA0466.1, CTCF: MA0139.1, MAX: MA0058.1, PAX5: MA0014.2, USF1: MA0093.2, YY1: MA0095.2.

### Peak calling

#### GenoGAM-based peak calling and z-score

Values of first and second derivatives of fitted smooth functions were obtained by multiplying the estimated coefficients with the corresponding derivatives of the B-splines as obtained from the *spline.des* function of the R package splines. Local extrema (at base pair resolution) were identified as positions at which the sign of the first derivative differed from that of the preceding position. For the z-score (equation 4), *μ*_{0} is the global background mean and *σ*_{0}^{2} is the global background variance of *ƒ*(*x*). In order to account only for the background without potential peaks, *μ*_{0} was estimated as the *shorth* (midpoint of the shortest interval containing half of the data; Bioconductor genefilter package) of all fitted values *ƒ*(*x*_{i}), *i* = 1,…, *n*. The fitted values smaller than the shorth were mirrored on it, such that a symmetric density was created that excludes the values larger than the shorth, in particular those high values representing peaks. The variance of this newly created distribution was then estimated in a robust fashion by the *median absolute deviation* (MAD), giving *σ*_{0}^{2} (Supplementary Figure S1).
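
The background estimation can be sketched in a few lines. The shorth is implemented here as described in the text (midpoint of the shortest interval containing half of the data), which is not necessarily identical to the genefilter implementation, and the MAD is scaled by 1.4826 to be comparable to a Gaussian standard deviation, as is the default in R. The fitted values are made up.

```python
import numpy as np

def shorth(values):
    """Midpoint of the shortest interval containing half of the values."""
    x = np.sort(np.asarray(values, dtype=float))
    half = int(np.ceil(x.size / 2))
    widths = x[half - 1:] - x[:x.size - half + 1]
    i = int(np.argmin(widths))
    return 0.5 * (x[i] + x[i + half - 1])

def background_mean_sd(fitted):
    """Background mean (shorth) and robust SD from values mirrored at the shorth."""
    fitted = np.asarray(fitted, dtype=float)
    mu0 = shorth(fitted)
    lower = fitted[fitted <= mu0]
    mirrored = np.concatenate([lower, 2 * mu0 - lower])   # symmetric around mu0
    mad = np.median(np.abs(mirrored - np.median(mirrored)))
    return mu0, 1.4826 * mad

# Hypothetical fitted log-occupancies: background near 0.2, two peak values.
fitted = [0.0, 0.1, 0.2, 0.35, 0.5, 5.0, 6.0]
mu0, sd0 = background_mean_sd(fitted)
```

The two large values do not influence either estimate, which is the purpose of mirroring only the lower half.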

#### False Discovery Rate for GenoGAM peaks

To estimate false discovery rates (FDR), peaks were called on −*ƒ*_{protein}, i.e. on the valleys of the fitted function. Their z-scores were obtained by recomputing *μ*_{0} and *σ*_{0} and applying the same formula. The FDR for a given minimum z-score *z* was estimated from *P*_{z} and *V*_{z}, the sets of peaks and valleys, respectively, with a z-score greater than or equal to *z*.
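As an illustration of how valleys can serve as an empirical null, the following Python sketch counts peaks and valleys above a z-score cutoff. The exact estimator is not reproduced in this section, so the ratio |*V*_{z}| / |*P*_{z}| used here is an assumption, and the function name is hypothetical.

```python
def empirical_fdr(peak_z, valley_z, z_min):
    """Assumed estimator: number of valleys over number of peaks with z >= z_min."""
    n_peaks = sum(1 for z in peak_z if z >= z_min)
    n_valleys = sum(1 for z in valley_z if z >= z_min)
    if n_peaks == 0:
        return 0.0  # nothing is called at this cutoff, so there are no false calls
    return min(1.0, n_valleys / n_peaks)  # cap at 1 so the result is a valid rate
```

For example, with peak z-scores 5, 4, 3, 2, 1 and valley z-scores 2.5 and 1, a cutoff of 2 retains four peaks and one valley, giving an estimated FDR of 0.25 under this assumed estimator.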

#### MACS, JAMM, and ZINBA

Version 2 of the MACS software (MACS2) was run with default parameters and the additional flag *call-summits*. In the case of TFIIB, the *nomodel* parameter was used to avoid building the shifting model. This was necessary because the default values for *mfold* were too high, and reducing them resulted in worse performance than using no model.

JAMM was run with default values and the peak calling mode (*-m*) set to narrow, assuming a three-component mixture model for background, enriched regions, and tails of enriched regions. Although JAMM computes a score to rank peaks, it does not provide a method to define a threshold for a given FDR or significance level. Nevertheless, JAMM applies some filtering to the complete list of peaks to output a filtered list. Instead of using this filtered output directly, we used the complete peak list sorted by score and took the top *N* results, where *N* is the number of peaks in the filtered output. This improved the performance of JAMM in some cases and left it unchanged in others. For analyses in which a cutoff for JAMM was still needed, we used the same number of peaks that MACS reported.

For ZINBA, the mappability score was generated (*generateAlignability*) with the mappability files for 36 bp reads, taken from the ZINBA website https://code.google.com/p/zinba/. The average fragment length (*extension*) was set to 190 bp, the window size (*winSize*) to 250, and the offset (*offset*) to 125. The FDR threshold was set to 0.1 and the window gap to 0. Peaks were refined (default) and model selection was activated. The complete model was used (*selecttype = “complete”*), input was included as a covariate (*selectcovs = “input_count”*), and interactions were allowed. The chromosome used to build the model (*selectchr*) was selected randomly to be “chrXVI”. The parameter *method* was set to “mixture”.

### Differential binding

#### Data processing

Raw sequencing files (H3K4ME3_Fulllength_Set1_Rep1.fastq, H3K4ME3_Fulllength_Set1_Rep_2.fastq, H3K4ME3_aa762-1080_Set1_Rep1.fastq, and H3K4ME3_aa762-1080_Set1_Rep_2.fastq) were obtained from the Sequence Read Archive (SRA) repository (http://www.ncbi.nlm.nih.gov/sra). These were paired-end reads. Reads were aligned to the SacCer3 build of the *S. cerevisiae* genome with the STAR aligner^{35} (version 2.4.0, default parameters). Reads with ambiguous mapping were removed using samtools^{36} (version 1.2, option *-q 255*). Gene boundaries were obtained from the *S. cerevisiae* genome annotation R64.1.1, restricting to GFF file entries of type “gene”.

#### GenoGAM model

This dataset consisted of four samples: two biological replicate IPs for the wild type strain, and two biological replicate IPs for the mutant strain. We used the following GenoGAM model:
where *Z*_{j_i,mutant} = 1 if sample *j*_{i} is a mutant sample and 0 if it is a wild-type sample. The offsets log(*s*_{j_i}) are log-size factors computed to control for variation in sequencing depth and in overall H3K4me3 levels across all four samples (see section). Further parameter details are given in Supplementary Table S1.

#### Position-level significance testing

Null hypotheses of the form *H*_{0}: *ƒ*_{k}(*x*) = 0 for a smooth function *ƒ*_{k} at a given position *x* of interest were tested assuming an approximately normal distribution of the corresponding z-score, i.e.:

$$z_k(x) \sim \mathcal{N}(0, 1)$$

where

$$z_k(x) = \frac{\hat{f}_k(x)}{\hat{\sigma}_{f_k}(x)}$$

and where $\hat{f}_k(x)$ and $\hat{\sigma}_{f_k}(x)$ denote the point estimate and standard error of the smoothed value^{16}, as returned by the function *predict(…, type=”iterms”, se.fit=TRUE)* of the R package mgcv.
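A position-level test of this kind reduces to a standard normal tail probability. The Python sketch below is illustrative only: the function name is hypothetical, and a two-sided test is assumed because the text does not state the sidedness. The inputs stand in for the point estimate and standard error that mgcv would return.

```python
import math


def position_p_value(estimate, std_error):
    """Return the z-score and an (assumed) two-sided normal p-value for H0: f_k(x) = 0."""
    z = estimate / std_error
    # For Z ~ N(0, 1): P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p
```

Under this assumption, an estimate 1.96 standard errors away from zero yields a p-value of about 0.05.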

#### False discovery rate for predefined regions

Let *R*_{1},…, *R*_{p} be *p* regions of interest, where a region is defined as a set of genomic positions. Regions are typically, but not necessarily, intervals (e.g. genes or promoters). For instance, all exons of a gene could make up a single region. Regions can be defined a priori or from the data using independent filtering^{37}. For instance, when testing for significant differences between two conditions, regions can be selected for having a large total number of reads over the two conditions^{12}.

For *j* in 1,…, *p*, let *H*_{0}^{j} be the composite null hypothesis that the smooth function *ƒ*_{k} equals 0 at every position of the region *R*_{j}:

$$H_0^{j}: f_k(x) = 0 \quad \text{for all } x \in R_j$$

The False Discovery Rate was controlled using the following procedure ^{12}:

1. Position-level p-values at all region positions were computed using position-level significance testing as described above.

2. Within each region *R*_{j}, position-level p-values were corrected for multiple testing using the Hochberg family-wise error rate correction^{38}. The Hochberg correction was applied because position-level p-values of one smooth function are positively associated. The p-value for the region-level null hypothesis was then computed as the minimal family-wise error rate corrected position-level p-value. This step gives one p-value per region.

3. FDRs were computed using the Benjamini-Hochberg procedure^{23} applied to the region-level p-values.
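The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the study: the function names and the dictionary input format are hypothetical, and the adjustments follow the standard step-up definitions of the Hochberg and Benjamini-Hochberg procedures.

```python
def hochberg_adjust(p_values):
    """Hochberg step-up adjusted p-values (family-wise error rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, idx in enumerate(order):
        # The largest p-value is multiplied by 1, the next by 2, and so on.
        running_min = min(running_min, (rank_from_top + 1) * p_values[idx])
        adjusted[idx] = running_min
    return adjusted


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_top, idx in enumerate(order):
        rank = m - rank_from_top  # ascending rank; m for the largest p-value
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def region_fdr(position_p_values):
    """Map each region name to its FDR, given its position-level p-values."""
    names = list(position_p_values)
    # Step 2: one p-value per region, the smallest Hochberg-adjusted p-value.
    region_p = [min(hochberg_adjust(position_p_values[name])) for name in names]
    # Step 3: Benjamini-Hochberg across regions.
    return dict(zip(names, benjamini_hochberg(region_p)))
```

For instance, `region_fdr({"geneA": [0.001, 0.2, 0.5], "geneB": [0.4, 0.6]})` first reduces the two regions to the p-values 0.003 and 0.6, and then adjusts these two values across regions.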

#### Benchmarking

The R/Bioconductor packages DESeq2^{28}, MMDiff^{11}, and csaw^{12} were applied to the original count data and to the base-level permuted dataset, for all genes. The log-size factors were set to 0 for all methods when applied to the permuted datasets. DESeq2 was applied with default parameters. MMDiff was applied with a bin length of 50 bp, the DESeq method for the normalization factor, and the Maximum Mean Discrepancy (MMD) histogram distance. The csaw method was applied with a window size of 150 bp and otherwise default parameters. The window size was determined through a grid search (see Figure 5c), choosing the window size that yielded the most significant genes. Note that csaw uses a different procedure to estimate normalization factors than DESeq and MMDiff. We used the csaw default because it favored csaw by returning more significant genes.

### Methylation data

#### Data processing

We obtained the data of Smallwood et al.^{26} in text table format from the Gene Expression Omnibus (GEO) repository http://www.ncbi.nlm.nih.gov/gds. The data provided one record per CpG site, with the number of methylated and unmethylated fragments at the respective site. We used a Python script to aggregate these data into bins of 3,000 bp width placed every 600 bp, as was done in the original paper.
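Overlapping bins of this kind can be computed as in the following Python sketch. It is not the script used in the study: the function name, the 0-based coordinates, the half-open bin convention [start, start + width), and the tuple layout of the input are all assumptions made for illustration.

```python
def sliding_bins(sites, width=3000, step=600):
    """Aggregate per-CpG counts into overlapping bins.

    sites: iterable of (position, methylated, unmethylated) tuples.
    Returns sorted (start, end, methylated, unmethylated) tuples for every
    bin that contains at least one site.
    """
    if width % step != 0:
        raise ValueError("width must be a multiple of step")
    bins = {}
    for pos, meth, unmeth in sites:
        # Bin starts are multiples of step; a site belongs to every bin
        # whose start s satisfies s <= pos < s + width.
        last_start = (pos // step) * step
        first_start = max(0, last_start - width + step)
        for start in range(first_start, last_start + 1, step):
            m, u = bins.get(start, (0, 0))
            bins[start] = (m + meth, u + unmeth)
    return [(s, s + width, m, u) for s, (m, u) in sorted(bins.items())]
```

With the default settings, each CpG site away from the chromosome start contributes to five consecutive bins, so neighboring bins share most of their reads.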

#### GenoGAM model

To model *y*_{i}, the number of reads of methylated state, out of *n*_{i}, the total number of reads, we used the quasi-binomial model defined by:
where the scale parameter *θ* > 0 models dispersion. The model was applied to a single tile with a width of 120 kb, reproducing Figure 2a of Smallwood et al.^{26}. Further parameter details are given in Supplementary Table S1.

#### Accession code

ChIP-Seq data are available at Array Express under the accession number E-MTAB-4175. For review, user: Reviewer_E-MTAB-4175, password: NF3Qvgio

#### Code availability

Scripts used for this study are provided in Additional data file 2. An R package called GenoGAM has been submitted to Bioconductor^{39}. Refer to the Bioconductor web page at http://www.bioconductor.org for installation procedures.

## COMPETING INTERESTS

The authors declare that they have no competing interests.

## AUTHORS’ CONTRIBUTIONS

Conceived the project and supervised the work: JG AT. Developed the software and carried out the analysis: GS AE JG. Carried out the ChIP-Seq experiments for TFIIB on yeast: DS. Gave advice on statistics: MS. Wrote the manuscript: JG AE GS MS AT.

## ACKNOWLEDGEMENTS

We thank Ulrich Unnerstall and Michael Lidschreiber for fruitful discussions on data analysis, Martin Morgan and Hervé Pagès for support during the implementation of the GenoGAM package, Stefan Krebs for sequencing and raw data preprocessing, and Ulrich Mansmann and Patrick Cramer for institutional support. JG was supported by the Bavarian Research Center for Molecular Biosystems and by the Bundesministerium für Bildung und Forschung through the Juniorverbund in der Systemmedizin ‘mitOmics’ (FKZ 01ZX1405A). We acknowledge support from the European Commission through the Horizon 2020 project SOUND (GS, JG) and from the Graduate School for Quantitative Biosciences Munich (AE).