Abstract
We report a newly identified bias in CLIP data that results from the sequence specificity of the cleaving enzyme. This bias is inadvertently incorporated into standard peak-calling methods [1], which identify the most likely locations where proteins bind RNA. We further show that, in downstream analysis, this bias is incorporated into the models that the state-of-the-art GraphProt method infers to predict protein RNA-binding. We call for both experimental controls to measure enzyme specificities and algorithms to identify unbiased CLIP binding sites.
The peak-calling process in CLIP experiments derives peaks from raw sequencing reads [1]. Bound RNAs are cleaved by RNase T1, which cuts at accessible G’s; as a result, a majority of called peaks terminate at a G (Figure 1A). To demonstrate this ‘terminating G’ effect, we analyzed the CLIP data used in the GraphProt study [2] (see Supplementary Methods). For each set of peak sequences and its matching control sequences, we calculated the frequency of G’s at the last position. The frequency of G’s at the last position is much higher in the peak sequences than in the control sequences (Figure 1B). In the most extreme case, more than 90% of the last nucleotides in the peak sequences are G’s, as compared to 25% in the control sequences. In contrast, when we analyzed the nucleotide frequencies of the raw sequencing data, we did not observe a ‘terminating G’ effect (e.g., Figure 1C), suggesting that this bias is introduced during peak-calling. In addition to the ‘terminating G’ bias, we also observed a ‘G-depletion’ bias in the peaks, in concordance with Kishore et al. [3] (Supplementary Figure S1A). Complete results are in Supplementary Table S1.
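The last-position frequency calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the analysis code used in the study: the function name `last_base_frequencies` and the toy sequences are assumptions made up for this example.

```python
from collections import Counter


def last_base_frequencies(sequences):
    """Return the fraction of sequences that end in each nucleotide."""
    # Count the final character of every non-empty sequence.
    counts = Counter(seq[-1].upper() for seq in sequences if seq)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}


# Toy sequences invented for illustration (not real CLIP data).
peaks = ["ACUUG", "CCAUG", "UUACG", "GAUCA"]
controls = ["ACUUA", "CCAUG", "UUACC", "GAUCU"]

peak_g = last_base_frequencies(peaks).get("G", 0.0)
control_g = last_base_frequencies(controls).get("G", 0.0)
print(f"terminal G fraction: peaks={peak_g:.2f}, controls={control_g:.2f}")
```

Comparing the two fractions per experiment gives the kind of contrast summarized in Figure 1B; applying the same function to raw reads would correspond to the check behind Figure 1C.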
To investigate whether this bias is incorporated in computational models, we trained the state-of-the-art method for protein RNA-binding prediction, GraphProt [2], on its original data with three different flank settings (denoted by peaks ± length): peaks flanked by 150 bp, denoted peaks+150 (as used in the original GraphProt study); peaks alone, denoted peaks; and peaks without the last nucleotide, denoted peaks-1 (see Supplementary Methods for details and justification). We tested the models through 10-fold cross-validation, as in Maticzka et al. [2]. Models that had access to the terminating G’s achieved higher prediction accuracy, and the improvement was even more pronounced without the flanks (Figure 1D). The average AUC was 0.913 for peaks, as compared to 0.877 for peaks+150 (p-value = 0.00013, determined by the Wilcoxon rank-sum test, comparing results on 24 CLIP experiments). When we removed the last nucleotide from the peaks, the accuracy dropped significantly (Figure S1B); the average AUC for peaks-1 was 0.887, as opposed to 0.913 for peaks (p-value = 0.00087, using the same test here and henceforth). When we investigated the underlying causes of this discrepancy, we observed that many of the highly weighted GraphProt features encoded terminating G’s (Table 1). Remarkably, when we based our predictions solely on whether the sequence had a terminating G, we achieved higher AUCs than GraphProt for 5 (out of 24) experiments (Supplementary Figure S1C). Complete results are in Supplementary Table S2.
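The terminating-G-only predictor mentioned above amounts to scoring each sequence 1 if it ends in G and 0 otherwise, then computing the AUC of that score. The following is a minimal sketch under that assumption; the helper names `auc` and `terminal_g_score` and the toy sequences are hypothetical and are not taken from GraphProt or from the study's code.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def terminal_g_score(seq):
    """Score 1.0 if the sequence ends in G, else 0.0."""
    return 1.0 if seq and seq[-1].upper() == "G" else 0.0


# Toy bound (peak) and unbound (control) sequences, invented for illustration.
bound = ["ACUUG", "CCAUG", "UUACG", "GAUCA"]
unbound = ["ACUUA", "CCAUG", "UUACC", "GAUCU"]
seqs = bound + unbound
labels = [1] * len(bound) + [0] * len(unbound)

# Baseline on full peaks, and on peaks with the last nucleotide removed
# (analogous to the "peaks-1" setting).
baseline_auc = auc([terminal_g_score(s) for s in seqs], labels)
trimmed_auc = auc([terminal_g_score(s[:-1]) for s in seqs], labels)
print(baseline_auc, trimmed_auc)
```

In this toy data the baseline loses all discriminative power once the last nucleotide is trimmed, which mirrors, in exaggerated form, the drop from peaks to peaks-1 reported above.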
To resolve the effect of this bias on inferred binding models, we modified the GraphProt algorithm to exclude features that encode terminating nucleotides (Supplementary Methods). We ran the modified version on the different flank lengths, as before, resulting in six combinations of algorithm version and flank length. As expected, the performance of the modified version decreased significantly as compared to the original algorithm: the average AUCs for peaks+150, peaks, and peaks-1 were 0.870, 0.891, and 0.877, as compared to 0.877, 0.913, and 0.886 for the original feature set, respectively (p-value = 2.3 × 10⁻⁷) (Figure 1E). Surprisingly, the performance still decreased significantly when the last nucleotide was removed: the average AUC for the modified version was 0.891 on peaks versus 0.877 on peaks-1 (p-value = 0.002) (Figure 1F; Figure S1D for a comparison of peaks+150 to peaks), implying that the modified version still captures the terminating G bias, albeit less efficiently. To summarize, GraphProt’s performance decreased without the terminating features, yet the remaining features still captured some of this bias.
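The p-values reported here are from a Wilcoxon rank-sum test over per-experiment AUCs. As a rough illustration of that test, the sketch below computes a two-sided p-value using the normal approximation with tie correction. The text does not state whether an exact or approximate variant was used, so this choice is an assumption, as are the function name `rank_sum_p_value` and the AUC values, which are invented for the example.

```python
import math


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation,
    using average ranks for ties and a tie-corrected variance."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; assign their average.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n + 1) / 2.0
    var = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (w - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical per-experiment AUCs (not values from the study).
auc_original = [0.90, 0.91, 0.92, 0.93, 0.94, 0.95]
auc_modified = [0.80, 0.81, 0.82, 0.83, 0.84, 0.85]
p = rank_sum_p_value(auc_original, auc_modified)
print(f"p-value = {p:.4f}")
```

For real analyses, an established statistics library would be preferable to this hand-rolled version, which is meant only to make the computation concrete.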
Conclusions
Here we reported a newly identified bias in the CLIP analysis pipeline. The technological bias originates in enzyme specificity [3], although it is introduced into the data only during computational peak-calling [1]; this bias leads to prediction of ‘new’ binding sites that result directly from this artifact, as demonstrated in the cross-validation results [2]. While the GraphProt prediction algorithm performs worse when terminating features are excluded from its feature set, it still benefits from having the ‘terminating G’ in the peaks and may also benefit from adjacent positions that reflect enzyme specificity. This finding implies that the observed bias cannot be easily removed computationally; distinguishing between the true protein binding preferences and the enzyme specificity may be impossible when the peaks are determined by both. Knowing the enzyme specificity in advance may allow us to deconvolve the protein-binding signal from the enzyme specificity and thereby call unbiased peaks accurately.
Thus, to address the specific bias of ‘terminating G’s’ in CLIP data, we call for an appropriate experimental control. Previously reported controls accounted for RNA expression levels, but no control measured the cleaving preference of the enzyme [3]. The control experiment should measure the enzyme specificity in the absence of an RNA-binding protein. Such a control will allow the assignment of prior cleaving probabilities to genomic loci during peak-calling (as done in Uren et al. [1] for other covariates) and the identification of unbiased binding sites. Such controls will lead to better algorithms for predicting protein RNA-binding and thus to more accurate prediction of new binding sites.
Acknowledgments
This study was supported by NIH grant R01GM081871.