Abstract
BioFeatureFinder is a novel algorithm which allows analyses of many biological genomic landmarks (including alternatively spliced exons, DNA/RNA-binding protein binding sites, and gene/transcript functional elements, nucleotide content, conservation, k-mers, secondary structure) to identify distinguishing features. BFF uses a flexible underlying model that combines classical statistical tests with Big Data machine-learning strategies. The model is created using thousands of biological characteristics (features) that are used to build a feature map and interpret category labels in genomic ranges. Our results show that BFF is a reliable platform for analyzing large-scale datasets. We evaluated the RNA binding feature map of 110 eCLIP-seq datasets and were able to recover several well-known features from the literature for RNA-binding proteins; we were also able to uncover novel associations. BioFeatureFinder is available at https://github.com/kbmlab/BioFeatureFinder/.
Background
The emergence of high-throughput sequencing technologies has led to an enormous number of datasets available for researchers, and multiple types of analysis have been built on top of these technologies [1]. It is now possible to use strategies to identify protein binding sites (ex. ChIP [2]/CLIP-seq [3]), alternative splicing (AS) events [4], and differentially expressed genes [5], detect SNPs [6] and achieve a multitude of other applications [7,8], resulting in large sets of genomic coordinates (e.g., binding sites, AS exons, polymorphisms). Such results are particularly challenging to interpret from a biological perspective. Several approaches have been used to characterize genomic coordinates sets and identify the enriched characteristics (features) in these datasets, especially when using the results from ChIP-seq or CLIP-seq experiments [8–12]. However, several of the most commonly used tools in these analyses focus on a particular aspect of the target regions in their process, such as structural models or sequence motifs [9,11,13]. Although these tools provide valuable insight into which characteristics are enriched in the genomic regions associated with these datasets, there is a clear deficiency in the computational tools that can perform more comprehensive analyses and integrate multiple types of sources of variation.
In previous high-throughput studies, some individual features were revealed to be key contributors in large datasets: GC content [14–16], nucleotide composition [14], length [14,16], CpG islands [16,17], conservation [18], microRNA [19,20] and protein-binding [21] target sites, methylation sites [22], single nucleotide polymorphisms (SNPs) [23], microsatellite regions [24] and the aforementioned sequence motifs and structural characteristics of these regions. However, due to biological variability within the genome sequences [25–27], we must also consider that sets of genomic regions are composed of a heterogeneous population of sequences, each with a unique profile of characteristics. While not all these characteristics are enriched and/or important for creating a profile for the whole-region groups, it is possible that a combination of many factors is responsible for separating groups of genomic regions and/or determining the binding of a protein to that particular region.
We present BioFeatureFinder (BFF), a flexible and unbiased algorithm for the discovery of distinguishing biological features associated with groups of genomic regions. In addition, BFF can help discover how these characteristics interact with each other. This is useful for creating an accurate map of the features that are more important for explaining differences between the input genomic regions and the remaining regions of the genome. To achieve this, we applied machine-learning strategies that are already widely used in other transcriptomics, genomics and system biology studies [21,28–31]. Instead of analyzing the genomic regions as individual data points, we analyzed the cumulative distribution functions drawn from the population of regions for each of the features described above and then applied binary classification algorithms to identify which characteristics are more important for group separation. This strategy has already been applied in other studies [32] but is here applied for the first time in the context of classification of features associated with groups of genomic regions on this scale. Furthermore, we aimed to develop a flexible tool that can use data from multiple public databases, such as UCSC GenomeBroswer [33], Ensembl [34], GENCODE [35], ENCODE [36] and others, and is capable of performing in an unsupervised and unbiased way.
Results and Discussion
For our BFF algorithm, we define “biological features” as the set of characteristics that can be used to distinguish regions from other sections of the genome. These features can include but are not limited to nucleotide content, length, conservation, k-mer occurrence, presence of SNPs, protein-binding sites, microRNA target sites, methylation sites, microsatellite, CpG islands, repeating elements, and protein domains.
Our algorithm can analyze the distribution of values for each feature in a set of genomic regions of interest, compare this distribution to a randomized background to identify which features represent the most distinguishing characteristics associated with the input dataset, and then rank them by importance values. This tool can be used as an important information source for scientists, who can use the data provided to generate new and more accurate hypotheses or guide wet-lab experiments more efficiently. BFF can also be used in large-scale computational projects that can analyze hundreds of datasets with ease and produce consistent results.
For the first time, we apply Big Data strategies in an unbiased way, thereby effectively reducing observer bias, to take advantage of the large amounts of data produced by high-throughput experiments, such as CLIP/CHIP/RNA/DNA-seq, and data deposited in publicly available databases (e.g., UCSC GenomeBroswer [33], Ensembl [34], GENCODE [35], ENCODE [36]) to extract a set of significant information from genomic regions and uncover the latent relationships inside the datasets. First, we present the framework used by BFF in its analytic process, with an overview of the input data types, workflow and output. Second, as a control, we applied BFF to the RBP (RNA-binding protein) RBFOX2 eCLIP-seq (enhanced crosslink immunoprecipitation RNA-sequencing) data because this protein is widely studied, and its binding sites are well characterized in the literature [37–40]. Finally, to showcase potential applications of the algorithm, we analyzed 112 eCLIP datasets obtained from human cell lines that are available in the ENCODE database and identified the biological features associated with the binding sites of all RBPs and their respective importance scores.
The BioFeatureFinder workflow
BioFeatureFinder focuses on flexibility, consistency and scalability. It is python-based with scalable multi-thread capabilities, is memory-friendly and compatible with most commonly used UNIX-based systems (e.g., CentOS, Ubuntu, openSUSE and macOS) and can be used with a wide range of hardware, from notebooks to HPC clusters. In Figure 1, we show a schematic representation of the BFF workflow. The types of inputs required to use the algorithm are as follows: a set of BED coordinates with genomic regions of interest (e.g., CLIP/CHIP-seq binding sites, promoter regions for differentially expressed genes, splice sites for alternatively spliced exons/introns and others); and compatible fasta files with sequences (e.g., Reference genome transcriptome). However, for increased accuracy, it is also possible to use a GTF/GFF file with region annotations (exons, introns, CDS, UTR and others). These can be optionally used to increase the number of features that are analyzed using BED files with genomic regions of biological features (e.g., microRNA sites, methylation sites, CpG islands, protein binding sites, SNPs and mutations, repeating elements and multiple bigWig files with phastCons scores for multiple alignment).
The analytical process of the algorithm is divided into two sub-sections: Build datamatrix and Analyze features. Building the datamatrix starts with selecting an appropriate background to compare the input regions of interest, which are obtained using the shuffle function of bedtools. Although it is not required, the use of a reference annotation improves the algorithm’s accuracy by guiding the included/excluded background regions. The total number of background regions (B) is proportional to the number of regions in the input list (I) of bed coordinates, which can be represented by the following formula: B = I * N, where N is an integer variable that can be set as an option (default = 3; i.e., the number of background regions is 3 times the number of input regions). These two sets of regions are then used to produce a datamatrix. Each bed entry in the regions is converted into a line in the matrix, and each feature corresponds to a column. Every feature is represented as numeric value, which can be a continuous, discrete or Boolean variable. To obtain these values, BFF uses multiple freely available tools such as bedtools [41] for the nucleotide content and counting intersections with features in bed format, bigWigAverageOverBed [42] for extracting conservation scores, EMBOSS wordcount [43] for k-mer counting, Vienna’s RNAfold [44] for RNA secondary structure MFE (minimum free energy) values and QRGS Mapper [45] for G-quadruplex scoring. Designed with a modular concept, new functions and sources of data can be easily added by researchers to answer project-specific questions.
Once the matrix is created, the algorithm runs a two-step analysis to identify important features in the dataset. The first step is to analyze each feature in the matrix with a two-sample Kolmogorov-Smirnov test (KST), with implementation by SciPy [46], to compare the distribution of values of the regions of interest (group 1) to background regions (group 0). In addition to using a statistical tool to identify the significant features, it is possible to use KST as a tool for feature selection by extracting statistically significant features that are correlated with differences between the groups, a strategy that has been shown to improve the classification performance of high-dimensional data [47–49]. As an additional benefit, filtering the features by KST p-values reduces the size of the datamatrix used in the following classification step, which can be helpful for reducing both the computational time and the resources used in the analyses. The second step involves the use of a Stochastic GradientBoost Classifier (St-GBCLF) from Scikit-learn [50] that can naturally handle mixed datatypes, is fairly robust to outliers and possesses reliable predicative power. Additionally, this method has been shown to be preferable for high-dimensional two-class prediction [51,52]. Additionally, similar to other ensemble methods, St-GBCLF is less likely to suffer from overfitting [53]. This stage will use the feature values extracted from the matrix (which can be filtered, or not, by KST) for each group (0, background and 1, input) and calculate the feature importance, a score that measures how valuable each feature is in the decision-making process of the trees [54]. Higher importance values indicate that the feature is considered in key decisions and can thus be inferred to have more biological significance. To address the class imbalance problems inherent to these types of analysis [52,55], our algorithm draws a random sample from the background (group 0), which is the same size as the input dataset (group 1), thereby increasing the overall accuracy of the classifier. However, to address biological variability, our algorithm performs multiple classification runs, drawing a new sample of background regions each time. The final classification score is calculated as the average of the importance values obtained in each classification run.
Both classification and statistical results are compiled into a table, which allows easy interpretation. Additionally, graphical representations of each feature are output in both the cumulative distribution function (CDF) and kernel density estimation (KDE) plots. This allows the visualization of the distributions found in the input and background and leads to conclusions about how the distribution is shifted from the reference. Additionally, classifier importance and KS test values are output in bar charts. Finally, classifier performance is measured by several parameters: accuracy, sensitivity, sensibility, positive predictive value, negative predictive value, adjusted mutual information (aMI), mean squared error (MSE) value, receiver operating characteristic (ROC) and precision-recall (P-R) area under curve. These metrics are output in both table (with scores) and graphical (bar charts and curves) format. Together, these outputs can be used to explore the data and identify features that may be of significance in a biological context.
Analysis of RBFOX2 eCLIP dataset
To evaluate the performance of our algorithm, we analyzed the subset of RNAs binding regions bound by the RNA binding protein RBFOX2. We used public eCLIP experiments deposited in the ENCODE database. Within this set, we found 922 statistically significant biological features that were used in the classification step (Figure 2A). The classifier provided an overall average of accurate predictions of 9 out of 10 times. The scores were 91% mean accuracy score, 88% positive predictive value (P.P.V.), 93% negative predictive value (N.P.V.), 93% sensitivity, and 89% specificity. We also obtained average scores for adjusted mutual information (aMI) and mean squared error (MSE) of 0.56 and 0.09, respectively. Both the receiver operating characteristic (ROC) and Precision-Recall (P-R) area under curve (AUC) were measured at 0.97 (Figure 2A, Additional File 1). Among all of the statically significant features, our approach identified 16 features with a relative importance score of at least 10% (i.e., 1/10 of the importance score of the highest scoring feature, Figure 2B, Additional File 2), rediscovering known literature-reported features and novel features. As a known feature, we found conservation of the binding site and enrichment of the canonical GCAUG k-mer; and within novel features, we found higher GC-content in the binding sites, lower MFE for RNA secondary structure, a higher G-quadruplex score and overlap of binding sites with DDX6 and other RBPs.
We identified enrichment of the GCAUG 5-mer as a major feature that characterized the RBFOX2 eCLIP binding sites by both KS test (p-value < 0.001) and variable importance in classification. Our analysis indicated that 33% of the binding sites identified in the RBFOX2 eCLIP contained at least 1 repetition of the GCAUG motif, whereas only 6% randomized background regions exhibited at least 1 instance of this motif (Figure 3A). Additionally, we found significant enrichment of the UGCAUG 6-mer, which occurred in 22% of binding sites, in contrast to 2% of the background regions (Figure 3B). Both results are consistent with the RBFOX2 nucleotide sequence motif enrichment and occurrence in RNA binding sites [37,40], indicating that our algorithm successfully recovered known features.
Interestingly, we identified several major components associated with the RNA secondary structure of the RBFOX2 binding sites. GC content had the second highest importance value of all features, with RBFOX2 binding sites exhibiting a higher distribution of GC than randomized background regions and most of the binding sites having a range of 50% to 80% GC content in their sequences, whereas the background regions were more evenly distributed (between 20% to 60% -Figure 4A). It is known that RNA regions with higher GC content correlate to more stable secondary RNA structures [56], with alterations in splicing patterns through an effect on pre-mRNA secondary structure [57], a known mechanism for RBFOX2 splicing regulation [37]. Although the importance of RNA secondary structure as a guiding factor for RBP binding was shown before [13,58,59], our data pointed to RBFOX2 because we identified that the minimum free energy (MFE) for RNA folding is a major feature for distinguishing protein binding sites from a randomized background. We identified 70% of binding sites with an MFE lower than 0, indicating the possible existence of a localized secondary structure, while only 47% of background regions exhibited similar behavior (Figure 4B). Additionally, we also identified the presence of G-quadruplexes, a specific type of secondary structure, as enriched in RBFOX2 binding sites. Our analysis indicates that 53% of RBFOX2 binding sites had a positive score for their presence, in contrast to only 15% of background regions (Figure 4C).
This result is particularly interesting because although this feature was not previously associated with RBFOX2, it can be supported extensively by literature evidence. First, RBFOX2 has been described as a member of the RG/RGG family of RNA-binding proteins, with an RNA recognition region rich in arginine-glycine rich regions that is known as RGG-box [60]. Second, other proteins from this family were shown to bind to RNA G-quadruplex by their RG/RGG regions [61]. Third, the existence of G-quadruplexes in intronic regions can have impacts on alternative splicing regulation [62,63]. Taken together, these results indicate that secondary RNA structure may play a larger role in RBFOX2 targeting for binding sites than previously assumed, combined with the existence of the GCAUG motif for increased accuracy in target selection. This is further evidenced by the fact that 50% of the binding sites containing GCAUG were also positive for the presence of G-quadruplexes (Figure 4D).
Finally, the most important feature, an association previously unreported in the literature, represents the overlap of RBFOX2 binding sites with DDX6 binding sites, an RNA helicase. We identified 18% of RBFOX2 peaks with at least 1 nucleotide position in common with DDX6 binding sites, which is significantly higher than the value obtained for randomized background regions that scored almost 0% of overlap (Figure 5A). Although this association is novel, it is known that DDX6 homologs in S. cerevisiae interacts with EFTUD2 homolog, another RBP identified as an important feature for RBFOX2, which suggests that the interaction between these proteins is evolutionarily conserved. [64]. We also identified several other RPBs with significant overlap with RBFOX2, including known splicing regulators and/or components of the spliceosome such as HNRNPM, EFTUD2, PRPF8, QKI, HNRNPK and PCBP2 (Additional File 2).
Using data available from BioGrid 3.4 [65] and STRINGdb 10.5 [66], we found that these targets were associated with RBFOX2 through a curated protein-protein interaction network (Figure 5B), with RBFOX2 directly interacting with QKI [67–70] and HNRNPK [70]. The latter, in turn, has been shown to interact with EFTUD2 [71].
Taken together, our results indicate the existence of a combinatorial mechanism of both RNA structure and nucleotide sequence to direct binding specificity to RBFOX2. However, our analysis is limited to identifying the similar and diverging characteristics of their binding sites. For a deeper understanding of the relationship between RBFOX2 binding features, further experiments are required. They would contribute to the comprehension of target binding specificity and would shed light on novel biological functions for RBFOX2.
Identification of important features for 110 RNA-binding protein-binding sites from ENCODE
To showcase the potential applications of BioFeatureFinder in high-throughput studies, we applied our algorithm to the large dataset of 110 eCLIP-seq available at ENCODE. First, we identified the preferential binding regions for each RNA Binding Protein (RBP) in the dataset (Figure 6, Additional File 6), with our results indicating that 59% of the analyzed proteins had preferential binding to the intronic regions. The second most frequent region was 3’UTR (15%), followed by CDS (15%), 3’ splice site (7%), 5’ splice site (2%) and the 5’UTR (2%). Among the identified preferential regions for the RBPs, some were already known from the literature (such as U2AF1 [72], U2AF2 [72], SF3A3 [73], PRPF8 [74], FMR1 [75], PUM2 [76], TIA1 [77], TARDBP [78] and RBFOX2 [37]), which demonstrates that our algorithm correctly identified their binding region preferences. We used this information to generate the appropriate background for each RBP. Overall, our algorithm performed consistently, with an average accuracy of 0.9 and average ROC and Precision-Recall AUCs of 0.95. The largest variance was obtained for the aMI (adjusted mutual information) scores, with an average of 0.57 and standard deviation of 0.16 (Figure 6B, Additional File 2). We also observed a strong correlation (Pearson’s R² ≥ 0.95) between aMI (Adjusted Mutual Information, Figure 7A) and MSE (Mean Squared Error, Figure 7B) scores with overall classifier Accuracy, indicating that RBPs with higher aMI scores tended to reach a higher degree of resolution of the binding site features. This can be inferred to be a consequence of the binding characteristics of the RBPs. For example, TARDBP has a strict set of characteristics that guide their binding to specific targets, whereas other RBPs, such as SF3B1, appear to have a higher degree of flexibility in their binding target selection (Additional File 2).
Overall, we identified three major classes of features that were important for the determination of binding site selection for this group of RBPs: K-mer enrichment (motifs), secondary RNA structure and overlap with other RBPs (Figure 8A, Additional File 7). All 110 RBPs have at least one of these as an important feature for the classification of their binding site with 10% or more relative importance. Furthermore, we found that most of the RBPs (56%) have a combination of these three factors as important features for the determination of binding site specificity. We identified 107 (out of 110) RBPs with some degree of overlap with at least 1 other RBP, which reflects the characteristic of RBPs working in protein complexes to perform biological functions [79,80]. We recovered information for known protein complexes, such as FMR1-FXR1-FXR2 [68,81] (Figure 8B), identifying 68% of FXR1 binding sites with overlap with FXR2 binding sites and 58% with overlap with FMR1 binding sites. Interestingly, the reciprocal did not hold true, with only 19% of FMR1 and 22% of FXR2 binding sites having an overlap with FXR1 (see Additional File 4), which might reflect the molecular dynamics involved in the formation of the complex [81]. In addition, we identified overlaps in the binding sites of RBPs without any previous association reported, such as the AGGF1, which had 40% overlap with GTF2F1, (see Additional File 4). While this information may indicate that they only bind to the same targets in similar positions, it could also suggest the existence of some biological relationship between proteins that is yet to be uncovered.
Analysis of K-mer enrichment revealed that 74 RBPs had at least 1 K-mer with 10% or more relative importance (see Additional File 7); although this represents the majority of analyzed RBPs (66%), it also shows that a nucleotide sequence is not a requirement for directing the binding of all RBPs to their targets. We managed to recover several well-known examples from the literature, such as TARDBP’s GUGU repeats [82], which are present in 88% of binding sites (Figure 8C). Other known examples include QKI [83] (ACUAA in 57% and UAAC in 68% of binding sites), PUM2 [75] (UGUA in 73% of binding sites), PTBP1 [84] (UCUU, 80%), HNRNPC [85] (UUUU, 63%), HNRNPK [86] (CCCC, 87%), KHDRBS1 [87] (UAAA, 80%) and TIA1 [88] (UUUU, 42%). In addition, our algorithm identified motifs for other 47 RBPs, which were found in approximately 30% of binding sites and had an at least 15% difference compared to the background (Table 1, see Additional File 3).
Finally, our algorithm also identified 74 RBPs with secondary RNA structure (either by lower Minimum Free Energy, MFE, calculated by Vienna’s RNAfold, or by a higher G-quadruplex score, from QGRS Mapper) as an important feature for classification (see Additional File 7). As an example, the increased G-quadruplex score for EWSR1 binding sites, which is a known RBP that binds to these types of RNA structures [89], with 62% of the sites exhibiting a positive score, whereas only 16% of the background regions showed the same behavior (Figure 8D). Another RBP that we identified as a binding secondary RNA structure is XRN2, which had 86% of binding sites with an MFE score lower than 0, whereas only 51% of background regions had values lower than 0. This particular RBP is reported to bind R-loop structures formed by G-rich pause sites associated with transcription termination [90], in accordance with our findings for motif enrichment, as we found that XRN2 had 70% of its binding sites containing a GGGG 4-mer, while only 18% of its background regions had the same 4-mer (Table 1). Other known examples from the literature that we recovered include: FMR1, also known to bind to G-quadruplexes [61,91] and DDX3X, DDX6, DDX24 and DHX30, which are RNA-helicases. In addition, we also identified 34 RBPs with positive G-quadruplexes and 15 percentage points or more of difference compared to the background. Of these, 20 RBPs also exhibited an enrichment of GG repeats in their binding sites (Figure 9, Table 1, Additional File 5), which is a known characteristic for these structures [92]. They are: AARS, AKAP8L, CDC40, DDX3X, DDX42, DDX6, DGCR8, FASTKD2, FKBP4, GRSF1, HNRNP, NKRF, SLTM, SRSF1, SRSF9, SUPV3L1, TAF15, XRCC6, XRN2, and ZRANB2.
Conclusion
Our results show that BioFeatureFinder is an accurate, flexible and reliable analysis platform for large-scale datasets while simultaneously providing a method to control observer bias and uncover latent relationships in biological datasets. By considering each genomic landmark as a separate data point in a distribution, we developed a novel implementation. This method combines statistical analysis and Big Data machine-learning approaches to provide accurate representations of the differences in sets of genomic regions and identify the characteristics that contribute more to the separation of these groups.
As demonstrated by our analysis of the RBFOX2 dataset, our algorithm managed to recover multiple characteristics known from the literature, including nucleotide sequences for binding motifs, and infer protein-protein interaction from the overlaps between binding sites. In addition, we uncovered new associations that might link RBFOX2 to the targeting of specific RNA secondary structures to increase RBFOX2 binding specificity, a hypothesis that is strengthened by inferences from the literature from multiple sources.
Furthermore, our analysis of 110 RNA-binding proteins’ CLIP-seq data from ENCODE recovered several well-known features from the literature, including major characteristics that influence the targeting of these proteins. The results for this dataset indicate RNA-target selection by RNA-binding proteins as a multi-factorial mechanism, which demands the existence of both cis-and trans-regulatory factors to increase the RBP affinity to the target site. Among those features are important factors, such as the existence of a particular set of nucleotide sequences (binding motif); accessibility of the target site via RNA secondary structure; and the neighboring RBP context (i.e., other proteins binding to neighboring/same region), which all contribute to determining whether a particular RBP will bind to its target site. Additionally, our results suggest that the binding of RBPs to targets is heavily dependent on the cellular context, with some proteins relying on fewer features for directing their binding specificity (i.e., the presence of a sequence is sufficient for recognition by the RBP), while other proteins require a more complex targeting context, with multiple features involved in the binding of the RBP (i.e., requiring a specific sequence, nearby accessory proteins and a specific RNA structure). Together, our results not only deepen our knowledge of how these proteins select their targets in a broader scenario but also demonstrate how our approach can be applied to large-scale datasets from high-throughput experiments with a high degree of reproducibility.
Although the present study focused on the applications of BioFeatureFinder for RNA-binding proteins, our algorithm can be applied to any type of genomic landmark. Some examples of regions that could be analyzed using BFF include: splicing sites for alternatively spliced exons (or whole exons), target sites for microRNAs, binding sites for DNA-binding proteins (for example, ChIP-seq data), promoter regions from differentially expressed genes, microsatellite/genomic markers, SNPs, whole transcript regions (5’UTR, 3’UTR and CDS) and any type of dataset that can be converted into a BED format.
Materials and methods
Extraction of information about biological features and RNA-binding proteins binding sites
We downloaded tracks from biological features associated with genomic features from the UCSC Genome Browser [33] for the human genome hg19, downloading tracks for conservation scores (phastCon scores in bigWig format), benign and pathological CNVs, common and flagged SNPs, TS microRNA target sites, CpG islands, layered H3K4Me1/3 and H3K27Ac and microsatellites. Additionally, we obtained data for 112 RNA-binding proteins (RBP) available from ENCODE [36] eCLIP experiments (Additional File 8: Table S1), downloading the bed files containing the narrowPeaks obtained for hg19 (Additional File 8: Table S2). Additionally, our algorithm is integrated with BedTools [41] (intersect, getfasta and nuc functions), UCSC’s bigWigAverageOverBed [42], EMBOSS wordcount [43], Vienna’s RNAfold [44] and QRGS Mapper [45]. Unless otherwise stated, all tracks were either downloaded or converted into BED format. We used GENCODE’s [35] GRCh37.p13 as a reference genomic sequence, along with release 19 of the comprehensive annotation.
Preferential region identification and background selection
To identify preferential binding regions for each dataset analyzed, we separated the transcripts into 6 major regions: 5’UTR, 3’UTR, CDS, introns, 5’ splice sites and 3’ splice sites. We then used bedtools intersect to count how many RBP binding sites occurred in each region. These values were normalized by Z-score, and the highest scoring was selected as the preferential binding region. A randomized background was generated using bedtools shuffle (with -excl, -incl, -chrom and -noOverlapping options), using the GTF reference containing the preferential binding region (or regions) to guide the selection of regions, excluding overlaps with input regions (binding sites) and other randomized background regions. For RBPs with small differences in binding site z-scores (less than 10%) in their highest-scoring regions, we selected the 2 highest scoring regions and used both as references for background generation. For each RBP dataset, we generate a randomized background region with three times the number of input regions (binding sites).
Assembly of a data matrix with biological features
The genomic regions (binding sites and background) and their associated biological features are converted to a numerical matrix, where each line is one region and each column is a biological features associated with that position. To convert biological features in numerical data, we used several freely available software. For most features, we use BedTools intersect (-s and -c options) to count the number of occurrences of that feature in the corresponding region. To obtain nucleotide sequence information, we used a combination of BedTools getfasta and nucBed (both with -s option). For the conservation score we used the tool bigWigAverageOverBed to obtain the average conservation score of covered bases in the region. For k-mer analysis, we used EMBOSS wordcount to count the number of occurrences of each 4-mer, 5-mer and 6-mer in each region. For RNA structure analysis, we used both Vienna’s RNAfold to calculate the lowest possible MFE (with -g option) and QGRS Mapper to calculate the maximum non-overlapping G-quadruplex score for each region. All operations were performed while considering strandedness.
Group selection and statistical analysis of features
For analysis, the input and background regions are separated into groups (1 and 0, respectively) using the unique identifier created during the datamatrix assembly and stored as the “name” field in the generated BED file. For all features in the matrix, we performed a Kolmogorov-Smirnov test by comparing the cumulative distribution function of the input regions (group 1) to the background (group 0) to filter out non-significant features between the groups. Features with a q-value ≤ 0.05, as adjusted by Bonferroni, were selected forfurther analysis using the classification algorithm. This filtering aims to provide significantly different features between the groups and to minimize the noise introduced by non-significant features while simultaneously reducing the computational time required for the classification step and the overall required time for analysis.
Classifier and feature importance estimation
To evaluate the importance of each feature’s ability in separating the genomic region groups, we chose to use a stochastic gradient boost classifier python implementation from Scikit-learn [50]. The classifier was used with the following parameters: number of estimators = ‘1000’; learning rate = ‘0.01’; max depth = ‘8’; loss = ‘deviance’; max features = ‘sqrt’; minimum number of samples per leaf = ‘0.001’; minimum number of samples to split = ‘0.01’; random state = ‘1’ and subsample = ‘0.8’. Importance values for each feature are calculated at every run, with the final value representing the mean scores and their corresponding standard deviation. The same scoring strategy is employed for relative importance score (percentage relative to the most important feature), accuracy, positive predictive value, negative predictive value, sensitivity, sensibility, ROC and Precision-Recall AUCs.
Availability
The BioFeatureFinder software is available for download at GitHub [93] and is included as Additional file 9 for archival purposes.
Abbreviations
- aMI
- Adjusted mutual information
- AUC
- Area under curve
- BFF
- BioFeatureFinder
- CDF
- Cumulative distribution function
- CDS
- Coding sequence
- ChIP-seq
- Chromatin Immunoprecipitation sequencing
- CLIP-seq
- Cross-linking Immunoprecipitation sequencing
- CpG
- 5’—C—phosphate—G—3’
- DNA
- Deoxyribonucleic acid
- eCLIP-seq
- enhanced Cross-link Immunoprecipitation sequencing
- FTD
- Frontotemporal dementia
- GFF
- General feature format
- GTF
- General transfer format
- KDE
- Kernel density estimation
- KST
- Kolmogorv-Smirnov Test
- MFE
- Minimum free energy
- MSE
- Mean squared error
- N.P.V.
- Negative predictive value
- P.P.V.
- Positive predictive value
- P-R AUC
- Precision-Recall area under curve
- RBP
- RNA-binding protein
- RNA
- Ribonucleic acid
- ROC AUC
- Receiver operating characteristic area under curve
- SNP
- Single-nucleotide polymorphism
- St-GBCLF
- Stochastic gradient boost classifier
- UTR
- Untranslated region
Declarations
Acknowledgments
Funding was received from the São Paulo Research Foundation (FAPESP grants 12/00195-3, 13/50724-5). Felipe E. Ciamponi was recipient of a FAPESP fellowship (14/25758-6, 15/25134-5, 16/25521-1). Dr Mario H. Bengtson and Dr Marcelo Brandão who provided valuable scientific insights.
Competing interests
The authors declare they have no competing interest.
Data availability
All data used in this study is publicly available from ENCODE database and the developed software and raw results are available on GitHub.
Author contribution
FEC, MTL and KBM conceived the project and designed its overall goals. FEC and MTL designed the algorithm and wrote the Python code. FEC implemented classification algorithms for biological datasets and performed the analysis of the eCLIP datasets. PRSC assisted in developing, debugging, testing and benchmarking the algorithm. MTL and KBM coordinated the project. FEC wrote the manuscript. All authors read and approved the final manuscript.
Supplementary material
Additional File 1: Deviance, ROC and PR curves for RBFOX2
Additional File 2: Classifier Metrics and Results for all 112 RBPs
Additional File 3: K-mer enrichment scores and percentages Additional File 4: Overlap percentages
Additional File 5: Structure percentages
Additional File 6: Heatmap with Z-score for binding region preference
Additional File 7: RBP classification for Kmer/Struct/Overlap and combinations
Additional File 8: List of accession numbers for RBPs from ENCODE
Additional File 9: BioFeatureFinder v1.0 algorithm
Footnotes
Mailing address: Center for Molecular Biology and Genetic Engineering - CBMEG, University of Campinas, Campinas, SP, Brazil. Av Candido Rondo, 400 Cidade Universitária CEP 13083-875, Campinas, SP Phone: 55-19-98121-937
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.↵
- 10.
- 11.↵
- 12.↵
- 13.↵
- 14.↵
- 15.
- 16.↵
- 17.↵
- 18.↵
- 19.↵
- 20.↵
- 21.↵
- 22.↵
- 23.↵
- 24.↵
- 25.↵
- 26.
- 27.↵
- 28.↵
- 29.
- 30.
- 31.↵
- 32.↵
- 33.↵
- 34.↵
- 35.↵
- 36.↵
- 37.↵
- 38.
- 39.
- 40.↵
- 41.↵
- 42.↵
- 43.↵
- 44.↵
- 45.↵
- 46.↵
- 47.↵
- 48.
- 49.↵
- 50.↵
- 51.↵
- 52.↵
- 53.↵
- 54.↵
- 55.↵
- 56.↵
- 57.↵
- 58.↵
- 59.↵
- 60.↵
- 61.↵
- 62.↵
- 63.↵
- 64.↵
- 65.↵
- 66.↵
- 67.↵
- 68.↵
- 69.
- 70.↵
- 71.↵
- 72.↵
- 73.↵
- 74.↵
- 75.↵
- 76.↵
- 77.↵
- 78.↵
- 79.↵
- 80.↵
- 81.↵
- 82.↵
- 83.↵
- 84.↵
- 85.↵
- 86.↵
- 87.↵
- 88.↵
- 89.↵
- 90.↵
- 91.↵
- 92.↵
- 93.↵