EnhancerP-2L: A gene regulatory site identification tool for DNA enhancer regions using CRE motifs

Enhancers are DNA fragments that do not encode RNA molecules or proteins, yet they act critically in the production of RNAs and proteins by controlling gene expression. Predicting enhancers and their strength therefore plays a significant role in understanding the regulation of gene expression. Identifying enhancer regions within DNA sequences is considered a difficult task because enhancers are not close to their target gene, share few common motifs and are mostly tissue/cell specific. In the recent past, several bioinformatics tools were developed to discriminate enhancers from other regulatory elements and to identify their strength. However, the quality of their predictions still needs improvement before these tools can achieve practical application value. In this study, we propose a new method that builds on nucleotide composition and statistical-moment-based features to distinguish between enhancers and non-enhancers and, additionally, to determine their strength. The proposed method achieved better accuracy than current state-of-the-art methods under 5-fold and 10-fold cross-validation. These outcomes suggest that statistical-moment-based features can yield more efficient and effective results. For the accessibility of the scientific community, we have developed a user-friendly web server for EnhancerP-2L, which will increase the impact of bioinformatics on medicinal chemistry and medical science. The web server is freely accessible at http://www.biopred.org/enpred.


Introduction
In cellular biology, transcription is regulated by recruiting elongation factors or initiating RNA polymerase II. This is mainly achieved by transcription factors (TFs) binding to specific DNA sequences.
Transcription initiation sites are harbored by promoter regions which are the most studied sites in DNA [1]. Some DNA sequences have multiple transcription factor binding sites and are near or far away from promoter regions.
Such DNA segments are denoted as enhancers [2], [3]. Enhancers boost the transcription of genes and thereby influence various cellular activities, such as cell carcinogenesis and virus activity, tissue specificity of gene expression, differentiation and cell growth, and the regulation of gene expression, closely linking these processes together [4].
Enhancers can be short (50-1500 bp) segments of DNA situated up to 1 Mbp (1,000,000 bp) away from their gene; sometimes they even reside on a different chromosome [5], [6]. Promoters, on the other hand, are located near the transcription start sites of a gene. Because of this locational difference, predicting enhancers is far more difficult and challenging than predicting promoters [7]. Many human diseases, such as inflammatory bowel disease, other disorders and various cancers, have been linked to genetic variation in enhancers [8]-[11].
About 30 years ago, the first characterized enhancer, a DNA segment inside the SV40 tumor virus genome, was shown to increase the transcription of the β-globin gene in a transgenic assay [12]. Research in the recent past has discovered that enhancers comprise many subgroups, such as weak and strong enhancers, latent enhancers and poised enhancers [13]. Predicting enhancers and their subgroups is an interesting area of research, as they are considered important in disease and evolution. In higher eukaryotes, a diverse repertoire of transcription factors binds to enhancers [14]. This binding orchestrates many cellular events that are critical to the cellular system, including maintenance of cell identity, differentiation and response to stimuli [15], [16].
In the past, purely experimental techniques were relied upon for the identification of enhancers. Pioneering works in enhancer prediction were proposed in [4], [17]. The former used co-occurring factors such as the transcription factor P300 [18] to identify enhancers; this approach usually under-detected or missed its targets and suffered high failure rates, because not all enhancers are occupied by such transcription factors. The latter utilized DNase I hypersensitivity for enhancer prediction, which led to a high false-positive rate, since many non-enhancer DNA segments were incorrectly detected as enhancers. Genome-wide mapping of histone modifications [1], [19]-[23] could remedy these deficiencies in the prediction of promoters and enhancers, but it is time-consuming and expensive.
Several bioinformatics tools have been developed for the rapid and cost-effective classification of enhancers in genomics. CSI-ANN [21] used efficient data transformations to formulate the samples and an artificial neural network (ANN) for classification. EnhancerFinder [1] incorporated evolutionary-conservation features into the sample formulation, combined with a multiple kernel learning algorithm as the classifier. RFECS [23] applied the random forest algorithm to improve detection. EnhancerDBN [24] used deep belief networks for enhancer prediction, and BiRen [25] increased predictive performance with a deep-learning-based method. These bioinformatics tools make enhancer detection accessible to the research community.
Enhancers are formed from many different large sub-groups of functional elements and can be grouped as weak, strong, inactive and poised enhancers. iEnhancer-2L [26] was the first predictor to detect enhancers and identify their strength, and it was based on sequence information only. Features based on pseudo K-tuple nucleotide composition (PseKNC) were incorporated into iEnhancer-2L, and it has since been used increasingly in many genomics-related analyses.
Furthermore, many other methods, such as EnhancerPred [27] and EnhancerPred_2.0 [28], were introduced to improve performance by incorporating other DNA-sequence-based features. iEnhancer-5Step [29] was recently developed using hidden information of DNA sequences combined with support vector machine (SVM) based prediction. However, the performance of the aforementioned predictors still requires improvement.
Specifically, the success rate in discriminating strong from weak enhancers does not meet the expectations of the scientific community. The current study proposes a method to deal with this problem.

Benchmark Dataset
The benchmark dataset of DNA enhancer sites originally constructed in [26] was re-used for the proposed method. In this dataset, enhancers were collected from nine different cell lines (K562, H1ES, HepG2, GM12878, HSMM, HUVEC, NHEK, NHLF and HMEC) and 200 bp fragments were extracted from the DNA sequences. To remove redundant sequences, the CD-HIT tool [30] was used to exclude sequences with more than 20% pairwise similarity. The benchmark dataset includes 2,968 DNA sequences, of which 1,484 are enhancer sequences and 1,484 are non-enhancer sequences. Of the 1,484 enhancer sequences, 742 are strong enhancers and 742 are weak enhancers, used for the second-layer classification.
Furthermore, the independent dataset used by a recent study [29] was utilized to assess the effectiveness and performance of the proposed model. The independent dataset includes 400 DNA sequences, of which 200 are enhancers (100 strong and 100 weak) and 200 are non-enhancers. Table 1 and Table 2 give the breakdown of the benchmark dataset.
In statistics-based prediction models, the benchmark dataset mostly comprises a training dataset and a testing dataset.

Feature Extraction
Today, one of the core requirements in developing an effective bioinformatics predictor is to formulate the biological sequence as a vector or a discrete model without losing key sequence-order characteristics or sequence-pattern information. The reason, as explained in a comprehensive state-of-the-art review [31], is that existing machine-learning algorithms cannot handle sequences directly, but only their vector formulations.
However, in a discrete-model formulation there is a risk that all sequence-pattern information is lost from the vector. To overcome this loss for proteins, Chou proposed the pseudo amino acid composition (PseAAC) [32]. Chou's PseAAC concept has been widely used in almost all areas of bioinformatics and computational proteomics [31] ever since it was proposed. In the recent past, three publicly accessible and powerful software packages, 'propy' [33], 'PseAAC-Builder' [34] and 'PseAAC-General' [35], were developed, further increasing the importance and popularity of Chou's PseAAC in computational proteomics. 'PseAAC-General' calculates Chou's general PseAAC [36], while the other two generate Chou's special PseAAC in various modes [37]. The general PseAAC includes not only the feature vectors of all the special modes, but also higher-level feature vectors such as the "Gene Ontology" mode [36], the "Functional Domain" mode [36] and the "Sequential Evolution" or "PSSM" mode [36]. The success of PseAAC on various peptide/protein sequence problems motivated the introduction of PseKNC (pseudo K-tuple nucleotide composition) [38] for generating feature vectors for DNA/RNA sequences [39], [40], which also proved very effective and efficient. More recently, the powerful web server 'Pse-in-One' [41] and its updated version 'Pse-in-One2.0' [42] were developed, which can generate any preferred pseudo-component feature vector for DNA/RNA and protein/peptide sequences according to a researcher's requirements.
In this study, we utilized the Kmer approach [43] to represent the DNA sequences. Under Kmer, a DNA sequence is represented by the occurrence frequencies of its neighboring nucleic acids. Using the sequential model, a DNA sample S with w nucleotides is expressed generally as eqs. (2):

S = N_1 N_2 N_3 ... N_w

where N_1 denotes the first nucleotide of the DNA sample S, N_2 the nucleotide occurring at the 2nd position, and so forth, with N_w the last nucleotide; w is the total length of the DNA sample. Each nucleotide can be any one of the four bases, as described by eqs. (3):

N_v ∈ {A, C, G, T},  1 ≤ v ≤ w

Here ∈ is the set-theoretic 'member of' symbol. The components defined by this discrete model expose the relevant, useful features of the nucleotides and expedite the extraction methods; they are further used in the statistical-moment-based feature extraction methods.
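As an illustration, a minimal sketch (not the authors' implementation) of extracting Kmer occurrence frequencies from a DNA sequence:

```python
from itertools import product

def kmer_features(seq, k=2):
    """Occurrence frequencies of all k-mers over {A, C, G, T} in a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    total = len(seq) - k + 1  # number of overlapping k-mer windows
    for i in range(total):
        sub = seq[i:i + k]
        if sub in counts:
            counts[sub] += 1
    # Normalise counts to frequencies so sequences of different length are comparable.
    return [counts[km] / total for km in kmers]
```

For k = 2 this yields a 16-dimensional feature vector whose entries sum to 1.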

Statistical Moments
Statistical moments are quantitative measures used to study the concentrations of key configurations in a collection of data for pattern-recognition problems [44]. Different orders of moments describe different properties of the data: some reveal the eccentricity and orientation of the data, while others estimate its size [45]-[48]. Mathematicians and statisticians have formed several moments based on well-known distribution functions and polynomials. To explicate the current problem, Raw, Central and Hahn moments are utilized [49].
Raw moments are the moments used to calculate the mean, variance and asymmetry of a probability distribution; they are neither location-invariant nor scale-invariant. Central moments convey similar information but are calculated about the centroid of the data, which makes them location-invariant with respect to the centroid, though still scale-variant. Moments based on Hahn polynomials are known as Hahn moments; they too are neither location-invariant nor scale-invariant [50]-[53]. All three are sensitive to the ordering information of a biological sequence, which is the primary reason for choosing them: they are significant in extracting obscure features from DNA sequences. For the same reason, the use of scale-invariant moments was avoided in the current study. The values quantified by each method enumerate the data on its own measures, and variations in the characteristics of the data source imply variations in the moments computed for arbitrary datasets.
In the current study, the 2D versions of the aforementioned moments are used; hence the linear DNA sequence of eqs. (2) is transformed into a 2D notation. The 1D sequence of length z is arranged, using a row-major scheme, into a square matrix of dimension m = ceil(sqrt(z)), eqs. (4). The ordering obtained from eqs. (4) forms the matrix M of eqs. (5), with m rows and m columns:

M = | M(1,1)  M(1,2)  ...  M(1,m) |
    | M(2,1)  M(2,2)  ...  M(2,m) |
    | ...     ...     ...  ...    |
    | M(m,1)  M(m,2)  ...  M(m,m) |

The transformation from the matrix M to the square matrix M' is performed using the mapping function Ʀ, which assigns a numeric value to each nucleotide, eqs. (6):

M'(p, q) = Ʀ(M(p, q))

The raw moments of the 2D matrix M', of order (j + k) up to degree 3, are computed using eqs. (7):

U_jk = Σ_p Σ_q p^j q^k M'(p, q),  j + k ≤ 3

yielding the ten raw-moment features U00, U01, U10, U11, U02, U20, U12, U21, U30 and U03. The centroid of any data is its center of gravity: the point about which the data is uniformly distributed in all directions in terms of its weighted average [54], [55]. The central moments, also computed up to degree 3, take the centroid as their reference point, eqs. (8):

V_jk = Σ_p Σ_q (p - x̄)^j (q - ȳ)^k M'(p, q)

The ten distinct degree-3 central-moment features are labeled V00, V01, V10, V11, V02, V20, V12, V21, V30 and V03. The centroids x̄ and ȳ are calculated from eqs. (9) and (10):

x̄ = U10 / U00,  ȳ = U01 / U00

The Hahn moments are computed from the Hahn polynomials, eqs. (11):

h_n^(a,b)(r, N) = (N + b - 1)_n (N - 1)_n Σ_{k=0..n} (-1)^k [(-n)_k (-r)_k (2N + a + b - n - 1)_k] / [(N + b - 1)_k (N - 1)_k k!]

where the Pochhammer symbol (₽) is defined as in eqs. (12):

(a)_k = a (a + 1) ... (a + k - 1) = Γ(a + k) / Γ(a)
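A minimal sketch of the 1D-to-2D arrangement and the raw and central moments follows; the integer values assigned by the mapping Ʀ are an assumption, since the paper does not state them:

```python
import math

# Assumed nucleotide-to-integer mapping for Ʀ; the paper's actual values are not given.
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}

def to_matrix(seq):
    """Arrange a 1-D sequence into an m x m matrix in row-major order, m = ceil(sqrt(z))."""
    m = math.ceil(math.sqrt(len(seq)))
    M = [[0] * m for _ in range(m)]
    for idx, n in enumerate(seq):
        M[idx // m][idx % m] = CODE[n]
    return M

def raw_moments(M, degree=3):
    """Raw moments U_jk = sum_p sum_q p^j q^k M[p][q] for j + k <= degree."""
    m = len(M)
    U = {}
    for j in range(degree + 1):
        for k in range(degree + 1 - j):
            U[(j, k)] = sum((p + 1) ** j * (q + 1) ** k * M[p][q]
                            for p in range(m) for q in range(m))
    return U

def central_moments(M, degree=3):
    """Central moments V_jk computed about the centroid (x̄, ȳ) = (U10/U00, U01/U00)."""
    U = raw_moments(M, 1)
    xbar, ybar = U[(1, 0)] / U[(0, 0)], U[(0, 1)] / U[(0, 0)]
    m = len(M)
    return {(j, k): sum((p + 1 - xbar) ** j * (q + 1 - ybar) ** k * M[p][q]
                        for p in range(m) for q in range(m))
            for j in range(degree + 1) for k in range(degree + 1 - j)}
```

As a sanity check, the first-order central moments V10 and V01 are always zero, since they are taken about the centroid.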

DNA-Position-Relative-Incident-Matrix (D-PRIM)
DNA characteristics such as the ordered locations of nucleotides within a sequence are of pivotal significance for identification. The relative positioning of nucleotides is considered a core pattern governing the physical features of a DNA sequence. The DNA sequence is represented by D-PRIM as a matrix of order 4 x 4, eqs. (17):

D-PRIM = | S_1→1  S_1→2  S_1→3  S_1→4 |
         | S_2→1  S_2→2  S_2→3  S_2→4 |
         | S_3→1  S_3→2  S_3→3  S_3→4 |
         | S_4→1  S_4→2  S_4→3  S_4→4 |

This matrix is used to extract the position-relative attributes of every nucleotide in the given DNA sequence. Here S_x→y is the incidence score of the y-th nucleotide determined with respect to the position of the x-th nucleotide.
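Since the exact accumulation rule behind eqs. (17) is not fully recoverable from the text, the sketch below shows one plausible reading, in which entry (x, y) accumulates the positions of nucleotide y relative to the first occurrence of nucleotide x:

```python
NUC = "ACGT"

def d_prim(seq):
    """4 x 4 position-relative incidence matrix (a hedged sketch, not the paper's
    exact rule). Entry (x, y) sums the 1-based positions of nucleotide y measured
    relative to the first occurrence of nucleotide x."""
    first = {n: seq.index(n) + 1 for n in NUC if n in seq}
    M = [[0] * 4 for _ in range(4)]
    for x_i, x in enumerate(NUC):
        if x not in first:
            continue  # nucleotide x never occurs; its row stays zero
        for pos, y in enumerate(seq, start=1):
            M[x_i][NUC.index(y)] += pos - first[x]
    return M
```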

DNA-Reverse-Position-Relative-Incident-Matrix (D-RPRIM)
In cellular biology it often happens that more than one DNA sequence evolves from the same ancestor. Such cases mostly result in homologous sequences. Homologous sequences strongly affect the performance of a classifier, so accounting for sequence similarity is reliable and effective for producing accurate results. Since accuracy and efficiency in machine learning depend heavily on the meticulousness of the data formulation, D-RPRIM computes the same position-relative incidence statistics as D-PRIM, but on the reversed DNA sequence, so that patterns shared by homologous sequences are also captured.

Frequency-Distribution-Vector (FDV)
The frequency distribution vector was computed from the distribution of occurrences of every nucleotide. The frequency distribution vector (FDV) is defined as in eqs. (19):

FDV = {µ_1, µ_2, µ_3, µ_4}

Here µ_i is the frequency of occurrence of the i-th (1 ≤ i ≤ 4) nucleotide. Furthermore, the loss of information about the relative positions of nucleotides in a sequence is alleviated by these measures. The miscellany Super Feature Vector (SFV) includes these 4 features from FDV as unique attributes.
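A minimal sketch of eqs. (19), computing the four per-nucleotide occurrence frequencies:

```python
def fdv(seq):
    """Frequency-distribution vector: occurrence frequency of each of A, C, G, T."""
    return [seq.count(n) / len(seq) for n in "ACGT"]
```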

D-AAPIV (DNA-Accumulative-Absolute-Position-Incidence-Vector)
The frequency distribution vector stores only the distributional information of nucleotides, leaving their positions obscure. D-AAPIV complements it by accumulating, for each of the four nucleotides, the absolute positions at which that nucleotide occurs in the sequence, yielding a vector of four accumulative position sums.
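The section above is truncated in the source, so the sketch below is inferred from the vector's name alone: for each nucleotide, it accumulates the absolute (1-based) positions at which that nucleotide occurs:

```python
def d_aapiv(seq):
    """Accumulative-absolute-position-incidence vector: for each of A, C, G, T,
    the sum of all 1-based positions where it occurs (a sketch inferred from the
    vector's name; the paper's text is truncated here)."""
    return [sum(i for i, n in enumerate(seq, start=1) if n == x) for x in "ACGT"]
```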

Random Forests
In the past, ensemble learning methods have been applied in many bioinformatics-related research studies [56], [57] and have produced highly efficient performance. Ensemble methods employ many classifiers for a classification problem and aggregate their results. The two most commonly used methods, boosting [58], [59] and bagging [60], perform classification using trees. In boosting, successive trees assign extra weight to points incorrectly predicted by earlier classifiers, and the final prediction is decided by a weighted vote. In bagging, successive trees do not rely on earlier trees; each tree is constructed independently from a bootstrap sample of the data, and the final prediction is decided by a simple majority vote.
Leo Breiman [61] built random forests, which add an additional layer of randomness to bagging. Random forests construct each classification tree from a different bootstrap sample of the data. In standard classification trees, each node is split using the best split among all the variables; in random forests, each node is split using the best among a subset of predictors chosen randomly at that node (Figure 1 shows the structure of the random forest classifier). Compared with many other classifiers, such as support vector machines, discriminant analysis and neural networks, this somewhat counterintuitive strategy performs very well and is robust against overfitting [56]. In the proposed model, the number of trees was set to 25. According to [63], 128 is the theoretical upper limit on the number of trees, and any further increase will not improve accuracy. A key finding during experimentation was that forests with more than 25 trees contributed minimally to the accuracy of the classifier, while substantially enlarging the overall size of the proposed model. Figure 2 illustrates the overall process of the proposed method in a flowchart.
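As an illustration of this setup, a minimal scikit-learn sketch of fitting a 25-tree forest on pre-extracted feature vectors; apart from the tree count reported above, the hyper-parameters are assumptions, not the authors' exact configuration:

```python
from sklearn.ensemble import RandomForestClassifier

def train_enhancer_rf(features, labels, n_trees=25):
    """Fit a random forest on pre-extracted feature vectors (moments, FDV, etc.)."""
    clf = RandomForestClassifier(
        n_estimators=n_trees,   # 25 trees, per the experimentation described above
        max_features="sqrt",    # random subset of predictors tried at each split
        bootstrap=True,         # each tree sees a different bootstrap sample
        random_state=0,
    )
    clf.fit(features, labels)
    return clf
```

Usage: `train_enhancer_rf(X, y).predict(X_new)` returns the majority-vote class labels.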

Experiments and Results
To assess and verify the model and analyze its performance, several evaluation methods are used. These methods evaluate the classifier using inspection attributes based on the outcomes of the classification assessments and estimates.

The K-fold cross-validation (KFCV) technique is the one most commonly used by practitioners to estimate classification error. Also known as rotation estimation, KFCV splits a dataset into K randomly selected folds of approximately equal size. The prediction error of the fitted model is calculated by predicting the k-th part of the data from a model fitted on the other K - 1 parts. The same procedure is repeated for each k = 1, 2, ..., K, and the K prediction-error estimates are combined.
Although the generalization performance of a classifier is mostly estimated using the nearly unbiased approximations of the jackknife test, this test has two drawbacks: first, its variance is high, because the estimates over all the datasets are very similar to each other; second, it is computationally expensive, since n estimates must be computed, where n is the total number of observations in the dataset. The 5-fold and 10-fold cross-validation tests are proven to be a good compromise between computational requirements and impartiality.
In KFCV tests, the selection of K is a significant attribute. Cross-validation tests with K = 5 and K = 10 have been used in many research studies to estimate the errors of prediction models. The 5-fold and 10-fold tests produced accurate results for our proposed model, much better than state-of-the-art methods; these results are listed in Table 5 and Table 6.
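The rotation-estimation procedure can be sketched as follows; `fit` and `score` are placeholder callables, not the paper's actual classifier:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k randomly assigned folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, score, X, y, k=5):
    """Rotation estimation: fold i is held out while the other k-1 folds fit the
    model; the k held-out scores are averaged."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, f in enumerate(folds) if m != i for j in f]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model, [X[j] for j in test_idx],
                            [y[j] for j in test_idx]))
    return sum(scores) / k
```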

Evaluation Parameters
Binary classification problems use metrics such as Accuracy (Acc), Sensitivity (Sn), Specificity (Sp) and the Matthews correlation coefficient (MCC) to measure the quality and efficiency of the proposed prediction model. These metrics are defined in eqs. (24):

Sn  = TP / (TP + FN)
Sp  = TN / (TN + FP)
Acc = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) (TP + FN) (TN + FP) (TN + FN))

Here TP (true positives), TN (true negatives), FP (false positives) and FN (false negatives) represent the outcomes of the cross-validation tests. Unfortunately, these conventional formulations lack intuitiveness, and many scientists have faced difficulties in understanding these measures, especially the MCC. To ease this difficulty, the conventional equations were converted by Xu [64] and Feng [65] into Chou's four intuitive equations, written with the symbols N+, N+-, N- and N-+ introduced by Chou [66]; these symbols are described in Table 3.

Table 3: Description of the symbols used in Chou's intuitive equations.

Symbol   Description
N+       The total number of true enhancers
N+-      The number of true enhancers incorrectly predicted as non-enhancers
N-       The total number of true non-enhancers
N-+      The number of true non-enhancers incorrectly predicted as enhancers

From the correspondence in Table 3 we can define eqs. (25):

TP = N+ - N+-,  FN = N+-,  TN = N- - N-+,  FP = N-+

Substituting eqs. (25) into eqs. (24) yields eqs. (26):

Sn  = 1 - N+- / N+
Sp  = 1 - N-+ / N-
Acc = 1 - (N+- + N-+) / (N+ + N-)
MCC = (1 - (N+- / N+ + N-+ / N-)) / sqrt((1 + (N-+ - N+-) / N+) (1 + (N+- - N-+) / N-))

Eqs. (26) have the same meaning as eqs. (24) but are easier to understand and more intuitive. Table 4 interprets the boundary values of these metrics; for instance, MCC = 0 indicates that the overall prediction is no better than a random prediction.
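Both formulations can be checked numerically. The sketch below implements the conventional metrics and Chou's intuitive forms and confirms that they agree under the correspondence N+ = TP + FN, N+- = FN, N- = TN + FP, N-+ = FP:

```python
import math

def metrics(tp, tn, fp, fn):
    """Conventional Acc, Sn, Sp and MCC from a confusion matrix (eqs. 24)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sn, sp, mcc

def chou_metrics(n_pos, n_pos_neg, n_neg, n_neg_pos):
    """The same four metrics in Chou's intuitive notation (eqs. 26):
    N+ true enhancers, N+- enhancers missed, N- true non-enhancers,
    N-+ non-enhancers called enhancers."""
    sn = 1 - n_pos_neg / n_pos
    sp = 1 - n_neg_pos / n_neg
    acc = 1 - (n_pos_neg + n_neg_pos) / (n_pos + n_neg)
    mcc = (1 - (n_pos_neg / n_pos + n_neg_pos / n_neg)) / math.sqrt(
        (1 + (n_neg_pos - n_pos_neg) / n_pos)
        * (1 + (n_pos_neg - n_neg_pos) / n_neg))
    return acc, sn, sp, mcc
```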
The set of metrics in Table 4 applies only to single-label systems and is not applicable to multi-label prediction models. A different set of metrics exists for multi-label systems, which has been used by various researchers [67]-[69]. The comparison of the proposed method with existing classifiers is given in Table 6.

Results and Discussions
The prediction results of the classification algorithms on the benchmark dataset, with their respective accuracies, are shown in Table 5. The currently proposed method, utilizing obscure features from statistical moments and random forest classification, achieved 91.68% and 84.53% accuracy in predicting enhancers and their strength, respectively, in 5-fold cross-validation tests (see Table 5 and Figure 3 for the ROC). Furthermore, a 10-fold cross-validation test was also conducted on the benchmark dataset using the random forest classifier; the accuracy results are listed in Table 6 and the ROCs for the 10-fold tests are shown in Figures 4-9. In addition to the cross-validation tests, an independent test was performed using the independent dataset; the comparison of the proposed model with state-of-the-art methods on the independent dataset is listed in Table 7. Because our feature sets are evaluated using Hahn moments, the random-forest-based classifier can classify the feature vectors in short time, making the method very efficient compared with previous methods, which could not produce better results at the computational cost of their training and testing.

Web-Server
As observed in past studies [72]-[76], developing a web server is highly significant for building more useful prediction methodologies. Efforts to provide user-friendly web servers have accordingly been made in the past [77]-[81] to ease the work of biologists and scientists in drug discovery. The web server developed for the proposed method, built using Flask 1.1.1, is freely accessible at http://www.biopred.org/enpred. Step-wise instructions for using the system follow (Fig. 10 shows the GUI of the www.biopred.org/enpred web server for enhancer predictions).

Step-1
Open your web browser and navigate to www.biopred.org/enpred. The first page of the web server (see Figure 10) is the Home page, from which you can proceed to the ReadMe, Server, Data and Citations pages through the navigation links provided. The Server tab leads to the prediction portal. The Data tab provides links to download the benchmark datasets used during training and testing. Finally, the Citations tab takes the user to a page with information about the relevant paper and its citation. To perform a prediction, click on the Server tab.

Step-2
On the Server page, the user is provided with an empty text area for the DNA enhancer or non-enhancer sequence to be predicted. The input sequence is required to be in FASTA format. The Submit button starts the prediction process for the input sequence, and the results appear on the Results page. The duration of the prediction process depends on the length of the input sequence.

Step-3
On the Data page, the user is provided with links to download the benchmark datasets for future experimentations.

Step-4
On the ReadMe page, the user is provided with relevant information about the current model which includes the details about the operation algorithm used for predictions.

Conclusions
In the proposed research, an efficient model for predicting enhancers and their strength using statistical moments and a random forest classifier was developed. In the recent past, many methods were proposed to predict enhancers, but our method has proved more accurate than the existing state-of-the-art methods, achieving accuracies of 91.68% and 84.53% for enhancer and strong-enhancer classification on the benchmark dataset, which are currently the highest reported for the prediction of enhancers and their strength.

Conflict of interest
All authors have declared that they have no conflict of interest.

Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.