Sigma70Pred: A highly accurate method for predicting sigma70 promoter in prokaryotic genome

The sigma70 factor plays a crucial role in prokaryotes, where it regulates the transcription of most housekeeping genes. One of the major challenges is to predict the sigma70 promoter, or sigma70 factor binding site, with high precision. In this study, we trained and evaluated our models on a dataset consisting of 741 sigma70 promoters and 1400 non-promoters. We generated around 8000 features, including Dinucleotide Auto-Correlation, Dinucleotide Cross-Correlation, Dinucleotide Auto Cross-Correlation, Moran Auto-Correlation, Normalized Moreau-Broto Auto-Correlation, and Parallel Correlation Pseudo Tri-Nucleotide Composition, among others. Our SVM-based model achieved a maximum accuracy of 97.38% with an AUROC of 0.99 on the training dataset using 200 selected features. To check the robustness of the model, we tested it on an independent dataset built from the latest version of RegulonDB (10.8), which included 1134 sigma70 and 638 non-sigma70 promoters, and achieved an accuracy of 90.41% with an AUROC of 0.95. The resulting method, Sigma70Pred, is freely available as a web server and as standalone packages at https://webs.iiitd.edu.in/raghava/sigma70pred/.


Introduction
Promoters and enhancers regulate the fate of a cell and the expression of genes.
Promoters are generally located upstream of a gene's transcription start site (TSS) and are responsible for switching the respective gene on or off. In prokaryotes, promoters are recognized by the holoenzyme, which is made up of RNA polymerase and an associated sigma factor. There are various types of sigma factors with different functions: sigma54 controls the transcription of genes responsible for the modulation of cellular nitrogen levels, sigma38 controls stationary phase genes, sigma32 regulates heat-shock genes, and sigma24 and sigma18 control the extra-cytoplasmic functions [1]. The number associated with each sigma factor represents its molecular weight. The sigma70 factor is crucial because it regulates the transcription of most housekeeping genes. A sigma70 promoter comprises two well-defined short sequences located at -10 and -35 base pairs upstream of the TSS, known as the Pribnow box and the -35 region, respectively [2]. Classifying the promoters in a genome is essential, as it can help illuminate the genome's regulatory mechanism and disease-causing variants within cis-regulatory elements. Promoters are of great interest because of their importance not only in developmental gene expression but also in environmental response. Owing to advances in sequencing technology, sequence data are growing exponentially, and the classification of promoter regions has become a crucial problem because the standard procedures are costly in terms of time and performance [3,4].
In the past, several methods have been developed for predicting sigma70 promoters.
IPMD [5] is based on increment of diversity and achieved an accuracy of 87.9%. This method was trained on a RegulonDB [6] dataset that contains 741 E. coli sigma70 promoters.
A Z-curve-based approach [7] attained a maximum accuracy of 96.1% on a smaller dataset comprising 576 sigma70 promoters and 825 non-sigma70 promoters. PseZNC [8] is based on a multi-window Z-curve approach and reached a maximum accuracy of 84.5% using the dataset from RegulonDB 9.0 [6]. 70ProPred [9] incorporated features such as position-specific trinucleotide propensity based on single-stranded characteristic (PSTNPss) and electron-ion potential values for trinucleotides (PseEIIP), and reported an accuracy of 95.56%.
In the present study, we developed a computational method, Sigma70Pred, to classify sequences as sigma70 promoters or non-promoters. The dataset used for benchmarking is the same as that used in 70ProPred. One objective of this study is to improve prediction performance on a large and recent dataset. A web server and Python- and Docker-based standalone software have been developed so that the scientific community can predict sigma70 promoters.

Dataset generation
To train and test our models using cross-validation, we obtained the training dataset from RegulonDB 9.0 [6]. It contains 741 sigma70 promoters and 1400 non-promoters, each 81 nucleotides long. The same data were used previously by 70ProPred. To validate our model on an external, independent dataset, we extracted sequences from RegulonDB 10.8, comprising 1134 sigma70 and 638 non-sigma70 promoters. No sequence is shared between the training and independent datasets. The datasets can be downloaded from our server.
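The requirement that no sequence appears in both datasets can be checked with a simple set comparison. The following is a minimal sketch of such a check; the function name and the toy sequences are illustrative assumptions and are not taken from the Sigma70Pred package.

```python
def remove_overlap(training, independent):
    """Return the independent sequences that do not occur in the training set."""
    # Compare in upper case so that 'acgt' and 'ACGT' count as identical.
    seen = {seq.upper() for seq in training}
    return [seq.upper() for seq in independent if seq.upper() not in seen]


# Toy sequences (hypothetical, far shorter than the real 81-nucleotide entries).
train = ["ACGTAC", "TTGACA"]
indep = ["ttgaca", "TATAAT", "GGCCGG"]
filtered = remove_overlap(train, indep)
```

Here the first independent sequence is dropped because it matches a training sequence once case is ignored.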

Overall workflow
The comprehensive workflow for Sigma70Pred is shown in Figure 1.

Model development
In this study, we developed models for predicting sigma70 promoters using a wide range of machine learning techniques: decision tree (DT), random forest (RF), k-nearest neighbor (KNN), extreme gradient boosting (XGB), Gaussian naïve Bayes (GNB), and support vector machine (SVM) [14]. The SVM-based model gave the best performance.
The model that performed best on the training dataset was then evaluated on the independent dataset (obtained from RegulonDB 10.8).

Five-fold cross-validation
To avoid bias and to test the performance of the prediction models, we implemented five-fold cross-validation. In this approach, the complete dataset is divided into five parts; the model is trained on four of the five parts and tested on the remaining part, and the performance is recorded. The process is repeated five times so that each part is used once for testing. The overall performance is calculated as the mean over all five iterations [15].
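The splitting and averaging procedure can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the shuffling seed, the function names, and the stand-in scorer are hypothetical, and a real run would train a classifier inside the scoring function.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average evaluate(train_idx, test_idx) over the k folds."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k


# Stand-in scorer (hypothetical): a real run would fit a model on train_idx
# and return, for example, its accuracy on test_idx.
def fraction_held_out(train_idx, test_idx):
    return len(test_idx) / (len(train_idx) + len(test_idx))


# 741 promoters + 1400 non-promoters = 2141 training samples.
mean_score = cross_validate(2141, fraction_held_out)
```

Each sample lands in exactly one test fold, so every sequence is used for testing once and for training four times.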

Measures of performance
To assess the performance of the prediction models, we used both threshold-dependent and threshold-independent parameters. The threshold-dependent parameters are sensitivity, the percentage of sigma70 samples classified correctly; specificity, the percentage of non-sigma70 samples classified as negative; accuracy, the percentage of all samples predicted correctly; and the Matthews correlation coefficient (MCC), which describes the relationship between the observed and predicted values. As a threshold-independent measure, we considered the Area Under the Receiver Operating Characteristic curve (AUROC). The threshold-dependent parameters are calculated as follows:

$$\text{Sensitivity} = \frac{P_T}{P_T + N_F} \times 100$$

$$\text{Specificity} = \frac{N_T}{N_T + P_F} \times 100$$

$$\text{Accuracy} = \frac{P_T + N_T}{P_T + P_F + N_T + N_F} \times 100$$

$$\text{MCC} = \frac{P_T N_T - P_F N_F}{\sqrt{(P_T + P_F)(P_T + N_F)(N_T + P_F)(N_T + N_F)}}$$

where P_T refers to the number of true positives, P_F to the number of false positives, N_T to the number of true negatives, and N_F to the number of false negatives.
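The threshold-dependent measures follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions of sensitivity, specificity, accuracy, and MCC; the function name and the example counts are hypothetical and are not results from this study.

```python
import math


def performance(tp, fp, tn, fn):
    """Compute threshold-dependent measures from confusion-matrix counts.

    tp, fp, tn, fn correspond to P_T, P_F, N_T, and N_F in the text.
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal total is zero; report 0.0 in that case.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the formulas.
scores = performance(tp=90, fp=10, tn=80, fn=20)
```

Sensitivity, specificity, and accuracy are returned as percentages, whereas MCC stays on its native scale from -1 to 1.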

Compositional analysis
To assess the proportion of each nucleotide in sigma70 promoters and non-promoters, we calculated the mono-nucleotide composition. As shown in Figure 2, adenine and thymine are abundant in sigma70 promoter sequences, whereas cytosine and guanine make up a higher percentage of non-promoter sequences.
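Mono-nucleotide composition is the percentage of each base among all counted bases. The sketch below pools the counts over a set of sequences, which matches a per-sequence average when all sequences have the same length, as they do in this dataset; the function name and toy input are illustrative assumptions.

```python
from collections import Counter


def mononucleotide_composition(sequences):
    """Return the percentage of A, C, G, and T pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore anything that is not one of the four standard bases.
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: 100.0 * counts[base] / total for base in "ACGT"}


# Toy input (hypothetical); real sequences are 81 nucleotides long.
composition = mononucleotide_composition(["AATT", "ACGT"])
```

Running the function separately on the promoter and non-promoter sets gives the two profiles that the compositional comparison is based on.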

Figure 2. Nucleic acid composition in sigma70 promoter and non-promoter sequences (y-axis: percent composition).

Performance comparison with existing methods
Several methods have been trained and evaluated on the same benchmark dataset, including 70ProPred [9], iPro70-FMWin [10], PseZNC [8], Z-Curve [7], and IPMD [5]. We compared the performance of Sigma70Pred with these existing methods for sigma70 promoter prediction and found that Sigma70Pred outperformed all of them, as shown in Table 2.

Performance comparison on independent dataset
To evaluate the robustness and performance of the method, we also tested our model, along with several existing methods, on the independent dataset of DNA sequences extracted from RegulonDB 10.8. The results show that our model is robust to unseen data and performs well on it, which suggests that our SVM model is largely free from bias and from overfitting to the training data. As shown in Table 3, two of the four existing methods were unable to produce results, and Sigma70Pred outperforms iPro70-FMWin.

Implementation of model in web server
To serve the scientific community, we developed the Sigma70Pred web server, which implements our best model for predicting sigma70 promoters. The web server consists of three modules: "Predict," "Scan," and "Design." Each module is described below.

Predict
This module allows users to classify a submitted sequence as a sigma70 promoter or non-promoter. The module restricts sequence length because the model was trained on sequences of length 81: if the submitted sequence is shorter than 81 nucleotides, 'A' is added as a dummy nucleotide until the length reaches 81, and if it is longer than 81, only the first 81 nucleotides are considered for prediction. Sequences can be submitted in either FASTA or single-line format, and the user can select the SVM score threshold above which a sequence is classified as a sigma70 promoter; otherwise it is classified as a non-promoter. The user can provide a single sequence or multiple sequences, or upload a text file containing sequences. The output page displays the results in tabular form, which can be downloaded in CSV format.
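The length handling and thresholding described for the Predict module can be sketched as follows. This is an illustration, not the server's code: the function names are hypothetical, and appending the 'A' padding at the end of the sequence is an assumption, since the text does not state where the padding is placed.

```python
WINDOW = 81  # sequence length the model was trained on


def fit_to_window(seq, window=WINDOW, pad="A"):
    """Pad short sequences with 'A' or truncate long ones to the window length."""
    seq = seq.upper()
    if len(seq) < window:
        # Assumption: padding is appended at the end of the sequence.
        return seq + pad * (window - len(seq))
    return seq[:window]


def label(score, threshold):
    """Call a promoter only when the SVM score is above the chosen threshold."""
    return "promoter" if score > threshold else "non-promoter"


short = fit_to_window("acgt")
long_ = fit_to_window("C" * 100)
```

Because padding changes the sequence content, predictions for inputs much shorter than 81 nucleotides should be interpreted with care.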

Scan
The Scan module allows users to scan a given prokaryotic genome to identify sigma70 promoter regions. Unlike the Predict module, this module has no length restriction. Overlapping patterns of length 81 are generated from the submitted sequences and then used for prediction. The user can provide a single sequence or multiple sequences in either FASTA or single-line format, or upload a sequence file. The output lists the overlapping patterns of length 81, each with its prediction as promoter or non-promoter. The result can be downloaded in CSV format.
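Generating overlapping patterns amounts to sliding a fixed-length window along the sequence. The sketch below assumes a step of one nucleotide between consecutive windows, which the text does not specify; the function name and the 1-based start positions are also illustrative choices.

```python
def overlapping_windows(seq, window=81, step=1):
    """Return (start, pattern) pairs for every window of the given length.

    Start positions are 1-based. A sequence shorter than the window
    yields an empty list.
    """
    seq = seq.upper()
    return [
        (start + 1, seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


# An 85-nucleotide toy sequence (hypothetical) gives 85 - 81 + 1 = 5 windows.
toy = "ACGTA" * 17
patterns = overlapping_windows(toy)
```

Each pattern can then be scored in the same way as a single sequence submitted for prediction.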

Design
The Design module allows users to identify mutations that can convert a sigma70 promoter into a non-promoter, or vice versa. This module is also restricted to sequences of length 81, as it generates all possible mutants by changing the nucleotide at each position and then makes predictions based on the selected threshold. Because generating all possible mutants is expensive in time and computation, only one sequence is allowed at a time.
The output page displays all possible mutants, each with its prediction as promoter or non-promoter, in tabular form, which can be downloaded in CSV format.
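Mutant generation can be sketched as below. The example interprets "all possible mutants" as all single-nucleotide substitutions, which is an assumption; under that reading an 81-nucleotide sequence yields 81 x 3 = 243 mutants. The function name and tuple layout are hypothetical.

```python
def single_point_mutants(seq, alphabet="ACGT"):
    """Return (position, original, replacement, mutant) for every substitution.

    Positions are 1-based. Only single-nucleotide substitutions are
    generated, which is an assumption about the Design module.
    """
    seq = seq.upper()
    mutants = []
    for pos, original in enumerate(seq):
        for base in alphabet:
            if base != original:
                mutant = seq[:pos] + base + seq[pos + 1:]
                mutants.append((pos + 1, original, base, mutant))
    return mutants


# Toy example (hypothetical two-nucleotide sequence).
toy_mutants = single_point_mutants("AC")
```

Scoring each mutant and comparing its label with that of the original sequence identifies the substitutions that switch the predicted class.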

Standalone
We have also developed Python- and Docker-based standalone packages, which can be downloaded from https://webs.iiitd.edu.in/raghava/sigma70pred/stand.html. The advantage of the standalone version is that it does not depend on the availability of the internet: users can download it to their local machines and use all the aforementioned modules. It takes single or multiple sequences as input from a file in either FASTA or single-line format. The output is stored in a user-defined file in comma-separated value (CSV) format.
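The two accepted input formats and the CSV output can be illustrated with the sketch below. It is not the standalone package's code or command-line interface: the function names, the automatic `seq_N` naming for single-line input, and the output columns are all hypothetical.

```python
import csv
import io


def read_sequences(text):
    """Parse FASTA or one-sequence-per-line text into (name, sequence) pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not any(ln.startswith(">") for ln in lines):
        # Single-line format: every non-empty line is one sequence.
        return [(f"seq_{i}", ln.upper()) for i, ln in enumerate(lines, 1)]
    records, name, chunks = [], None, []
    for ln in lines:
        if ln.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks).upper()))
            name, chunks = ln[1:].strip(), []
        elif name is not None:
            # FASTA sequences may span several lines; join them.
            chunks.append(ln)
    if name is not None:
        records.append((name, "".join(chunks).upper()))
    return records


def write_csv(rows, handle):
    """Write (name, sequence) rows to an open file handle as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["name", "sequence"])  # hypothetical column names
    writer.writerows(rows)


fasta_records = read_sequences(">p1\nACG\nTTA\n>p2\nggc\n")
buffer = io.StringIO()
write_csv(fasta_records, buffer)
```

In practice a prediction column would be added to each row before writing.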

Conclusions
Sigma70Pred offers a web server and standalone packages for predicting sigma70 promoters from sequence information. The method uses 200 selected features, which we expect to have greater capability to classify sigma70 promoters. Sigma70Pred provides three major modules: Predict, Scan, and Design. As an application of our method, users can scan an entire prokaryotic genome to identify sigma70 promoters using the Scan module. Using the Design module, users can also identify the minimum number of mutations required to alter a sigma70 promoter region, i.e., to either induce or diminish its promoter capability. Compared with the existing methods for predicting sigma70 promoters, Sigma70Pred produced commendable results. We believe that Sigma70Pred will play an essential role in the area of genomic analysis.