Nfeature: A platform for computing features of nucleotide sequences

In the past few decades, public repositories of nucleotide sequences have grown at an exponential rate. This poses a major challenge to researchers seeking to predict the structure and function of nucleotide sequences. To annotate the function of nucleotide sequences, it is important to compute features (attributes) from which function can be predicted using machine learning techniques. In the last two decades, several software packages and platforms have been developed to derive a wide range of features from nucleotide sequences. To complement the existing methods, we present a platform named Nfeature for computing a wide range of features of DNA and RNA sequences. It comprises three major modules, namely Composition, Correlation, and Binary profiles. The Composition module computes different types of composition, including mono-, di-, and tri-nucleotide composition, reverse complement composition, and pseudo composition. The Correlation module computes various types of correlation, including auto-correlation, cross-correlation, and pseudo-correlation. Similarly, the Binary profile module computes binary profiles based on nucleotides and on mono-, di-, and tri-nucleotide properties. Nfeature also computes the entropy of sequences, repeats in sequences, and the distribution of nucleotides in sequences. In addition to computing features over the whole sequence, it can compute features from parts of a sequence, such as split, start, end, and rest. In a nutshell, Nfeature amalgamates existing features with a number of novel features such as nucleotide repeat index, distance distribution, entropy, binary profile, and properties. The tool computes a total of 29217 and 14385 features for DNA and RNA sequences, respectively. To provide a highly efficient and user-friendly tool, we have developed a standalone package and a web-based platform (https://webs.iiitd.edu.in/raghava/nfeature).

To supplement previous efforts, we have made a systematic attempt to develop a web server platform, "Nfeature", that integrates most of the features described in the past and incorporates new ones. In this study, we introduce the Nucleotide Repeat Index (NRI), the Distance Distribution of Nucleotides (DDN), and entropy at the sequence level as well as the nucleotide level as novel features for DNA/RNA sequences. We have also incorporated binary profile-based features for a given nucleotide sequence. These features are essential for tasks such as motif prediction and the prediction of factor/enhancer binding sites. Using these modules, users can easily calculate the binary fingerprint of each nucleotide in a given sequence. Nfeature is a comprehensive platform for extracting all relevant information from a given nucleotide sequence in the form of vectors, which can be used directly for developing prediction models. A user-friendly web server and a standalone package have been developed to facilitate the computation of features of nucleotide sequences (Figure 1).

Composition/distance distribution-based features
This module calculates composition-based features of nucleotide sequences. It enables users to compute nucleic acid composition, distance distribution of nucleotides (DDN), nucleotide repeat index (NRI), pseudo composition, and entropy of a sequence. We have incorporated most of the features used in previous studies. In addition, we integrate new features such as entropy, which we compute at the sequence level and at the nucleotide level. Several past studies have shown that nucleotide repeats have important biological functions. For example, repeated DNA residues are essential for the expression of unique coding sequences, which further form nucleoprotein complexes. In some cases, however, repeated sequences cause biological disorders; for example, "GGGGCC" repeats in the C9orf72 gene cause neurodegenerative disease (Chanou & Hamperl, 2021; De Bustos et al., 2016; Handy et al., 2011; Malik et al., 2021; Miret et al., 1997; Rajewska et al., 2012; Shapiro & von Sternberg, 2005). Residue repeat and distance distribution feature generation methods have previously been described for protein sequences (Akshara Pande, 2019). Hitherto, no method has captured such information from nucleotide sequences. In this study, we use NRI for the first time to calculate the repeating-nucleotide information of a DNA/RNA sequence. It measures the number of continuous runs of a nucleotide in a biological sequence. A separate module computes the distance distribution information of nucleotide residues for a nucleotide sequence.
The residue repeat and distance distribution of a nucleotide sequence are calculated from the quantities defined as follows. $\mathrm{NRI}_i$ and $\mathrm{DDN}_i$ stand for the nucleotide repeat index and the distance distribution of nucleotide type $i$. For $\mathrm{NRI}_i$, $N$ is the maximum number of occurrences and $W_j$ is the number of repeats in occurrence $j$ of nucleotide type $i$. For $\mathrm{DDN}_i$, $W_{NT}$ is the nucleotide distance from the N-terminal, $W_j$ is the inter-distance between nucleotides of type $i$, $W_{CT}$ is the nucleotide distance from the C-terminal, $L$ is the total length of the nucleic acid sequence, and $F_i$ is the frequency of nucleotide type $i$.
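The quantities named above can be extracted from a sequence with a few lines of code. The following is a minimal sketch, not the Nfeature implementation: the function names `run_lengths` and `distance_terms` are hypothetical, and the counting conventions (a run is a maximal stretch of identical nucleotides; a distance is the number of intervening nucleotides) are assumptions. The sketch only collects the inputs and does not attempt to combine them into $\mathrm{NRI}_i$ or $\mathrm{DDN}_i$.

```python
from itertools import groupby


def run_lengths(seq, nt):
    """Lengths of every maximal run of nucleotide `nt` in `seq` (assumed W_j for NRI)."""
    return [sum(1 for _ in group) for base, group in groupby(seq) if base == nt]


def distance_terms(seq, nt):
    """Collect the distance quantities for nucleotide `nt` (assumed inputs to DDN).

    Distances are counted as the number of nucleotides lying strictly
    between two positions; this convention is an assumption.
    """
    positions = [i for i, base in enumerate(seq) if base == nt]
    if not positions:
        return None  # the nucleotide does not occur in the sequence
    return {
        "W_NT": positions[0],                      # nucleotides before the first occurrence
        "W_j": [b - a - 1 for a, b in zip(positions, positions[1:])],  # gaps between occurrences
        "W_CT": len(seq) - 1 - positions[-1],      # nucleotides after the last occurrence
        "L": len(seq),                             # total sequence length
        "F_i": len(positions),                     # number of occurrences of `nt`
    }


if __name__ == "__main__":
    example = "AAGCAATA"
    print(run_lengths(example, "A"))
    print(distance_terms(example, "A"))
```

Under these conventions the terminal distances and the gaps always sum to $L - F_i$, which is a convenient sanity check on the extracted values.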
Shannon entropy plays a significant role in information theory. Several recent studies have shown the importance of entropy in DNA sequences, for example, to investigate exons and introns (Li et al., 2019) and to identify DNA sequence diversity between different alleles within one individual (Sherwin, 2010). To the best of our knowledge, no existing method incorporates entropy-based features. To obtain the Shannon entropy of DNA and RNA sequences, we therefore introduce, for the first time, an entropy-based module that computes the entropy of DNA/RNA sequences at the residue level as well as the sequence level. Sequence-level and residue-level entropy are calculated using equations 3 and 4.
Here, $i$ is a nucleotide in the sequence, $X$ is any nucleotide sequence, and $p_i$ is the probability of a given nucleotide in the sequence.
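As an illustration of these definitions, the sketch below computes the standard Shannon entropy of a sequence from its nucleotide frequencies. It is not the Nfeature code: the function names are hypothetical, the base-2 logarithm is an assumption, and taking the residue-level entropy of nucleotide $i$ to be its single term $-p_i \log_2 p_i$ is likewise an assumption.

```python
import math
from collections import Counter


def sequence_entropy(seq):
    """Shannon entropy of the whole sequence, in bits (base-2 log assumed)."""
    if not seq:
        raise ValueError("sequence must not be empty")
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in Counter(seq).values())


def nucleotide_entropy(seq, alphabet="ACGT"):
    """Per-nucleotide contribution -p_i * log2(p_i); an assumed residue-level form."""
    if not seq:
        raise ValueError("sequence must not be empty")
    total = len(seq)
    counts = Counter(seq)
    result = {}
    for nt in alphabet:
        p = counts[nt] / total
        # A nucleotide that never occurs contributes zero entropy.
        result[nt] = -p * math.log2(p) if p > 0 else 0.0
    return result


if __name__ == "__main__":
    print(sequence_entropy("AACC"))
    print(nucleotide_entropy("AACC"))
```

With this choice the per-nucleotide terms sum to the sequence-level entropy. For RNA, `alphabet="ACGU"` can be passed instead.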

Correlation-based features
In this module, correlation-based features of DNA and RNA sequences are calculated. Correlation is defined as a relation between properties (features): when a feature variable is related to itself, the relation is called auto-correlation, and when two different features or variables are related, it is called cross-correlation. Correlation-based features convert DNA and RNA sequences of different lengths into fixed-length vectors, so that machine learning techniques can be applied to the extracted features. These descriptors identify features based on nucleotide properties along the sequence.
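To make the fixed-length property concrete, the following sketch computes lagged auto- and cross-covariance of per-nucleotide property profiles, one common way of realizing such descriptors. Whether Nfeature uses exactly this form is an assumption. The property table `TOY_PROPERTIES` contains made-up values for illustration only, and all function names are hypothetical.

```python
# Made-up property values, for illustration only (not real physicochemical data).
TOY_PROPERTIES = {
    "prop_a": {"A": 0.1, "C": 0.4, "G": 0.7, "T": 1.0},
    "prop_b": {"A": 1.0, "C": 0.2, "G": 0.5, "T": 0.3},
}


def cross_covariance(seq, prop_u, prop_v, lag):
    """Covariance between property u at position i and property v at position i + lag."""
    u = [TOY_PROPERTIES[prop_u][base] for base in seq]
    v = [TOY_PROPERTIES[prop_v][base] for base in seq]
    n = len(seq) - lag
    if n <= 0:
        raise ValueError("lag must be smaller than the sequence length")
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((u[i] - mean_u) * (v[i + lag] - mean_v) for i in range(n)) / n


def auto_covariance(seq, prop, lag):
    """Auto-covariance is the special case where both properties are the same."""
    return cross_covariance(seq, prop, prop, lag)


def correlation_vector(seq, max_lag):
    """Concatenate all property pairs over lags 1..max_lag into one vector."""
    props = sorted(TOY_PROPERTIES)
    return [
        cross_covariance(seq, u, v, lag)
        for lag in range(1, max_lag + 1)
        for u in props
        for v in props
    ]


if __name__ == "__main__":
    for seq in ("ACGTACGT", "ACGTACGTTTGACA"):
        print(len(seq), len(correlation_vector(seq, max_lag=2)))
```

The vector length depends only on the number of properties and the maximum lag, not on the sequence length, which is what allows sequences of different lengths to be fed to the same model.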

Binary profile-based features
This module covers binary profile-based features of a given DNA/RNA sequence. Binary profile-based features are important for tasks such as motif prediction and the prediction of factor/enhancer binding sites.
Using these modules, users can compute binary equivalents of each nucleotide in a given sequence. Generally, overlapping windows are used to create all possible patterns of fixed length for a given sequence. Then each pattern is converted to a binary profile (1 is

Nucleotide-Based Features
This module involves three sub-modules for DNA and RNA: binary profiles for mono-nucleotides (BPM), dinucleotides (BPD), and trinucleotides (BPT). In BPM, each nucleotide is represented by a vector of length four: A is defined as (1,0,0,0), G as (0,1,0,0), C as (0,0,1,0), and T (for DNA) or U (for RNA) as (0,0,0,1). Similarly, in BPD each dinucleotide is represented by a vector of size 16 in which one element is 1 and the rest are zeros; for example, dinucleotide AA is (1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0). Likewise, in BPT each trinucleotide is represented by a vector of size 64. This module can compute 840 descriptors for a single DNA/RNA sequence with a length of at least 10 residues.
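The encoding described above can be sketched as a one-hot encoding over k-mers. This is an illustrative sketch rather than the Nfeature code: the function name `binary_profile` is hypothetical, the nucleotide order A, G, C, T follows the BPM vectors given in the text, and the ordering of dinucleotides and trinucleotides beyond AA (lexicographic over that alphabet) is an assumption. The number of positions encoded per sequence is also an assumption (every overlapping k-mer), so the vector lengths produced need not match the descriptor counts reported for the tool.

```python
from itertools import product


def binary_profile(seq, k=1, alphabet="AGCT"):
    """One-hot encode every overlapping k-mer of `seq` and concatenate the vectors.

    k=1, 2, 3 correspond to the BPM, BPD, and BPT style of encoding.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    profile = []
    for start in range(len(seq) - k + 1):
        vector = [0] * len(kmers)
        vector[index[seq[start:start + k]]] = 1  # exactly one element is set to 1
        profile.extend(vector)
    return profile


if __name__ == "__main__":
    print(binary_profile("AGCT"))           # four blocks of length 4
    print(binary_profile("AA", k=2))        # one block of length 16
    print(len(binary_profile("ACGTACGTAC", k=3)))
```

For RNA, `alphabet="AGCU"` can be passed so that U takes the place of T in the last position.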

Web Implementation and Standalone Package
To serve the scientific community, we have developed a user-friendly web server named "Nfeature". The web server runs on Apache on a Linux/Ubuntu operating system. All web pages were developed using HTML, CSS3, and PHP5. It is compatible with a range of devices such as smartphones, laptops, desktops, and iPads. The submit page permits the user to submit nucleotide sequences (DNA/RNA) in FASTA format. The result page allows the user to download the output in CSV format. Figure 1 represents the description of all the modules of the Nfeature tool (Figure 2). Additionally, the standalone package includes a readme file, a description manual, and separate code for the DNA and RNA modules in their respective directories.
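The FASTA-in, CSV-out workflow described above can be illustrated with a small self-contained sketch. It does not reproduce the Nfeature web server or standalone package: the function names are hypothetical, k-mer composition is used as the example feature, and reporting composition as a percentage is an assumption.

```python
import csv
import io
from itertools import product


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def kmer_composition(seq, k=1, alphabet="ACGT"):
    """Percentage of each k-mer among all overlapping k-mers (percent scale assumed)."""
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    total = len(seq) - k + 1
    for i in range(total):
        kmer = seq[i:i + k]
        if kmer in counts:  # k-mers with characters outside the alphabet are skipped
            counts[kmer] += 1
    return {kmer: (100.0 * c / total if total > 0 else 0.0) for kmer, c in counts.items()}


def fasta_to_csv(fasta_text, k=1):
    """Return a CSV string with one row of composition features per sequence."""
    out = io.StringIO()
    writer = csv.writer(out)
    header_written = False
    for header, seq in read_fasta(io.StringIO(fasta_text)):
        comp = kmer_composition(seq, k)
        if not header_written:
            writer.writerow(["id"] + list(comp))
            header_written = True
        writer.writerow([header] + [f"{value:.2f}" for value in comp.values()])
    return out.getvalue()


if __name__ == "__main__":
    print(fasta_to_csv(">seq1\nAACG\n>seq2\nGGTT\nAC\n"))
```

Each sequence becomes one row and each feature one column, so the resulting file can be loaded directly as a feature matrix.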

Comparison with other tools
In Table 1, we compare Nfeature with existing software and web servers in terms of platform compatibility, package development, and current running status. Most of the software tools and packages are either not working or not compatible with the most frequently used operating systems. Nfeature and BioSeq-Analysis 2.0 were found to be available as both a working web server and a standalone package, and both were found to be compatible with widely used platforms such as Windows, Linux, and Mac OS. Nfeature covers the existing features listed in Table 2 and adds novel ones. The platform provides new feature generation methods such as Nucleotide Repeat Index, Distance Distribution, Sequence-level Entropy, Nucleotide-level Entropy, and Binary profiles of input DNA/RNA sequences, as represented in Table 3.

Discussion and Conclusion
Owing to advances in next-generation sequencing technology, databases such as NCBI (Coordinators, 2016), GenBank (Benson et al., 2013), EMBL (Stoesser et al., 2002), and INSDC-DDBJ (Cochrane et al., 2016) are growing at an exponential rate. To address numerous unsolved biological questions, there is an urgent need to develop computer-aided tools to annotate new sequences in these databases. To annotate any sequence, the most important step is the computation of a numerical vector that represents the characteristics of the sequence. In simple terms, computing the features or descriptors of a sequence is an essential step in predicting its function or structure. In the past, various packages and web-based platforms have been developed to compute a wide range of features of protein and nucleotide sequences. For example, pseudo K-tuple nucleotide composition (PseKNC) is a method used to generate composition-based and a few correlation-based features from DNA/RNA sequences (Chen et al., 2014). RepDNA and RepRNA have been developed for calculating various features of DNA and RNA sequences, respectively (Liu et al., 2015). BioTriangle is a web server that generates features for chemicals, proteins, and nucleotide sequences and their interactions (Liu et al., 2016).
It is reported to calculate 14 types of features from DNA/RNA sequences.
PyBioMed also generates features for chemicals, proteins, and nucleotides (Dong et al., 2016). This package generates composition, autocorrelation, and pseudo nucleic acid composition-based feature vectors. BioSeq-Analysis (Liu, 2019) was recently updated to a new version, BioSeq-Analysis 2.0, which includes 26 features at the residue level and 90 features at the sequence level. This web server also develops prediction models based on the feature generation technique, but it is complex to use and not very user friendly. In addition, unlike Nfeature, it does not compute entropy-based features, the Nucleotide Repeat Index, or the Distance Distribution of DNA/RNA sequences. Nfeature, on the other hand, integrates all existing features available for DNA/RNA sequences (Table 1) and adds eight novel features for better insight into the sequence, as represented in Table 2.
In summary, a number of tools have been developed to compute various features from DNA/RNA sequences. Each of these tools provides a unique set of features to cater to different user requirements. The aim of developing Nfeature was to complement existing tools and to provide all possible feature generation techniques on a single platform in a user-friendly manner. To overcome the limitations of the existing tools, we have integrated their features for DNA and RNA sequences, together with a few new features, in the Nfeature platform. Overall, it is a comprehensive, easy-to-use web server and standalone package that allows users to calculate various features.

Funding Source
The current work has not received any specific grant from any funding agency.