DeepPlnc: Bi-modal Deep Learning for Highly Accurate Plant lncRNA Discovery

We present here a bi-modal CNN based deep-learning system, DeepPlnc, to identify plant lncRNAs with high accuracy while using sequence and structural properties. Unlike most of the existing software, it works accurately even in conditions with ambiguity of boundaries and incomplete sequences. It scored consistently high for performance metrics while breaching accuracy of >98% when tested across a large number of validated instances. During benchmarking it consistently outperformed all the compared tools and maintained a highly significant lead in the range of 4.6%-10.3% from the second best performing tool (p-value << 0.01). DeepPlnc was used to annotate a de novo assembled transcriptome of a himalayan species where again it suggested its much better suitability for genome and transcriptome annotation purposes than the existing tools. DeepPlnc has been made freely available as a web-server and stand-alone program at https://scbb.ihbt.res.in/DeepPlnc/.


Introductioñ
80% of the genome is commonly transcribed into RNA, of which the majority belongs to noncoding RNAs (ncRNAs) alone [1]. More than 80% of these ncRNAs belong to long non-coding RNAs (lncRNAs) [2]. Though not a clear definition or characterization process exists for them, the ncRNAs having length greater than 200 bases are considered as lncRNAs with short open reading frames (<100 amino acids) which are not translated into functional proteins [3]. LncRNAs are supposed to act as a key modulator for various biological processes [4]. Their involvement is reported in controlling transcription process through enhancers and providing regulatory binding sites is well reported [5,6]. They are also reported to work as sponges to miRNAs and suppress miRNA function by causing deflection in their supposed targeting [7]. The lncRNAs are also found in nucleus and act as major components of nuclear speckles while being a partner in epigenetic and transcriptional controls [8]. In cytoplasm, lncRNAs interact with several RNA binding proteins (RBPs) and control their regulatory dynamics [9].
In plants, the first reporting of lncRNA was done in Soybean for ENOD40 [10]. To this date, the number of experimentally identified lncRNAs in plants is much lesser than those reported for human and mouse, despite the fact that plant genomes are much more complex than animals' with several genomes much larger than the human genome. Unlike animal system based resources, there is a dearth of such resources for plants. The majority of the entries belong to the predicted ones through some rules based approached and lag behind enormously when compared to animals based resources and information. Seeing such huge dependence of lncRNAs annotations in plants upon software alone itself makes a strong case for research on such tools. And when one looks for computational tools available to identify plant lncRNAs, more so for annotation and identification purpose, there is an enormous scope for improvement.
Compared to animals, plant lncRNAs differ even in their transcriptional modes and display much more complexity and diversity with marked poor degree of sequence conservation. This is one big reason for the evident scarcity of reliable tools and resources for plant lncRNA identification which mostly work with animal system identified properties. Some tools based on regular machine learning approaches like CPC [11] and CPC2 [12] have been developed where the inputs from human datasets were used to train the models. Yet, there are some recent tools like PLncPRO [13], RNAplonc [14], PreLnc [15], and CNIT [16] which have trained models on plant datasets exclusively. Coding Potential Calculator (CPC) and its upgraded version, CPC2, employ Support Vector Machine (SVM) models which depend upon four sequence intrinsic features (Fickett TESTCODE score, ORF length, ORF integrity, isoelectric point) to evaluate the protein-coding potential of transcripts, using ORF prediction quality from a BLASTX search based homologybased approach.
PLncPRO is an another tool based on machine learning approach and applies random forest (RF) algorithm while using 71 features, including BLASTX outputs and frame entropy and frequencies of each of the 64 trimers [13]. Another tool, RNAplonc, uses 16 features including sequence and structural free energy and applied REPTree machine learning algorithm [14]. A more recently published tool, PreLnc, uses incremental feature selection method across the traditionally identified features for lncRNAs. PreLnc is applying Pearson correlation co-efficient to reduce the total features and applies five different classifiers' comparisons (logistic regression (LR), SVM, decision tree (DT), RF, and K-nearest neighbor methods (KNN)) to build the models [15]. In addition to these tools, CNIT, an upgraded version of CNCI [17] tool, exhibits high accuracy using XGBoost machine learning algorithm. This tool considers both animal and plant datasets for model building.
CNIT uses sequence based features such as maximum most-likely CDS score, their lengths, their standard deviation scores, and frequency of 64 codons [16].
Apart from the above mentioned machine learning approaches, which depend a lot upon successful manual identification and features extraction by the authors, very recently efforts have been made to involve deep learning methods to identify lncRNAs. In doing so, such deep learning networks extract important and even hidden transient features which would be otherwise hard to get through manual feature extraction approaches. Deep learning achieves this through intensifying the representation power of complex features by multifold [18]. However, very limited development has happened in this direction for plant lncRNA identification. A tool, PlncRNA-HDeep, has been reported very recently which identifies plant lncRNAs applying two alternative deep learning models, a Long Short Term Memory (LSTM) and a Convolution Neural Net (CNN) framework [19] utilizing predicted dataset only for single species. Although, this tools does not provide any prebulit model for classification.
Despite all this progress, a fact remains that there is an enormous scope for improvement for plant specific lncRNA identification. First of all, the existing machine learning approaches now need to move towards deep learning approaches where it would become possible to catch those features which are otherwise difficult to be caught through manual process of feature selection and identification. Traditional features like those based on k-mer and motifs based properties, length, ORF properties, and Kozak sequences may easily fail when sequence boundaries are not clear or shorter incomplete transcripts are reported after assembling. Most of the existing tools report satisfactory scores when tested against some standard benchmarking dataset having full length sequences. Some of these datasets themselves need to be looked into as many instances of them are predicted instances which may lower the quality of training and negatively impact the learning process. As soon as these tools face the real situation where sequence boundaries are not known, be it genomic annotation or annotation of de novo assembled transcriptome. Almost all of them fail, raising a big question mark on their actually applicability.
In the present study, we have developed a bi-modal CNN based deep learning model for lncRNAs identification which parallelly worked upon the nucleotides sequences and structural sequences based information. CNNs have proved themselves tremendously successful in detecting spatial arrangements of patterns which are hard to get recognized otherwise. We expected CNN based deep-learning system to successfully detect such spatial arrangements of patterns in nucleotide sequences as well as structural sequences if they were transformed into pixeled form like images.
Such approach is more important in real case applications like annotation of genomes and de novo assembled transcriptomes where boundaries are not clear and identification needs to be capable enough to work with incomplete sequences. Here, a base frame of 400 bases was used in an overlapping sliding window manner while keeping a cut-off of minimum 200 bases with padding.
The entire system was developed by using experimentally validated instances and the system was tested across four different datasets. The developed tool, DeepPlnc, consistently outperformed the compared recently published tools during a highly comprehensive benchmarking process. The study concludes with a practical application demonstration of the software on the de novo assembled transcriptome data of a Himalayan medicinal plant where it reported its lncRNAs for the first time.

The datasets
In this study, we considered experimentally validated plant lncRNAs as positive instances and fully annotated non-lncRNA (mRNA) as the negative instances. The lncRNA dataset was collected from Ensembl Plants (v51) with known status, PncStress, and PLncDB V2.0 repositories which provided lncRNA data from 51 plant species. The negative examples, i.e. non-lncRNA (mRNA) were downloaded from Ensembl Plants (v51) for corresponding plant species considered for lncRNAs.
On the basis of length, positive and negative datasets were filtered. After filtration, the data was split into train and test sets in the ratio of 50:50. The train datasets were used to build the model and the remaining 50% (test set) was kept untouched during the building of model and were used for performance testing and comparison of the built models. For the performance stability and consistency evaluation, 10-fold independent random trials were performed. Every time, the dataset was split randomly into 50:50 ratio with the first part used to rebuild the model from the scratch and the second part used to test it. In this entire study it was ensured that no instances overlapped between test and train sets in order to ensure unbiased and memory free assessment and model development.
To further evaluate the performance of the DeepPlnc, another dataset was constructed and named as Dataset "B". The Dataset "B" was derived from 16 plant species. This dataset was based on the known and predicted plant lncRNAs. The predicted lncRNAs were found in the similar manner as was done in previously published studies [12,17]. The source for this dataset was from CNIT which had collected lncRNAs with status as "known" from Ensembl plants whereas the predicted lncRNAs were downloaded from PNRD database which is a comprehensive integrated web resource for ncRNAs (http://structuralbiology.cau.edu.cn/PNRD/). The negative instances (mRNAs) for this dataset were downloaded from Ensembl plants (v51). The sequences were randomly selected for the corresponding 16 species considered for the lncRNA data. All the datasets used in the present study have been provided at the DeepPlnc webserver and in Supplementary Table S1.
Both these datasets were further developed into their corresponding chunked datasets whose details are given below elsewhere. The full illustration of the processing of both the datasets is given in
Since each base was converted into a four-dimensional vector, each sequence was vectorized into a matrix with a dimension of 400X4 (Input sequence length X Alphabets of the input). We also used the structural information of these RNA sequences which hold special importance in case of lncRNAs where structural conservation is more expected than sequence conservation. This information of structure for each sequence was obtained in the form of Dot-Bracket representation of the secondary structure of the RNA by using RNAfold of ViennaRNA Package v2.4.18 [29]. We encoded the structure information for each base's corresponding dot-bracket assignment in the sequence with "one-hot" encoding ("(": [1, 0, 0], ".": [0, 1, 0], ")": [0, 0, 1]) with a vector dimension of 400X3 (Input sequence length X Alphabet of Dot-Bracket representation). Here also, padding was done for the shorter input sequences. The one-hot encoded bases and structural information are thus independent of one another. Applying both separately in a bi-modal manner would map the structural information to the sequence information while building the network models. The one-hot encoded metrics were then used as the input for the multi model convolution neural network to train and build the models for lncRNAs identification.
To evaluate the performance of the CNNs, various number of hidden layers were tested and finally two hidden layers were selected in fully connected manner. The number of the nodes across both the dense hidden layers were tuned based on the number of filters used in the convolution layer. All the component layers were optimized for their suitable numbers and component nodes number by iterative additions. Similarly, the kernel size and strides were optimized by trying different values in incremental order. An exponential activation function based single node classification layer was used with Mean Absolute Error loss function to calculate the loss. "Adam" optimizer was used at this point to adjust the weights and learning rates. Adam is an improved stochastic gradient optimizer which automatically adjusts the learning rates, momentum, and regularizers, making it more efficient algorithm. Dropout layer was also used to reduce overfitting in the model. The batch size was set to 100 and the number of epochs was set to 50. The DeepPlnc was implemented in Keras using TensorFlow. All codes were written for Python3. All the hyper-parametric values in the combination were tested and fixed using an in-house developed script to get the best hyperparameter values combination. This entire optimization process was done using two different approaches: Random search optimization and Bayesian optimizations. Full details of convergence to the optimum performance with hyper-parameters optimization is given in Supplementary Table   S2, and Supplementary Figures S1, S2, and S3. The final layer outputs a confidence probability score indicated the confidence of each instance as non-lncRNA or lncRNA. If confidence probability score (Cp) ≥ 0.50 the meant that the corresponding input sequence was identified as lncRNA, else a non-lncRNA for the chunk. For the full length sequences, if the average of the confidence probability of all the sub-sequence chunks for the given sequence was >0.50, the sequence was considered as lncRNA, otherwise as a non-lncRNA.

Performance evaluation criteria
The performance of the developed model was evaluated on the test sets while applying the standard performance metrics. Confusion metrices containing correctly and incorrectly identified test set instances were built for the model. Sensitivity defines the portion of the positive instances which were correctly identified as positives whereas the specificity value describes the portion of negative instances which were correctly identified as negatives. Precision estimates the proportion of positives with respect to the total true and false positives. F1-score was also evaluated which measures the balance between precision and recall. Besides these metrics, Matthew's Correlation Coefficient (MCC) was also considered. MCC is considered as the best metric to fathom the performance where score is equally influenced by all the four confusion matrix classes (true positives, false negatives, true negatives, and false positives) [24]. Better the MCC score, the more robust and balanced model with higher degree of performance consistency.
Performance measures were done using the following equations:

Benchmarking and performance evaluation
To evaluate the DeepPlnc performance in this study, we compared DeepPlnc with five different tools: PLncPRO [13], CNIT [16], PreLnc [15], PlncRNA-Hdeep [19], and RNAplonc [14]. Two different datasets were considered separately for the benchmarking process as already described above (Dataset "A" and Dataset "B"). The four out of the compared five tools viz. PLncPRO, CNIT, PreLnc, and RNAplonc provide pre-built models. Only PlncRNA-HDeep does not provide any pre-built models. To overcome this problem, we built the model as described in its publication [19]. Both PLncPRO and PreLnc provide two different models for lncRNA classification. For PLncPRO, monocot and dicot are the two models available whereas for PreLnc, model built from Zea mays and Arabidopsis thaliana were available for prediction. In DeepPlnc, the sequences are scanned by overlapping windows of 400 bases at a time. For the identification with DeepPlnc, if more than 50% of the 400 bases long overlapping windows scored higher value for lncRNA (Cp ≥ 0.50) then the sequence was labeled as lncRNA, otherwise non-lncRNA. The sequence having length < 200 bases were discarded as they did not fit one of the formally accepted prime criteria to define lncRNAs. Both the considered datasets were tested in the form of full length complete sequences sets as well as randomly chopped sequences to mimic the incomplete sequences.
Besides these benchmarking tests, one more set of benchmarking tests was done where DeepPlnc was trained and tested across the datasets of the compared tools and the obtained performance was compared with the performance values claimed the respective tool.

Data collection and formation of datasets
A total of 9,911 experimentally validated plant ncRNAs sequences were downloaded from Ensembl Plants v51 [20], PncStress [21], and PLncDB V2.0 [22] databases for 51 plant species. Non-lncRNAs (mRNAs) data were downloaded from Ensembl Plants (v51) database for 51 plant species. The ncRNAs were used as positive data and the protein-coding transcript were used as negative data. This data contained 9,911 experimentally validated non-coding sequences. Sequences shorter than 200 nucleotides were discarded, keeping in the view that the minimum length of 200 bases is considered for lncRNAs. After filtration, this data got reduced to 6,563 sequences for 30 species. For non-lncRNA transcripts, 6,563 sequences were randomly obtained for the corresponding species considered for the lncRNA data. This dataset became the foundation data for training the deep learning modules. Initially, the above data was broken into two halves for training and testing datasets creation, containing 3,563 sequences of non-lncRNAs and 3,563 sequences of lncRNAs for training. Remaining 3,000 non-lncRNAs and 3,000 lncRNAs sequences were used purely as the test samples. In this entire study this dataset is referred as Dataset "A". Supplementary were kept and considered as noncoding RNAs. For the rest of the sequences, 6.2% of the total (7,060) for two plant species were reported as "known" sequences which were extracted from CNIT website (http://cnit.noncode.org/CNIT/) and did not overlap with Dataset "A" because it did not belong to the plant species which were considered for the Dataset "A". This way another neutral dataset was created using lncRNAs and non-lncRNAs, referred as Dataset "B" in this entire study.
Supplementary Table S3 depicts the distribution of collected data across the various species in plants for Dataset "B". This dataset was not applied for training purpose and was used exclusively for testing only. Figure 1 provides the illustration about the creations of these datasets.

Parameter optimization and establishment of the final bi-modal CNN
architecture Figure 2 provides the details of the evolved CNN architecture to characterize the lncRNAs in plants.
The framework of the bi-modal CNNs used a 2D CNN layer, one max-pooling layer, and three fully connected layer with one output layer along with two batch normalization layers to control the overfitting during the training process. We used bi-modal CNNs with two different parts which used data encoding by one hot encoding method into binary matrix as shown in Figure 2. In the first part of the bi-modal CNNs (left side of Figure 2a) the nucleotide sequence worked as the input. The    Table S4). 10 times from the dataset, a totally new model was built from the scratch while using randomly selected pairs of train and test sets every time, and the performance was noted. Here also, no overlap existed between the train and test sets in order to ensure no bias and memory. As can be found from the Figure 3(b), the developed system performed consistently good and scored high on all these training and testing combinations where the average accuracy consistently scored above 97% and breached 98% for the complete length input sequences. Also, not much fluctuations in performance was observed and consistently performed around 98% accuracy, underling the robustness and consistency of the approach. As transpires from the scoring obtained for the various metrics, the model performed consistent and well balanced, besides scoring high accuracy across all the test data sets which were having experimentally validated cases of non-lncRNAs and lncRNAs.

Benchmarking: DeepPlnc consistently outperformed all the compared tools
A highly comprehensive benchmarking study was performed where DeepPlnc was compared with five different recently published tools to detect plant lncRNAs: PlncPRO, CNIT, PreLnc, PlncRNA-HDeep, and RNAplonc. Besides this, the benchmarking also considered two different datasets ("A", "B"; already detailed above) to carry out a totally unbiased assessment of performance of these tools on different datasets. Each dataset was further divided into two sub-categories, chunked form containing incomplete RNA sequences of size 400 bases (shorter sequences were padded) and full length sequences. This is important to note that the chunked data also provided a frame to mimic the incomplete/truncated length sequences which usually appear in large number in the de novo assembled transcriptome data as well as provide a realistic treatment to deal with those conditions where boundaries of lncRNAs are not known, like genomic annotations.
All these six software were tested across the two datasets ("A" and "B") where DeepPlnc consistently outperformed all the compared tools for all these datasets, for both of their forms (incomplete chunked sequences and complete sequences), and for all of the performance metrics considered. This must be noted the test set from Dataset "A" was never seen before by the DeepPlnc as it was kept aside exclusively for the validation purpose only. The Dataset B worked exclusively as a totally validation test set as no part of it was used even for training purpose for DeepPlnc. This way it was ensured that both the test sets were totally unseen and fit for unbiased assessment. PLncPRO and PreLnc emerged as the next best AUC scorers. What is also noteworthy besides the high accuracy and F1-score values of DeepPlnc across all the datasets, is the fact that its MCC values always stood out as the highest. DeepPlnc maintained least variability and dispersion of performance scores and continued to display its strong balance in detecting the positive and negative instances with high and similar level of precision. A higher MCC value is considered as the true reflector of the robustness, consistency, balance, and performance as it becomes higher only when all the four metrics of performance measure score high [24].The full details and data for the benchmarking studies are given in Supplementary Table S5. Having such consistent performance lead where accuracy lead ranged from 4.6 to 10% from the second best performing tool is itself is a big margin. The observed performance lead of DeepPlnc was evaluated for significance using Chi Square test on the performance contingency tables having True Positives, False Positives, True Negatives and False Negatives for the compared tools. For all the comparisons, highly significant pvalues were observed (p-value << 0.01), clearly suggesting that the observed performance lead of DeepPlnc was significantly higher than the compared tools. And when it comes to application in the genomic setup, such margin in performance becomes much bigger and significant where a margin of just 1% itself may cause several false cases.
Besides the above mentioned comprehensive benchmarking tests, one more benchmarking test was performed. In this one-to-one comparison, DeepPlnc used the compared tool's dataset to build its models from their train sets and tested its performance on their test sets. This way five times DeepPlnc models were built from the scratch and tested for the respectively compared tools. The observed performance of DeepPlnc was compared with the claimed performance of the compared tools, and for almost all the compared tools, DeepPlnc's performance was superior (Supplementary Figure S4). It was also noted that all these compared software had far bigger gap between sensitivity and specificity values, suggesting even better balanced performance by DeepPlnc. Only PreLnc's claimed accuracy was found slightly higher than DeepPlnc. PreLnc scored just 0.74% higher accuracy than DeepPlnc, but also displayed huge difference of 7.6% between its sensitivity and specificity values, suggesting imbalanced performance by PreLnc. In our previous benchmarking with both datasets, A and B, PreLnc had not even figured in the top three tools in terms of performance. And on the chunked incomplete sequences, its performance was among the bottom two. Thus, it was needed to verify the performance of PreLnc on the datasets used by other tools and compare its performance there with DeepPlnc's performance on the same datasets.
PreLnc was run across the datasets of all the remaining compared tools and its performance was compared with DeepPlnc's performance for the same datasets and compared tools. PreLnc was found outperformed by DeepPlnc across all these four tool's datasets (Supplementary Figure S4 (C)), besides being outperformed during bechnmarking with Datasets A and B.
This way, through a very comprehensive benchmarking across a wide range of datasets, it was established that DeepPlnc consistently performed as well as outperformed all the compared tools.

Rheum australe transcriptome data
To demonstrate the applicability of DeepPlnc, it was run against the recently published transcriptome data of an endangered Himalayan medicinal plant, Rheum australe whose genome sequence is still not known and absolutely no information is there for lncRNAs [25]. and predicted genes (10,282). Besides them, without any significant BLAST hit, a total of 12,110 transcripts stood as the potential candidates for lncRNA annotation. In this study, before going into further analysis, for the sake of clarity we removed the predicted and hypothetical protein coding sequences form the study because there was no experimental evidence that if such sequences really coded proteins or not and fully annotated protein coding sequences. The well annotated non-lncRNAs were kept knowingly as its characterization would provide the extent of error done by the software on the false positive front. A total of 2,991 out of 3,912 non-lncRNAs with minimum length of 200 bases were selected to annotate the non-lncRNAs whereas 921 sequences were discarded due to shorter length. In the similar way, a total of 6,598 transcripts were taken from 12,110 transcripts as input to DeepPlnc with minimum length of 200 bases to annotate the lncRNAs and rest of the 5,512 sequences were discarded. Figure 6 depicts the flow diagram representation of the process of lncRNA identification in Rheum australe and the result at its every step of execution.
After this filtering process we had 2,991 transcripts as non-lncRNAs candidate and 6,598 transcript as potential lncRNAs candidate. A total of 5,954 transcripts out of 6,598 potential lncRNAs candidate were identified as lncRNAs and 644 transcript were identified as non-lncRNAs by DeepPlnc. Similarly, a total of 2,268 transcript were identified as non-lncRNAs out of 2,991 non-lncRNA candidate transcripts whereas 723 transcripts were wrongly classified by DeepPlnc.
However, ~90% of the lncRNA transcripts were classified as the lncRNAs. As implemented in DeepPlnc, sequences having average confidence probability ≥0.50 were classified as lncRNAs and if the average confidence probability was < 0.50 the sequence was classified as a non-lncRNA. In order to evaluate how other lncRNA identification tools characterized the data and how much wrong classifications they did to the well annotated non-lncRNA data, the two good performing tools from the benchmarking study described above, PLncPRO and CNIT, were also used with their default parameters to annotate this transcriptome data (Supplementary Table S6). CNIT classified 5,118 transcript as lncRNAs out of 6,598 potential lncRNAs and 1,480 transcripts as non-lncRNAs.
Whereas, only 1,630 non-lncRNAs were correctly identified out of 2,991 non-lncRNAs candidate.
Similarly, PLncPRO annotated 4,421 transcript as lncRNAs out of 6,598 potential lncRNAs and identified only 1,691 transcript as non-coding transcript out of 2,991 non-lncRNAs candidate. CNIT. The last part of Figure 6 provides the overall result produced by all these tools, where it is very clear that DeepPlnc gives at least half the lesser false characterizations than the other two tools, which is a highly significant margin while annotating any genomic system. These values underlined the levels of confidence one could put on these tools when performing the annotation job. Also, it is possible that some of these transcripts annotated in the root study may not be really coding in actual scenario and some a noncoding pseudogene transcripts as many of them are supposedly incomplete sequences.
Further delving on the transcriptome data was done for expression measures to assess how well the usual observation that lncRNAs are usually lesser expressed than non-lncRNAs. This was reflected by the identified lncRNAs in the transcripts of Rheum australe. The expression level which was measured in Fragments Per Kilobase of transcript per Million mapped reads (FPKM) of all the identified lncRNAs was compared with that of non-lncRNAs were calculated using RSEM [26] (Supplementary Table S7). The expression level of lncRNAs was found indeed lesser than non-lncRNAs at 4°C (t-test p-value 0.01). However, this gap in the expression was not that sharp at 25°C presumably due to lowering of the gene expression at higher temperature which is noticed in this Himalayan plant in response of higher temperature. The lncRNAs were found higher expressed at 4 o C than 25 o C (t-test p-value of 0.007). Since these tests were sensitive to some extreme values due to which mean values get influenced, medians/ranks based non-parametric ranking tests (Mann Whitney test) were also done. All these observation returned much higher significance, suggesting significant differences between lncRNAs expression and non-lncRNAs expression. Significantly higher expression of the lncRNAs was observed at the lower temperature for this plant.

Webserver and standalone implementation
DeepPlnc has been made freely available a very simple to use webserver as well as a stand-alone program. The user needs to paste the RNA sequences in FASTA format into the text box or upload the RNA sequences in FASTA format and then click the submit button for the identification. This data is run through the trained bi-modal CNN which generates a relative probability scores for every position in the sequences. Later, the result page appears from where results can be downloaded in a tabular format and sequence wise distribution of probability score in the form of line plot as well as violin plot is displayed on the result page in an interactive form. The server has been implemented using D3JS visualization library, Python, JavaScript, PHP, and HTML5. Figure 8 provides an overview of DeepPlnc webserver implementation. Besides this, a stand-alone version is also provided at GitHub at https://github.com/SCBB-LAB/DeepPlnc/.

Conclusion
The

Availability of data and materials
All the secondary data used in this study were publicly available and their due references and sources have been provided. All data and information generated/used, methodology related details etc. have also been made available in the Supplementary data files provided along with and also made available through the related open access server at https://scbb.ihbt.res.in/DeepPlnc/. The claimed performance by the respective tools, B) DeepPlnc's performance against the same tool when the tool's datasets were used for training and testing. DeepPlnc outperformed all the tools. C)

Funding
Initially, on its own dataset PreLnc claimed higher accuracy than DeepPlnc. However, when both the tools were compared for their performance on the remaining four tools' datasets one-to-one, it emerged that DeepPlnc consistently outperformed PreLnc not only on Datasets A and B, but also on all the four tool's datasets too.      Compared to other tools, performance of DeepPlnc was found much superior. Its wrong classification for the well annotated RNAs as lncRNAs was the least. Also, it classified most of the potential lncRNAs correctly where also it performed better than others.