Abstract
Motivation LncRNAs are much more versatile, and are involved in many more regulatory roles inside the cell, than previously believed. Existing databases lack consistency in lncRNA annotations, and the functionality of over 95% of the known lncRNAs is yet to be established. Identifying lncRNA transcripts involves discriminating them from their coding counterparts, which can be done with traditional experimental approaches or via in silico methods. The latter approach employs various computational algorithms, including machine learning classifiers, to predict the lncRNA-forming potential of a given transcript. Such approaches provide a faster and more economical alternative to experimental methods. Current in silico methods mainly use primary-sequence-based features to build predictive models, which limits their accuracy and robustness. Moreover, many of these tools rely on reference-genome-based features, making them unsuitable for non-model species. Hence, there is a need to comprehensively evaluate the efficacy of different predictive features for building computational models. Additionally, effective models will have to provide maximum prediction performance using the fewest features, in a species-agnostic manner.
It is popularly known in the protein world that “structure is function”. This also applies to lncRNAs, as their functional mechanisms are similar to those of proteins. Generally, lncRNAs function by binding structurally to their target proteins or nucleic acids, forming complexes. The secondary structures of lncRNAs are modular, providing interaction sites for their interactome, which is made up of DNA, RNA, and proteins. Through these interactions, they epigenetically regulate cellular biology, thereby forming a layer of genomic programming on top of the coding genes. We demonstrate that, in addition to using the transcript sequence, we can provide comprehensive functional annotation of lncRNAs by collating their interactome and secondary-structure information.
Results Here, we evaluated an exhaustive list of sequence-based, secondary-structure, interactome, and physicochemical features for their ability to predict the lncRNA potential of a transcript. Based on our analysis, we built different machine learning models using the optimum feature set. We found that our models perform on par with or better than state-of-the-art methods, with AUC values of over 0.9 for a diverse collection of species tested. Finally, we built a pipeline called linc2function that conveniently provides, in a single window, the information necessary to functionally annotate a lncRNA.
Availability The source code is available for use under the MIT license in standalone mode, and the tool is also accessible as a web server (https://bioinformaticslab.erc.monash.edu/linc2function).
Competing Interest Statement
The authors have declared no competing interest.