Abstract
Motivation Protein coding regions prediction is a very important but overlooked subtask for tasks such as prediction of complete gene structure, coding/noncoding RNA. Many machine learning methods have been proposed for this problem, they first encode a biological sequence into numerical values and then feed them into a classifier for final prediction. However, encoding schemes directly influence the classifier’s capability to capture coding features and how to choose a proper encoding scheme remains uncertain. Recently, we proposed a protein coding region prediction method in transcript sequences based on a bidirectional recurrent neural network with non-overlapping kmer, and achieved considerable improvement over existing methods, but there is still much room to improve the performance. In fact, kmer features that count the occurrence frequency of trinucleotides only reflect the local sequence order information between the most contiguous nucleotides, which loses almost all the global sequence order information.
Results We here present a deep learning framework with hybrid encoding for protein coding regions prediction in biological sequences, which effectively exploit global sequence order information, non-overlapping kmer features and statistical dependencies among coding labels. Evaluated on genomic and transcript sequences, our proposed method significantly outperforms existing state-of-the-art methods.
Availability The source code and the dataset used in the paper are publicly available at: https://github.com/xdcwei/DeepCoding/.
Contact jyzhang{at}mail.xidian.edu.cn
Supplementary information Supplementary data are available at Bioinformatics online.
Competing Interest Statement
The authors have declared no competing interest.