A deep learning framework with hybrid encoding for protein coding regions prediction in biological sequences

Chao Wei; Junying Zhang; Xiguo Yuan; Zongzhen He; Guojun Liu

doi:10.1101/2020.11.07.372524

Abstract

Motivation Prediction of protein coding regions is an important but often overlooked subtask of problems such as the prediction of complete gene structure and of coding/noncoding RNA. Many machine learning methods have been proposed for this problem; they first encode a biological sequence into numerical values and then feed these values into a classifier for the final prediction. However, the encoding scheme directly influences the classifier’s ability to capture coding features, and how to choose a proper encoding scheme remains an open question. Recently, we proposed a method for predicting protein coding regions in transcript sequences based on a bidirectional recurrent neural network with non-overlapping kmer features, which achieved considerable improvement over existing methods, but there is still much room for improvement. In particular, kmer features that count the occurrence frequencies of trinucleotides reflect only the local sequence order information among adjacent nucleotides and lose almost all of the global sequence order information.
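The loss of global order in trinucleotide frequency features can be illustrated with a minimal Python sketch. This is an illustration only, not the DeepCoding implementation: the function names `non_overlapping_kmers` and `kmer_frequencies`, the `frame` parameter, and the two toy sequences are assumptions introduced here. The two sequences contain the same trinucleotides in a different order, so their frequency features are identical even though the sequences differ.

```python
from collections import Counter


def non_overlapping_kmers(seq, k=3, frame=0):
    """Split seq into consecutive non-overlapping k-mers starting at offset `frame`.

    A trailing fragment shorter than k is dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(frame, len(seq) - k + 1, k)]


def kmer_frequencies(seq, k=3, frame=0):
    """Relative occurrence frequency of each non-overlapping k-mer in seq."""
    kmers = non_overlapping_kmers(seq, k, frame)
    counts = Counter(kmers)
    total = len(kmers)
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical toy sequences: same trinucleotides, different order.
seq_a = "ATGGCCAAATGA"  # ATG GCC AAA TGA
seq_b = "AAAGCCTGAATG"  # AAA GCC TGA ATG

# The ordered k-mer lists differ, but the frequency features do not.
same_order = non_overlapping_kmers(seq_a) == non_overlapping_kmers(seq_b)
same_freqs = kmer_frequencies(seq_a) == kmer_frequencies(seq_b)
print(same_order, same_freqs)
```

A frequency vector alone therefore cannot distinguish the two sequences, whereas an encoding that keeps the ordered list of kmers can.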

Results We present a deep learning framework with hybrid encoding for predicting protein coding regions in biological sequences, which effectively exploits global sequence order information, non-overlapping kmer features and statistical dependencies among coding labels. Evaluated on genomic and transcript sequences, the proposed method significantly outperforms existing state-of-the-art methods.

Availability The source code and the dataset used in the paper are publicly available at https://github.com/xdcwei/DeepCoding/.

Contact jyzhang{at}mail.xidian.edu.cn

Supplementary information Supplementary data are available at Bioinformatics online.

Competing Interest Statement

The authors have declared no competing interest.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.