PT  - JOURNAL ARTICLE
AU  - Aparajita Dutta
AU  - Kusum Kumari Singh
AU  - Ashish Anand
TI  - Deep learning models for identification of splice junctions across species
AID  - 10.1101/2021.06.13.448260
DP  - 2021 Jan 01
TA  - bioRxiv
PG  - 2021.06.13.448260
4099  - http://biorxiv.org/content/early/2021/06/14/2021.06.13.448260.short
4100  - http://biorxiv.org/content/early/2021/06/14/2021.06.13.448260.full
AB  - Deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been frequently used to identify splice sites from genome sequences. Most deep learning applications identify splice sites from a single species. Furthermore, the models generally identify and interpret only the canonical splice sites. However, a model capable of identifying both canonical and non-canonical splice sites from multiple species with comparable accuracy is more generalizable and robust. We select several state-of-the-art CNN and RNN models and compare their performance in identifying novel canonical and non-canonical splice sites in Homo sapiens, Mus musculus, and Drosophila melanogaster. The RNN-based model named SpliceViNCI outperforms its counterparts in identifying splice sites from multiple species as well as from unseen species. SpliceViNCI maintains its performance when trained with imbalanced data, making it more robust. We observe that all the models perform better when trained with more than one species. SpliceViNCI also outperforms its counterparts when trained with such an augmented dataset. We further extract and compare the features learned by SpliceViNCI when trained with single and multiple species. We validate the extracted features with knowledge from the literature. Competing Interest Statement: The authors have declared no competing interest.