On the Depth of Deep Learning Models for Splice Site Identification

The success of deep learning has been shown in various fields, including computer vision, speech recognition, natural language processing, and bioinformatics. The advances of deep learning in computer vision have been an important source of inspiration for other research fields. The objective of this work is to adapt well-known deep learning models borrowed from computer vision, such as VGGNet, ResNet, and AlexNet, to the classification of biological sequences. In particular, we are interested in the task of splice site identification based on raw DNA sequences. We focus on the role of architecture depth in model training and classification performance. We show that deep learning models outperform traditional classification methods (SVM, Random Forests, and Logistic Regression) for large training sets of raw DNA sequences. Three model families are analyzed in this work, namely VGGNet, AlexNet, and ResNet, and three depth levels are defined for each family. The models are benchmarked using the following metrics: area under the ROC curve (AUC), number of model parameters, and number of floating-point operations. Our extensive experimental evaluation shows that shallow architectures have an overall better performance than deep models. We introduce a shallow version of ResNet, named S-ResNet, and show that it gives a good trade-off between model complexity and classification performance.

Author summary

Deep learning has been widely applied to various fields in research and industry. It has also been successfully applied to genomics, and in particular to splice site identification. We are interested in the use of advanced neural networks borrowed from computer vision. We explored well-known models and their usability for the problem of splice site identification from raw sequences. Our extensive experimental analysis shows that shallow models outperform deep models.
We introduce a new model called S-ResNet, which gives a good trade-off between computational complexity and classification accuracy.

The field of biology has witnessed, in the last decades, important advances thanks to major technological breakthroughs. Next Generation Sequencing (NGS) and other new tools have made fundamental changes in the understanding of numerous biological functions. The flood of biological data generated by NGS technology has motivated the use of data-driven approaches such as deep neural networks to help understand complex problems and extract new biological knowledge. The ability of deep learning to cope with a variety of data formats has allowed handling biological sequences such as DNA, RNA, or amino acids directly, without the need for manual feature engineering. One of the challenges in bioinformatics is the accurate identification of splice sites in DNA sequences. The discovery of splicing has elucidated the diversity of protein production and explained the increased coding potential of the genome.

The DNA sequence is categorized in about 1000 classes [1]. In order to learn from such a large amount of data, a deep learning architecture requires a large model capacity. Models with deep architectures have been proposed in the literature, attaining up to 1000 layers [2][3][4]. However, this depth has raised a degradation problem, manifested as lower training accuracy when the number of layers is greatly increased. Additionally, training such models is challenging. ResNet models introduced residual blocks, connected with either an identity mapping or a projection shortcut [3]. The ResNet structure helps ease the training of deep architectures and obtain good accuracy even with a large number of layers. Deep learning model architectures are designed based on the learning task, the number of parameters, and the size of the dataset.
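For readers unfamiliar with residual blocks, the identity/projection shortcut mechanism can be sketched in a few lines. The 1D convolution and `residual_block` below are a minimal numpy illustration of the idea from [3], not the models evaluated in this paper:

```python
import numpy as np

def conv1d(x, w):
    """'Same' 1D convolution: x has shape (L, C_in), w has shape (z, C_in, C_out)."""
    z = w.shape[0]
    pad = z // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], w.shape[2]))
    for i in range(x.shape[0]):
        out[i] = np.tensordot(xp[i:i + z], w, axes=([0, 1], [0, 1]))
    return out

def residual_block(x, w1, w2, w_proj=None):
    """y = ReLU(F(x) + shortcut(x)): identity shortcut when the channel
    counts match, projection (1x1 convolution) otherwise, following the
    two shortcut options of ResNet [3]."""
    h = np.maximum(conv1d(x, w1), 0.0)   # first convolution + ReLU
    h = conv1d(h, w2)                    # second convolution
    s = x if w_proj is None else conv1d(x, w_proj)
    return np.maximum(h + s, 0.0)        # add the shortcut, then ReLU
```

With a zero residual branch, the block reduces to the shortcut path, which is exactly what makes very deep stacks of such blocks easy to optimize.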
Well-known deep learning models from computer vision, e.g., ResNet and VGGNet [2], have been reused to build advanced systems for text processing, such as the Very Deep Convolutional Network (VDCNN) [5], which operates directly at the character level. Text modeling and sentence classification have also been tackled with a small number of convolution layers, such as one, two, or six layers [6][7][8]. In recent work, the trend has gone toward evaluating the impact of depth for text classification [7] and enhancing the available techniques, for example by generalizing the max-pooling operation and replacing it with k-max pooling [7], resulting in Dynamic Convolutional Neural Networks (DCNN) for the semantic modeling of sentences. Deep learning has also been applied to splice site identification and has shown good performance compared to other machine learning approaches [9][10][11]. However, most existing deep learning methods are limited to simple models, and no advanced architectures have been evaluated or proposed. In this paper, we explore well-known deep learning architectures that were initially introduced for computer vision tasks and adapt them for the prediction of splice sites. The findings of this work can be useful for other applications in bioinformatics.

The hyper-parameters were set manually, based on the validation set, after a few trials to find the best combination. The choices were unified across all models as much as possible, in order to remove external factors that could affect the evaluation and to keep mainly the model depth for comparison. The dataset was split into training and validation sets with percentages of 0.4 and 0.2 respectively, forming 0.6 of the original dataset; the rest was used as the test set. All models were trained with He normal initialization [13] and a batch size of 250.
ReLU was used as the activation function, and the last layer was fully connected with two nodes and a softmax activation.
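A minimal sketch of this setup, assuming one-hot encoding of the raw DNA sequences and the 0.4/0.2/0.4 split described above (the `one_hot` and `split_dataset` helpers are hypothetical names for illustration, not the authors' code):

```python
import numpy as np

# Map each nucleotide to a one-hot vector; raw sequences are the only input,
# with no manual feature engineering.
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    m = np.zeros((len(seq), 4), dtype=np.float32)
    for i, b in enumerate(seq):
        if b in BASES:          # unknown symbols (e.g. N) stay all-zero
            m[i, BASES[b]] = 1.0
    return m

def split_dataset(X, y, seed=0):
    """40% training / 20% validation / 40% test, as in our setup."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.4 * len(X)), int(0.2 * len(X))
    tr = idx[:n_train]
    va = idx[n_train:n_train + n_val]
    te = idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```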

Stochastic Gradient Descent (SGD) [14] was used to optimize the cross-entropy loss function with a learning rate of 0.01. The number of filters in the first convolution layer was chosen to be 8, and this number is doubled in each subsequent block. We define a block as any number of convolution layers between two max-pooling layers. Dropout with a fraction of 0.5 was used as a regularizer in most of the architectures, as detailed below. All the experiments were trained and tested using the Splice Site Recognition (SSR) dataset. The task is to locate the boundaries between coding sequences (exons) and non-coding sequences (introns). The same dataset has been previously used and is publicly available [15].
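The filter-doubling scheme and the softmax cross-entropy loss optimized by SGD can be made concrete with a short sketch (hypothetical helper names; the actual models additionally contain the convolutions, pooling, and the SGD update itself):

```python
import numpy as np

def filters_per_block(n_blocks, base=8):
    """Filter counts used here: start at 8, double in each subsequent block."""
    return [base * 2 ** i for i in range(n_blocks)]

def softmax(z):
    """Row-wise softmax over the two output nodes."""
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """The loss optimized with SGD (learning rate 0.01 in our setup)."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))
```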

The number of convolution filters started at 8 and was doubled in each subsequent block. In addition, a stride of 1 was used, giving importance to every position of the input sequence.

• VGGNet: The original architecture stacks blocks of convolution layers with filter size z=3, each block followed by a max-pooling layer with stride s=2, and ends with three fully connected layers [2]. In this work, the shallow and deep models were evaluated with 8 and 16 convolution layers only (named VGGNet A and VGGNet E, following the naming used in the original work). The convolution layers between two max-pooling layers are considered a block and share the same number of filters; this number is doubled in the following block. Here the number of filters started at 8, with the same z=3 and s=1 in all layers. Dropout with ratio 0.5 was applied after the first two fully connected layers to regularize the training.
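Under these conventions (z=3, s=1, filters starting at 8 and doubled per block), the parameter count of the convolutional part, one of our benchmark metrics, can be estimated with a small sketch; `vgg_like_param_count` is a hypothetical helper, not the exact VGGNet A/E configuration:

```python
def conv1d_params(z, c_in, c_out):
    """Weights plus biases of a 1D convolution with kernel size z."""
    return z * c_in * c_out + c_out

def vgg_like_param_count(n_blocks, convs_per_block, z=3, c_in=4, base=8):
    """Parameter count of the convolutional part of a VGG-style model:
    each block keeps one filter count, doubled in the next block
    (z=3, stride 1 throughout, one-hot DNA input with 4 channels)."""
    total, c, filters = 0, c_in, base
    for _ in range(n_blocks):
        for _ in range(convs_per_block):
            total += conv1d_params(z, c, filters)
            c = filters
        filters *= 2            # double the filter count in the next block
    return total
```

Such counts grow quickly with depth, which is why the deeper variants pay a computational cost that, as shown later, is not matched by a performance gain.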

• ResNet: The novelty in ResNet [3] is the introduction of residual shortcuts.

Figure caption: The numbers inside the convolution layer boxes show filter size x number of filters; in the max-pooling layers, filter size x stride. We followed the same notation as the original ResNet work for the shortcuts, using dotted lines for projection shortcuts and solid lines for identity shortcuts.
In addition, shortcuts are introduced from the input data rather than from the output of the first convolution layer.

Classical machine learning methods such as Logistic Regression (LG) [16], Support

Vector Machine (SVM) [16], [17], [15] and Random Forests (RF) were used as classical baselines.

Because the best training convergence differs from one architecture to another, and the best performance is not necessarily achieved at 300 epochs, early stopping with a patience of 10 epochs is used in the second set of experiments. The loss on the validation set is used to decide whether to stop the training when no improvement is noticed. The average and standard deviation of the metrics were computed over 10 repetitions of the experiments, and the results are presented in Table 2. Early stopping improved the performance (accuracy and AUC) for all the models. It is known that early stopping helps prevent over-fitting.

Typically, in computer vision, speech, or text datasets, the number of classes is large, varying between tens and hundreds of classes or more, whilst the splice site dataset used in this paper has two classes (splice and non-splice DNA sequences). We plan to further explore this justification in future work by considering other genomic datasets with a larger number of classes. AlexNet showed performance close to S-ResNet and equivalent to ShallowNet, but exhibited a high computational cost due to its increased number of parameters and operations; S-ResNet offers a better trade-off between accuracy and cost. In the original ResNet work [3], on an imaging dataset, ResNet-34 has smaller training and validation errors than ResNet-18. This result also holds in our experiments for the training cost (Figure 6 (a)), but not consistently for the validation and test results, where the two curves keep fluctuating (Figure 6 (b,c)).

Over-fitting

The similarity between the training and validation curves (or, in some cases, the improvement on the validation data) indicates that over-fitting is well controlled in our analysis.
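The early-stopping criterion used here (a patience of 10 epochs on the validation loss) can be sketched as follows; `train_with_early_stopping` and its `step` callback are hypothetical names for illustration:

```python
def train_with_early_stopping(step, max_epochs=300, patience=10):
    """Stop when the validation loss has not improved for `patience` epochs.
    `step(epoch)` runs one training epoch and returns the validation loss."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        val_loss = step(epoch)
        if val_loss < best:
            best, best_epoch = val_loss, epoch   # new best: reset the counter
        elif epoch - best_epoch >= patience:
            break                                # no improvement for `patience` epochs
    return best, best_epoch
```

The training runs for at most 300 epochs, but in practice stops shortly after the validation loss plateaus, which is one of the mechanisms keeping over-fitting in check.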

Dropout was used as a regularizer in all the experiments except S-ResNet. We further investigated S-ResNet and tested its performance with a dropout regularizer (Figure 7a).

Overall, models with shallow architectures showed better performance than models with deep architectures. S-ResNet ranked top, with slightly better performance than the second-best model (ShallowNet). We conclude that splice site classification does not require a deep architecture: increasing the number of convolution layers merely increases the computational cost without any performance gain.

As future work, we plan to further investigate the role of depth in other classification tasks using genomic sequence data. We will consider problems with a larger number of classes, and we will also explore other models based on Recurrent Neural Networks.