DETIRE: A Hybrid Deep Learning Model for Identifying Viral Sequences from Metagenomes

A metagenome contains all DNA sequences from an environmental sample, including those of viruses, bacteria, fungi, actinomycetes and so on. Because viruses are highly abundant and, as major pathogens, have caused vast mortality and morbidity in human society throughout history, detecting viruses in metagenomes plays a crucial role in analysing the viral component of samples and is the very first step in clinical diagnosis. However, detecting viral fragments directly from metagenomes remains difficult because of the huge number of short sequences they contain. In this paper, a hybrid Deep lEarning model for idenTifying vIral sequences fRom mEtagenomes (DETIRE) is proposed to solve this problem. First, a graph-based nucleotide sequence embedding strategy is used to enrich the representation of DNA sequences by training an embedding matrix. Then, spatial and sequential features are extracted by trained CNN and BiLSTM networks, respectively, to improve the feature representation of short sequences. Finally, the two sets of features are combined by weighting for the final decision. Trained on 220,000 sequences of 500bp subsampled from the Virus and Host RefSeq genomes, DETIRE identifies more short viral sequences (<1,000bp) than three recent methods, DeepVirFinder, PPR-Meta and CHEER. DETIRE is freely available at https://github.com/crazyinter/DETIRE.


Introduction
High-throughput sequencing, also called next-generation sequencing (NGS), makes it possible to obtain whole nucleotide sequences directly from environmental samples and has played an important role in many fields, such as pathogen detection [1,2] and human disease analysis [3][4][5]. In these applications, detecting viruses in metagenomic sequences is becoming increasingly essential because it is the first step in the analysis of viruses [6]. However, it remains a difficult task because viruses have relatively low abundances compared with bacteria and high mutation rates.
To overcome this challenge, several methods have been proposed in the past several years to identify viruses from metagenomes. They can be categorized into similarity-based, machine learning-based and deep learning-based methods. Similarity-based methods generally map a query sequence to a reference data set and assign it to the reference with the highest similarity score [7][8][9][10][11][12][13][14][15][16][17]. These methods, however, suffer from long execution times during the mapping process and can hardly detect short viral sequences because of the limited features such sequences carry. Unlike similarity-based methods, machine learning-based methods extract human-designed features from DNA sequences and classify them with a well-trained classifier, such as VirFinder [18]
DB s=RefSeq). It has been shown that the smaller the difference between the lengths of the sequences in the training and testing sets, the better the classification result [19]. Thus all 13,274 viral sequences were split into non-overlapping fragments of 500bp, resulting in 778,390 fragments in total. The whole set of 500bp viral fragments, combined with 770,000 sequences of 500bp subsampled from the 4,410 prokaryotic host RefSeq genomes supplied in VirFinder [19], is called the GCN-training dataset and was used to train the GCN-based model. Then 110,000 viral sequences and 110,000 host sequences were randomly subsampled from these fragments and divided into the training and testing datasets at a ratio of 10:
DETIRE utilizes a two-stage strategy for virus prediction, consisting of GCN-based sequence embedding and deep learning-based sequence classification (Fig 1). Before embedding, every 3-mer fragment is generated by a three-base sliding window moving from the head to the tail of the sequence with a stride of one.
For example, the original nucleotide sequence 'ATTGCCTGACAT' is cut into 'ATT, TTG, TGC, GCC, CCT, CTG, TGA, GAC, ACA, CAT'.
Fig 1. The workflow of DETIRE. DETIRE contains a GCN-based sequence embedding model and a deep learning-based classifier that automatically learns features of viral sequences and identifies them directly from metagenomes. First, the graph neural network learns high-level representations of the 3-mer fragments in each sequence through supervised backpropagation. Then DETIRE extracts spatial and sequential features with the designed CNN and BiLSTM models, respectively. Finally, the learned features are combined to make the final decision through several dense layers and a softmax layer.
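The sliding-window tokenization described above can be sketched in a few lines of Python. This is a minimal illustration rather than the DETIRE implementation; the function name `kmers` and its parameters are hypothetical.

```python
def kmers(sequence: str, k: int = 3, stride: int = 1) -> list[str]:
    """Return the k-mers produced by sliding a window of width k over the sequence."""
    # The last valid window starts at len(sequence) - k.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


tokens = kmers("ATTGCCTGACAT")
print(tokens)
# ['ATT', 'TTG', 'TGC', 'GCC', 'CCT', 'CTG', 'TGA', 'GAC', 'ACA', 'CAT']
```

Under this scheme a sequence of length n yields n - 3 + 1 overlapping 3-mers, so a 500bp fragment produces 498 tokens.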
In the process of sequence embedding, TextGCN [28] is utilized to learn meaningful high-level representations of all 3-mer fragments from every nucleotide sequence. First, a heterogeneous graph containing 3-mer-fragment nodes is built in order to explicitly model the global co-occurrence between these 3-mer fragments. The graph is then fed into a simple two-layer GCN [29]. The first layer constructs the nodes and edges. Each nucleotide sequence in the GCN-training dataset and each unique 3-mer fragment it contains is represented by its own node. There are no edges between sequence nodes. Edges are built between 3-mer fragments and the sequences they come from. The weight of the edge between a sequence node and a 3-mer fragment node is determined by the term frequency-inverse document frequency (TF-IDF) [30] of the fragment in the sequence, where term frequency is the frequency with which the 3-mer fragment appears in the sequence and inverse document frequency is the logarithmically scaled inverse fraction of the number of sequences that contain the 3-mer fragment. Point-wise mutual information (PMI) [31], a popular measure of word association, is employed to calculate the weights between two fragment nodes. The second layer learns the fragment and sequence embeddings in each node. Finally, these nodes are fed into a softmax classifier, and the cross-entropy error over all labelled sequences is defined as the cost function [32]. After 500 epochs of backpropagation with the Adam optimization algorithm [33], a learning rate of 0.022 and a dropout rate of 0.5, the 30-dimensional representations of all 3-mer fragments in the second layer of the GCN are embedded into the sequences in the training and testing datasets.
In the process of sequence classification, two parallel deep learning models, a CNN and a BiLSTM, are used to learn the spatial and sequential features of sequences, respectively.
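As a rough illustration of the two kinds of edge weights used in the embedding graph, the sketch below computes TF-IDF weights for sequence-to-fragment edges and PMI weights for fragment-to-fragment edges. It is a minimal sketch under stated assumptions, not the authors' code: term frequency is taken as the raw count, co-occurrence is counted over a sliding window of tokens with only positive PMI values kept (as in TextGCN), and the function names and the window size of 5 are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def kmers(seq, k=3):
    """Overlapping k-mers of a sequence (stride of one)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_weights(sequences, k=3):
    """Weight of each (sequence index, k-mer) edge: count * log(N / document frequency)."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    n_docs = len(docs)
    # Number of sequences that contain each k-mer at least once.
    doc_freq = Counter(kmer for doc in docs for kmer in doc)
    return {
        (i, kmer): count * math.log(n_docs / doc_freq[kmer])
        for i, doc in enumerate(docs)
        for kmer, count in doc.items()
    }


def pmi_weights(sequences, k=3, window=5):
    """Positive PMI between k-mers that co-occur within a sliding window of tokens."""
    single, pair, n_windows = Counter(), Counter(), 0
    for s in sequences:
        tokens = kmers(s, k)
        if not tokens:
            continue
        for start in range(max(1, len(tokens) - window + 1)):
            unique = sorted(set(tokens[start:start + window]))
            n_windows += 1
            single.update(unique)
            pair.update(combinations(unique, 2))
    weights = {}
    for (a, b), n_ab in pair.items():
        # PMI = log(p(a, b) / (p(a) * p(b))) with probabilities estimated per window.
        pmi = math.log(n_ab * n_windows / (single[a] * single[b]))
        if pmi > 0:  # assumption: keep only positively associated pairs
            weights[(a, b)] = pmi
    return weights
```

With these definitions, a 3-mer that occurs in every sequence gets a TF-IDF weight of zero, so it contributes nothing to distinguishing one sequence node from another.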
In the CNN path, each embedded sequence is treated as an image from which a spatial feature is extracted through three sets of layers. Each set contains a convolutional layer (16, 32 and 64 filters of size 7×7, 5×5 and 3×3, respectively), a ReLU [34] activation function, a max pooling layer (with a pooling size of 4×4 and a stride of 4), and a batch normalization (BN) layer. In the BiLSTM path, the embedded 3-mer fragments of a sequence (498 tokens in total) are input into the BiLSTM cells one by one, generating a sequential feature. The first dense layer, with 100 hidden neurons, then receives the weighted merge of the two sets of features from the CNN path and the BiLSTM path. The second hidden layer contains 30 hidden neurons. Finally, a softmax layer generates two scores that reflect the likelihood that the input sequence is viral or not. The merging weights are two sets of trainable parameters that are fine-tuned during training. All of these parameters are updated by the Adam [33] optimizer with a mini-batch size of 200 for 20 epochs and a learning rate of 0.03 to reduce the cross-entropy loss.
To deal with the different lengths of sequences in a metagenome and to avoid the vanishing gradient problem when training the BiLSTM, a sequence longer than 500bp is divided into several non-overlapping sub-sequences of 500bp before being input into the BiLSTM path. If the last part of the sequence is shorter than 500bp, it is zero-padded and treated as a single sub-sequence. All of the sub-sequences are then input to the hybrid deep learning model one after another to obtain their own scores, and the average of these scores is the final score used to decide whether the long query sequence is viral or not.
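The handling of long sequences described above (split into 500bp pieces, pad the last piece, average the per-piece scores) can be sketched as follows. This is a minimal sketch, not the DETIRE code: the function names are hypothetical, the padding symbol `"0"` merely stands in for zero-padding of the embedded input, and `toy_score` is a placeholder for the trained hybrid model.

```python
from typing import Callable

CHUNK = 500
PAD = "0"  # assumption: placeholder symbol standing in for zero-padding


def split_into_chunks(sequence: str, size: int = CHUNK) -> list[str]:
    """Split a sequence into non-overlapping chunks, padding the last one to full size."""
    chunks = [sequence[i:i + size] for i in range(0, len(sequence), size)]
    if chunks and len(chunks[-1]) < size:
        chunks[-1] = chunks[-1].ljust(size, PAD)
    return chunks


def score_sequence(sequence: str, score_chunk: Callable[[str], float],
                   size: int = CHUNK) -> float:
    """Average the per-chunk scores to obtain one score for the whole sequence."""
    chunks = split_into_chunks(sequence, size)
    if not chunks:
        raise ValueError("cannot score an empty sequence")
    return sum(score_chunk(c) for c in chunks) / len(chunks)


def toy_score(chunk: str) -> float:
    """Hypothetical stand-in for the model: fraction of 'A' bases in the chunk."""
    return chunk.count("A") / len(chunk)


print(len(split_into_chunks("A" * 1200)))  # 3
```

Here a 1,200bp query is scored as three 500bp chunks, the last of which contains 200 real bases and 300 padding symbols.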

The accuracies, recalls, precisions and F1 scores of DETIRE, CHEER, PPR-Meta, and DeepVirFinder in classifying viral and non-viral sequences from the CAMI Marine metagenome are compared in Table ??. DETIRE exceeds DeepVirFinder on all four criteria for viral sequences of all lengths. DETIRE is better than PPR-Meta when the length is shorter than 3,000bp. CHEER achieves the best performance in identifying viral sequences longer than 1,000bp, except for the precision for lengths between 1,000bp and 3,000bp. For short sequences (<1,000bp), DETIRE has superior performance. For all lengths, DETIRE achieves the highest accuracy, recall and F1 score among the four methods.
The accuracies, recalls, precisions and F1 scores of DETIRE, CHEER, PPR-Meta, and DeepVirFinder are calculated from the numbers of correctly and incorrectly identified viral and host sequences in the real human gut metagenome dataset (Table ??). DETIRE again performs better than the other three methods in identifying short viral sequences (<500bp). Despite a precision 0.0013 lower than that of CHEER for lengths between 500bp and 1,000bp, DETIRE obtains an accuracy, recall and F1 score that are 0.0043, 0.0117 and 0.0052 higher, respectively. For sequences longer than 1,000bp, CHEER is the best-performing method. For all lengths, DETIRE achieves higher accuracy, recall and F1 score than PPR-Meta and DeepVirFinder. Since DETIRE identified fewer viral sequences longer than 1,000bp than those shorter than 1,000bp, the overall accuracy, recall and F1 score of DETIRE are lower than those of CHEER.
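For reference, the four evaluation criteria above follow from the counts of correctly and incorrectly identified sequences. The sketch below uses the standard definitions and assumes that viral sequences are the positive class; the function name and the counts in the example are hypothetical and are not taken from the paper's tables.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: viral sequences predicted as viral    fp: host sequences predicted as viral
    tn: host sequences predicted as host      fn: viral sequences predicted as host
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
print(classification_metrics(tp=8, fp=2, tn=9, fn=1))
```

Because F1 is the harmonic mean of precision and recall, a method can trail slightly in precision and still have the higher F1 if its recall is sufficiently better.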

The testing times of the four methods on the testing dataset, the CAMI Marine metagenome and the real human gut metagenome are compared in Table ??. The equipment used for the analysis is two Intel Xeon Gold 6226R CPUs with 256GB of memory. On all three datasets, DETIRE has the lowest testing time.
The four models are tested on the testing datasets and compared with DETIRE, as shown in Fig 4. The accuracy of DETIRE in identifying viral sequences exceeds that of BOHEM and FOHEM by 3.62% and 4.39%, respectively, demonstrating the effectiveness of the GCN-based sequence embedding method in DETIRE. DETIRE also outperforms the single CNN-based and BiLSTM-based models.

In this paper, a deep learning-based hybrid model, DETIRE, is proposed to identify viral sequences directly from metagenomes. Encoded by a graph-based embedding method, nucleotide sequences are fed into a CNN path and a BiLSTM path for feature extraction before being classified by a softmax layer. In comparison with three recent viral identification methods, DeepVirFinder, PPR-Meta and CHEER, on the test dataset, the CAMI Marine dataset and a real human gut metagenome, DETIRE performs best in identifying short sequences (<1,000bp). DETIRE is expected to play a significant role in the analysis of natural viral communities because of the huge number of short sequences generated by NGS techniques. DETIRE is also anticipated to support downstream viral analyses such as viral taxonomy and pathogen identification.