DeepCapTail: A Deep Learning Framework to Predict Capsid and Tail Proteins of Phage Genomes

The capsid and tail proteins are considered the main structural proteins of phages and also their footprint, since they exist only in phage genomes. These proteins are known to lack sequence conservation, making them extremely diverse and thus posing a major challenge for identifying and annotating them in genomic sequences. In this study, we aim to overcome this challenge and predict these proteins using deep neural networks with composition-based features. We develop two models trained with k-mer features to predict capsid and tail proteins respectively. Evaluating the models on two different testing sets shows that they outperform state-of-the-art methods with improved F1-scores.

Phages, or bacteriophages, are viruses that infect bacteria. These microorganisms can reproduce through two different life cycles, lysogenic and lytic. In the lysogenic cycle, the phage integrates its genome into the bacterial genome and stays there; the phage becomes part of the bacterial genome and replicates together with the bacterium. In the lytic cycle, the phage enters the bacterial cell, uses its machinery to replicate and produce new phages, and then lyses the cell membrane to disperse into the environment, resulting in the death of the invaded bacterium [1].

Phages are getting increasing attention primarily due to the advent of shotgun metagenomic sequencing. This technology enables comprehensive sampling of the genomes present in a given environmental sample, such as soil and seawater, while circumventing the culturing of the microorganisms, which is both labor intensive and often infeasible [2]. Furthermore, there is increasing interest in characterizing the interactions between phages and their bacterial hosts. Understanding these interactions has important implications, one of which is in combating antibiotic resistance in bacteria: phages can be introduced to infect and kill pathogenic bacteria, or induced into the lytic stage if already integrated into the bacterial genome [3].

However, uncovering viral sequences has been challenging. Despite viruses being the most abundant organisms on earth, with an estimate of more than 10^30 [4], only 8108 complete virus genomes are currently curated at NCBI. Consequently, methods for predicting/annotating viral sequences that rely heavily on reference databases, such as alignment-based methods, are not effective in detecting novel viruses and phages. Indeed, if the input sequence does not align to any sequence in the reference database, it is annotated as an unknown sequence.
This problem is further exacerbated by the lack of well-established taxonomic and phylogenetic relationships among viruses and phages, as they do not have the ribosomal genes that serve as conserved universal markers for phylogenetic classification in other organisms [5].

To overcome these challenges, composition-based methods were introduced to circumvent the requirement of universal or marker genes [6][7][8]. Composition-based methods use the composition of the sequence, such as k-mers, as features to train machine learning models, and then use the trained models to predict the taxonomy of new sequences. Composition-based prediction methods can make a prediction on any sequence even if it does not align to the reference database.

Many studies built machine learning models to classify a given sequence as either viral or nonviral. For example, Feng [...]

[...] the capsid protects the phage genome from degradation by the host enzymes. The capsid also acts as a mediator in infecting bacteria by attaching the phage to its host and enabling its penetration through the host membrane.

These vital roles of capsid and tail proteins motivated Seguritan et al. [7] to develop iVIREONS to predict them. iVIREONS consists of a set of 30 artificial neural network models that use amino acid frequency and isoelectric point as features. However, the set of models can output different predictions for the same input, which makes it challenging for the user to determine the correct prediction. More recently, another machine learning model, VIRALpro, was developed to also predict capsid and tail sequences [8] using an SVM with amino acid frequencies and HMM models. VIRALpro outperforms iVIREONS [8], but is considerably slower since it uses HMMs.

In this study, we built two machine learning models that predict capsid and tail proteins respectively. For this purpose, we used deep neural network models, which are known for exceptional performance that outperforms their predecessors [12]. They have been extensively used in the fields of computer vision and natural language processing [12], and only recently gained attention in the field of genomics [13]. However, to our knowledge, no study has harnessed the power of these models to predict capsid and tail proteins.

We propose two distinct deep neural network models that predict capsid and tail phage proteins respectively. We trained the models using k-mer frequencies as features and examined different k-mer sizes ranging from one to four. We evaluated the models with two test data sets and compared our models with iVIREONS [7] and VIRALpro [8].

We collected all the phage and prophage sequences from Phaster [14]. The Phaster database consists of curated phage and prophage proteins taken from NCBI and the prophage database [15]. This database is publicly available and regularly updated. [...]

The capsid, tail, and nonstructural proteins were annotated similarly to iVIREONS [7] and VIRALpro [8]; that is, the description lines of the fasta files were used to annotate the proteins. [...]

[...] identity of less than 20%, and more than 30% have an identity between 20% and 40%. These best hits were generated by blastp: the testing set, whether the representative or the independent set, is the query, and the training set is the subject.
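The description-based annotation can be sketched as follows. This is a minimal illustration: the keyword lists and the `label_protein` helper are our own assumptions, not the exact curation rules used here or by iVIREONS and VIRALpro.

```python
def label_protein(description):
    """Label a protein by keywords in its FASTA description line.

    The keyword sets below are illustrative assumptions; the actual
    curation rules may differ.
    """
    desc = description.lower()
    if "capsid" in desc or "coat protein" in desc:
        return "capsid"
    if "tail" in desc:
        return "tail"
    return "nonstructural"

# Example FASTA description lines
print(label_protein(">YP_001 major capsid protein [phage T4]"))  # capsid
print(label_protein(">YP_002 tail fiber protein [phage T4]"))    # tail
print(label_protein(">YP_003 DNA polymerase [phage T4]"))        # nonstructural
```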
Table 1 shows the number of sequences used in the training and testing sets for the capsid and tail models. The training and testing sets were selected randomly (see supplemental Figure 1 for details on how we built these sets).

Extraction of k-mer Features

K-mer frequencies of the protein sequences were used as features to train the different deep learning models. Various k-mer sizes were examined, ranging from one to four. [...]

Table 2. Number of features based on the k-mer size. The number of features grows exponentially as the k-mer size increases.
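The k-mer frequency extraction can be sketched in plain Python. This is an illustrative sketch (the function names are ours, and the framework's exact normalization may differ); it also shows how the feature count grows exponentially with the k-mer size.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def kmer_frequencies(seq, k):
    """Return the frequency vector over all 20^k possible k-mers of seq."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = max(len(seq) - k + 1, 1)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous residues
            counts[kmer] += 1
    return [counts[m] / total for m in kmers]

def features_up_to(seq, max_k):
    """Concatenate the frequency vectors for k = 1..max_k."""
    feats = []
    for k in range(1, max_k + 1):
        feats.extend(kmer_frequencies(seq, k))
    return feats

# Feature-vector lengths: k<=1 -> 20, k<=2 -> 420, k<=3 -> 8420, k<=4 -> 168420
print(len(features_up_to("MAKLTAG", 2)))  # 420
```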

For all models, the following parameters were used: "relu" as the activation function, "adam" as the optimizer, and "binary crossentropy" as the loss function. The models were trained for 150 epochs with a batch size of 10. The architectures as well as the parameters used were determined through extensive experimentation. We provide a naming convention for these models in [...]

Accuracy, F1-score, recall, and precision were used to assess the predictions of the trained models. We present the formulas of accuracy, F1-score, recall, and precision in Equations 1, 2, 3, and 4 respectively:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

F1-score = (2 × Precision × Recall) / (Precision + Recall)    (2)

Recall = TP / (TP + FN)    (3)

Precision = TP / (TP + FP)    (4)

where TP is the number of capsid or tail sequences that are classified correctly, TN is the number of nonstructural sequences that are classified correctly, FP is the number of nonstructural sequences that are classified incorrectly (as either capsid or tail), and FN is the number of capsid or tail sequences that are classified incorrectly (as nonstructural).

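Equations 1 to 4 translate directly into code; the following is a small sketch of these standard definitions (the `evaluate` helper is our own, shown for clarity):

```python
def evaluate(tp, tn, fp, fn):
    """Compute Accuracy, F1-score, Recall, and Precision from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1, recall, precision

# Example: 90 capsid sequences predicted correctly, 10 missed,
# 5 nonstructural sequences misclassified as capsid, 95 correct.
acc, f1, rec, prec = evaluate(tp=90, tn=95, fp=5, fn=10)
print(round(acc, 3), round(f1, 3), round(rec, 3), round(prec, 3))
# 0.925 0.923 0.9 0.947
```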
DeepCapTail is a publicly available framework that predicts capsid and tail proteins.

This framework can be downloaded at https://github.com/Dhooha/DeepCapTail. The framework consists of a machine learning project written in Python and uses the scikit-learn library [16]. Figure 4 shows the steps followed to build DeepCapTail: (1) different deep neural network architectures were investigated with different k-mer sizes in order to decide on the most effective architectures as well as k-mer sizes; (2) the most effective deep neural networks were used to train the capsid and tail models using the training set; (3) these models were tested using two distinct testing sets, dubbed the representative and independent testing sets; (4) the best capsid and tail models were selected based on the F1-score. [...]

[...] 10-fold cross-validation took less than 30 minutes using a high-performance computing system: Intel Broadwell processors (2 x E5-2683v4, 2.1 GHz) with 128 GB of memory (2400 MHz) and 32 cores. The F1-score was reported instead of the accuracy because the F1-score is more reliable when the positive and negative classes are imbalanced, which is the case for our data. The next sections show the performance analysis of all the models presented in Figures 5a and 5b.

All the models that use k-mer size ≤ 4 have either the same or lower F1-scores compared to at least one of the models that use lower k-mer sizes, and were therefore not considered further in the remaining study (e.g., Figure 5a indicates that [...]).

Using the Representative Testing Set

The representative testing set includes sequences that are highly identical to the training set. It assesses the capsid and tail models when the input sequence happens to be similar to the training sequences. Figures 6a and 6b show the ROC curves of the capsid and tail models using the representative testing set.

Figure 6a shows that the ROC curves of the different capsid models are extremely similar: their AUCs are between 96% and 97%.
Figure 6b shows that the ROC curves of the tail models are similar as well: their AUCs are between 93% and 97%. The AUCs of the capsid and tail models using the representative testing set are both greater than 90%. A cut-off of 0.5 was used to compute the accuracy, F1-score, recall, and precision of these models; Table 3 shows the results.

Figure 6. ROC curves of capsid and tail models using the representative and independent testing sets. Figures 6a and 6b show the evaluation of the capsid and tail models using the representative testing set. Figures 6c and 6d show the evaluation of the capsid and tail models using the independent testing set.

Table 3 shows that the capsid and tail models performed exceptionally well on the representative testing set (e.g., F1-scores are equal to or higher than 89%). The models that use k-mer size ≤ 2 or ≤ 3 outperformed the models that use k-mer size ≤ 1 (e.g., the F1-score of Capsid 400:200:100:50 using k-mer size ≤ 2 and ≤ 3 is 92% compared to 90% for the same architecture using k-mer size ≤ 1). However, it is difficult to know whether the models that use k-mer size ≤ 3 are better than the models that use k-mer size ≤ 2, since they have the same F1-score. These observations are valid for both the capsid and tail models. The next section shows how these models performed differently on the independent testing set.
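The ROC curves and AUCs above are computed from the models' output probabilities. The following is a minimal pure-Python sketch of that computation (our own illustration, assuming no tied scores; the study's figures were presumably produced with standard library routines):

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    pairs = sorted(zip(scores, labels), reverse=True)  # ties broken arbitrarily
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0]               # 1 = capsid/tail, 0 = nonstructural
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]   # model output probabilities
print(auc(roc_points(labels, scores)))
```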

Using the Independent Testing Set

The independent testing set is another testing set used to assess the capsid and tail models. Contrary to the representative testing set, the independent testing set is substantially different from the training set. It evaluates the capsid and tail models when the input sequence happens to be highly divergent from the training sequences.

Figures 6c and 6d show the ROC curves of the capsid and tail models using the independent testing set. The performance of these models dropped, which is expected because the sequences of the independent testing set are highly divergent from the sequences of the training set (e.g., the AUC of 'Capsid 400:200:100:50, ≤ 3' dropped from 97% to 81%, and the AUC of 'Tail 600:300:150:60, ≤ 3' dropped from 93% to 82%).

Both the capsid and tail AUCs are above 80%. We computed the accuracy, F1-score, recall, and precision of these models using a cut-off of 0.5; Table 4 shows the results. Contrary to the results on the representative testing set, it is easier to distinguish the best models for capsid and tail prediction using the independent testing set. The best capsid model is 'Capsid 400:200:100:50, ≤ 3', with an accuracy and F1-score of 76% and 66% respectively. The best tail model is 'Tail 600:300:150:60, ≤ 3', with an accuracy and F1-score both equal to 76%. We compare our best models with state-of-the-art prediction programs in the next section.

Comparison with State-of-the-Art Using Two Different Testing Sets

We compare our best capsid model, 'Capsid 400:200:100:50, ≤ 3', and our best tail model, 'Tail 600:300:150:60, ≤ 3', with two state-of-the-art programs, iVIREONS [7] and VIRALpro [8]. To this end, the representative and independent testing sets were used, and the results are shown in Figure 7. [...] iVIREONS [7] and VIRALpro [8] with both the representative and independent testing sets. Using the representative testing set, our capsid model 'Capsid 400:200:100:50, ≤ 3' has an F1-score of 92% compared to 0% and 40%, and our tail model 'Tail 600:300:150:60, ≤ 3' records an F1-score of 92% compared to 73% and 0.03%, for iVIREONS and VIRALpro respectively.

The successful prediction of most of the sequences in the representative testing set is expected, as our models were trained on similar sequences. However, iVIREONS and VIRALpro were trained on sequences that are different from the representative testing set. Supplemental Figure 2 shows that more than 60% of the training sets of these two models have an identity of less than 40% to the representative testing set, which may be the reason why they were unable to perform as well as our models.

Using the independent testing set, the performance of our models dropped, but they still performed better than iVIREONS and VIRALpro. Our capsid model 'Capsid 400:200:100:50, ≤ 3' outperforms iVIREONS and VIRALpro with an F1-score of 67% compared to 0% and 57% respectively, and our tail model 'Tail 600:300:150:60, ≤ 3' achieves an F1-score of 76% compared to 67% and 72% for iVIREONS and VIRALpro respectively.

The performance of our models dropped with the independent testing set because the testing set is substantially different from the sequences used to train the models.

The independent testing set is also different from the training sets used by both iVIREONS and VIRALpro.

Supplemental Figure 3 shows that more than 70% of the training sets used by these models have an identity of less than 40% to this testing set. This means all of these models were tested on sequences that are different from their training sequences, and for this reason, we consider the independent testing set unbiased compared to the representative testing set.

Comparison Using Viral Metagenomic Data

We compare the performance of 'Capsid 400:200:100:50, ≤ 3' and 'Tail 600:300:150:60, ≤ 3' with iVIREONS [7] and VIRALpro [8] using viral metagenomic data. We use the same viral metagenomic data that were employed by VIRALpro [8] to assess their capsid and tail predictors. These data consist of five different metagenomic datasets with no homology to known proteins. The five datasets do not have any capsid or tail annotation, which is why we cannot compute F1-scores on them. We present relevant details about these datasets in Table 5 (more details on these datasets can be found in [8]).

Figure 8 presents the Venn diagram of capsid and tail predictions using the metagenomic dataset of Oresund Struct. As detailed in Table 5, this dataset is identified as structural viral proteins by MS-based proteomics [8]. For tail prediction, our model as well as iVIREONS and VIRALpro agreed on the prediction of most of the tail sequences: 33 sequences were identified by all these models as tail sequences.
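The agreement counts behind these Venn diagrams reduce to set operations over the predicted sequence identifiers; a short sketch with hypothetical IDs (the sets below are invented for illustration, not the study's data):

```python
# Hypothetical sequence IDs predicted as "tail" by each program
deepcaptail = {"seq1", "seq2", "seq3", "seq4", "seq6"}
ivireons = {"seq1", "seq2", "seq3", "seq5"}
viralpro = {"seq1", "seq2", "seq4", "seq5"}

# Sequences all three programs agree on (center of the Venn diagram)
all_three = deepcaptail & ivireons & viralpro
print(sorted(all_three))  # ['seq1', 'seq2']

# Sequences predicted only by DeepCapTail
only_ours = deepcaptail - ivireons - viralpro
print(sorted(only_ours))  # ['seq6']
```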

However, for capsid prediction, our model and iVIREONS and VIRALpro agreed on the prediction of only two capsid sequences. It is difficult to know whether these predictions are correct, since we do not have the annotation of capsid and tail proteins for this metagenomic dataset. The Venn diagrams of the capsid and tail predictions for the four remaining metagenomic datasets are shown in the supplemental material.

Conclusion

We proposed the deep learning models 'Capsid 400:200:100:50, ≤ 3' and 'Tail 600:300:150:60, ≤ 3', which predict capsid and tail proteins of phages. We evaluated these models using two different testing sets. Our models outperformed the state-of-the-art iVIREONS and VIRALpro, which suggests that our models are more accurate in prediction. We also compared the performance of our models, iVIREONS, and VIRALpro using five different viral metagenomic datasets. All of these models agreed on the annotation of some of the capsid and tail proteins; however, it is difficult to assess the accuracy of these models since the correct answer is not known.