VIGA: a sensitive, precise and automatic de novo VIral Genome Annotator

Viral (meta)genomics is a rapidly growing field of study that is hampered by an inability to annotate the majority of viral sequences; therefore, the development of new bioinformatic approaches is very important. Here, we present a new automatic de novo genome annotation pipeline, called VIGA, to annotate prokaryotic and eukaryotic viral sequences from (meta)genomic studies. VIGA was benchmarked on a database of known viral genomes and a viral metagenomics case study. VIGA generated the most accurate outputs according to the number of coding sequences and their coordinates, outputs also had a lower number of non-informative annotations compared to other programs.


34
Virology is a diverse scientific discipline. While many researchers are interested in discovering and 35 characterising pathogenic eukaryotic viruses [1], recently there has been an increased interest in 36 revealing bacteria-and archaea-infecting viral communities [2]. The number of viral metagenomic 37 studies is increasing due to the development of new sequencing technologies and the reduction in 38 costs. However, due to the volume of information that these platforms generate and the large 39 proportion of viral sequences sharing little or no homology to known viral genomes ('viral dark 40 matter', [3]), new bioinformatic tools are required to examine viral contigs and genomes [4]. 41 42 Viral annotation methods differ depending on the host organism. Bacteriophages and archaeal 43 viruses are annotated using prokaryotic genome annotation software or web-servers such as RAST 44 [5], Prokka [6] and RASTtk [7]. However, these bioinformatic tools are optimised for bacterial 45 sequences, not viruses (despite the improvements in RASTtk to annotate phage sequences [8]). In 46 contrast, eukaryotic viruses are annotated using close-reference based methods such as FLAN [9], 47 VIGOR [10] and ViPR [11]. In a similar way, VirSorter [12] and VirusSeeker [13] were designed to 48 predict putative prokaryotic viral contigs in metagenomic datasets. However, both programs predict 49 viral contigs according to the presence of viral proteins using reference databases, and close-50 reference homology-based methods can underestimate true viral diversity due to database 51 limitations [3,14]. Therefore, in this manuscript, we present a new modular and automatic de novo 52 genome annotation bioinformatic pipeline, called VIGA (VIral Genome Annotator), to annotate 53 viral sequences. 54

55
VIGA automatically detects open reading frames from a FASTA or multi-FASTA formatted file. 56 VIGA then annotates protein sequences by detecting homologues in a BLAST ("Slow") or a 57 DIAMOND ("Fast") protein database, with or without Hidden Markov Model (HMM) protein 58 detection against a protein database. The different methodologies for annotating viral contigs and 59 genomes allows the user to specify options that sacrifice annotation detail in exchange for increased 60 speed, which is required when dealing with larger metagenomic datasets. In addition, VIGA also 61 automatically detects (1) the topology of viral contigs, (2) the presence of rRNA, tRNA and tmRNA 62 sequences, (3) potential CRISPR repeats and (4) tandem or inverted repeat sequences. Finally, 63 VIGA outputs a FASTA file that includes user specified modifiers, a GenBank file and a five-64 column tab-delimited feature file to ease the upload of annotated contigs and genomes to various 65 database repositories and genome visualisation platforms. 66

68
Benchmarking of VIGA 69 Firstly, the accuracy and the precision of the number of viral coding sequences were estimated using 82 general linear models. Accuracy was measured by the slope, and precision was measured according 83 to the coefficient of determination (R 2 ). To compare all these linear models, analysis of covariance 84 (ANCOVA) was performed. In a general overview, the programs delivered different estimates of the 85 number of coding sequences (ANCOVA: p < 2×10 -16 ). In fact, although all programs tended to 86 overestimate the number of genes, VIGA provided the most accurate predictions (i.e. accuracy is 87 closest to one, Fig. 1A). Moreover, VIGA and Prokka had very similar values of precision (Table 1). 88 When compared according to viral host, similar results were found only in the case of eukaryotic considered, only RASTtk tended to overestimate the number of coding sequences (Table 1). 92 93 Secondly, F 1 score, a measure that combines precision and sensitivity, was used to predict the 94 quality of the coordinates of the viral coding sequences. Moreover, to evaluate the occurrence of 95 false positives (i.e. false coordinates considered as true; type I error) and false negatives (i.e. true 96 coordinates considered as false; type II error), false discovery rate (FDR) and false negative rate 97 (FNR) were examined. VIGA scored very highly for both bacteriophages and eukaryotic viruses. In 98 eukaryotic viruses the highest false discovery rate (FDR) was associated with RASTtk, while RAST 99 had the highest false negative rate (FNR). For bacteriophages the highest FDR and FNR were 100 obtained for Prokka. In the case of archaeal viruses, VIGA again had the highest precision, while 101 the highest sensitivity was noted in RASTtk (Table 2). 102 103 Finally, the power of prediction of all programs was measured by considering the number of non-104 informative annotations (i.e. all proteins classified as "hypothetical protein", "uncharacterized 105 protein", "ORF", "predicted protein", "unnamed product protein" or "gp[Number]"). For these 106 analyses, two different modes of running VIGA were considered -"Slow" (when BLAST and 107 HMMER are used to annotate the genes) and "Fast" (when DIAMOND alone is used for 108 annotation). Kruskal-Wallis (KW) test was performed to detect potential differences in the power of 109 prediction of all three programs (including both variants of VIGA) and significant differences 110 between the various programs were observed (KW test: p = 1.683×10 -53 ). In all cases, no significant 111 differences between VIGA-Slow and VIGA-Fast were found (Nemenyi test: p = 0.853). In fact, 112 while RAST and RASTtk had the highest number of non-informative annotations, both VIGA 113 modes had the smallest number ( Fig. 2A). Additionally, there were significant differences among 114 programs independently of the viral type (Table 3). In all cases, VIGA achieved optimal annotation, 115 having always the smallest number of non-informative annotations. In contrast, Prokka had the 116 highest amount of non-informative annotations in prokaryotic viruses (Figs. 2B-C) and RAST and 117 RASTtk had the highest amount of non-informative descriptions in eukaryotic viruses (Fig. 2D). programs to annotate the viral sequences and never before have these programs been benchmarked 219 against each other. In this study, we present VIGA, a new automatic de novo genome annotation 220 bioinformatic pipeline to annotate prokaryotic and eukaryotic viral sequences from genomic and 221 metagenomic studies. VIGA allows the most accurate, precise and sensitive annotation of viral 222 genomes when benchmarked using 138 known viral genomes. VIGA can be executed using BLAST 223 or DIAMOND to annotate proteins according to homology, with the option to also use HMMER to 224 improve these annotations based on HMMs. The use of HMMs will enrich the annotation detail of 225 the viral contigs, but will decrease the speed of the program. Where increased speed is required for 226 example when dealing with larger metagenomics datasets. 227 228

Workflow of the software 230
Overview. VIGA is an automatic de novo viral genome annotator implemented in Python 2.7 231 (requiring Biopython [45]) and designed to annotate complete and draft viral and phage genomes 232 comprising single or multiple contigs (Fig. 4). As an input, VIGA accepts a DNA FASTA file with 233 the (putative) viral contigs. These sequences are processed to predict the topology of the contigs 234 (i.e. circular or linear). If the contig is circular, the prediction of the origin of replication is 235 performed according to cumulative GC skew and realignment of the contig from the putative origin 236 of replication. Coding sequences (CDS) are predicted and, then, the function of these proteins is 237 inferred based on homology using BLAST [46] or DIAMOND [47] and, optionally, using Hidden 238 Markov Models (HMMER [48]). After that, a decision tree algorithm chooses the most reliable 239 description of the protein (Fig. 5) (Fig. 4) annotations are detected searching for the expressions "hypothetical protein", "uncharacterized 280 protein", "ORF", "predicted protein", "unnamed product protein" or "gp[Number]" in their BLAST 281 and HMMER descriptions. If such a description is present in both proteins, the protein will be 282 described as "hypothetical protein". However if the "hypothetical protein" description is only 283 present in BLAST, the consequent annotation retrieved by HMMER is considered as a valid one, 284 and vice versa. In the scenario where the protein is not labelled as "hypothetical protein" in either 285 BLAST or HMMER, it is checked if the percentage identity and coverage is higher in BLAST or in 286 HMMER. Depending of these results, BLAST output or HMMER output is chosen accordingly 287 (Fig. 5). Output files. After running all described steps, all saved information (contig shape, contig sequence, 301 protein coordinates, protein sequences, protein descriptions, rRNA and tRNA coordinates, tRNA 302 descriptions, and tandem and inverted repeats coordinates) is written to a GenBank file.
Additionally, the GenBank file is also converted to FASTA and TBL files after retrieving the 304 metadata from a plain text file. The FASTA and the TBL files are suitable for GenBank submission. 305 Optionally, a GFF file can also be created with this information. to infer the accuracy and the precision of the number of viral coding sequences, general linear 319 models were used. All linear models were forced to have intercept zero. The slope was used to 320 measure the accuracy, while the R 2 was used to measure the precision. Additionally, ANCOVA was 321 used to compare the linear models. Secondly, the prediction quality of the coordinates of the viral 322 coding sequences was evaluated by the F 1 score, the precision and sensitivity, defined as 323 test and post-hoc tests using Nemenyi tests were performed as described before. Moreover, to 353 discard the effect of the length size of contigs as a potential factor of the power of prediction, 354 Kruskal-Wallis tests were performed after classifying the contigs in three groups: "short" (<15 kb), 355 "medium" (15 -70 kb), and "long" (>70 kb