Predicting stop codon reassignment improves functional annotation of bacteriophages

The majority of bacteriophage diversity remains uncharacterised, and new intriguing mechanisms of their biology are being continually described. Members of some phage lineages, such as the Crassvirales, repurpose stop codons to encode an amino acid by using alternate genetic codes. Here, we investigated the prevalence of stop codon reassignment in phage genomes and subsequent impacts on functional annotation. We predicted 76 genomes within INPHARED and 712 vOTUs from the Unified Human Gut Virome catalogue (UHGV) that repurpose a stop codon to encode an amino acid. We re-annotated these sequences with modified versions of Pharokka and Prokka, called Pharokka-gv and Prokka-gv, to automatically predict stop codon reassignment prior to annotation. Both tools significantly improved the quality of annotations, with Pharokka-gv performing best. For sequences predicted to repurpose TAG to glutamine (translation table 15), Pharokka-gv increased the median gene length (median of per genome medians) from 287 to 481 bp for UHGV sequences (67.8% increase) and from 318 to 550 bp for INPHARED sequences (72.9% increase). The re-annotation increased mean coding density from 66.8% to 90.0%, and from 69.0% to 89.8% for UHGV and INPHARED sequences, respectively. Furthermore, the proportion of genes that could be assigned functional annotation increased, including an increase in the number of major capsid proteins that could be identified. We propose that automatic prediction of stop codon reassignment before annotation is beneficial to downstream viral genomic and metagenomic analyses.


Main Body
Bacteriophages, hereafter phages, are increasingly recognised as a vital component of microbial communities in all environments where they have been studied in detail. Phages are known to drive bacterial evolution and community composition through predator-prey dynamics and their potential as agents of horizontal gene transfer. The use of viral metagenomics, or viromics, has massively expanded our understanding of global viral diversity and shed light on the ecological roles that phages play. Much of the study into viral communities has been conducted on the human gut. Here, viromics has uncovered ecologically important viruses that are difficult to bring into culture using standard laboratory techniques 1, shown potential roles of viruses in disease states 2, and allowed for the recovery of enormous phage genomes larger than any brought into culture 3.

As the majority of phage diversity remains uncharacterised, new and enigmatic diversification mechanisms are being described continually, including the potential use of alternative translation tables. Lineage-specific stop codon reassignment has been described previously in bacteriophages 4,5, whereby a stop codon is repurposed to encode an amino acid. Notably, annotations of Lak "megaphages" assembled from metagenomes were observed to exhibit unusually low coding density (~70%) when genes are predicted using the standard bacterial, archaeal and plant plastid genetic code (translation table 11) [...] pyrodigal-gv library to provide efficient Cython bindings to Prodigal-gv with pyrodigal 9. Additionally, the virus discovery tool geNomad incorporates pyrodigal-gv to predict stop codon reassignment for viral sequences identified in metagenomes and viromes 10. However, the detection of translation table 15 still has limited support in many tools, and the impacts of stop codon reassignment are rarely considered in viral genomics and metagenomics.

To assess the extent of stop codon reassignment in studied phage genomes and the impacts on functional annotation, we extracted phage genomes from INPHARED 6 and predicted those using alternative stop codons. We also added high-quality and complete vOTUs from the Unified Human Gut Virome Catalog (UHGV; https://github.com/snayfach/UHGV) predicted to use alternative codons. The viral genomes were re-annotated using modified versions of the commonly used annotation pipelines Prokka 11 and Pharokka 12, implementing prodigal-gv/pyrodigal-gv for gene prediction (Supplementary Methods). Hereafter, the modified versions are referred to as Prokka-gv and Pharokka-gv.

From INPHARED, 49 genomes (0.24%) were predicted to use translation table 15, and 27 (0.13%) were predicted to use translation table 4. From the UHGV, 666 vOTUs (1.2%) were predicted to use translation table 15 and 46 (0.08%) were predicted to use translation table 4. These genomes and vOTUs were not constrained to one particular clade of viruses, being predicted to occur in both dsDNA viruses of the realm Duplodnaviria and ssDNA viruses of the realm Monodnaviria, suggesting that stop codon reassignment is a phenomenon that has arisen on at least two occasions (Supplementary Table 1). The lower frequency of these genomes in cultured isolates (INPHARED) versus human viromes (UHGV) may be due to culturing and sequencing biases, perhaps including modifications to DNA that are known to be recalcitrant to sequencing.

Although the mechanism for stop codon reassignment in phages is not fully understood, suppressor tRNAs are suggested to play a role 4,13 [...] Although fewer of those predicted to use translation table 4 encoded the relevant suppressor tRNA, 22/27 (81%) of the INPHARED phages predicted to use translation table 4 were viruses of Mycoplasma or Spiroplasma. As Mycoplasma and Spiroplasma are known to use translation table 4, many of the viruses predicted to use translation table 4 may simply be using the same translation table as their host.

Prediction of stop codon reassignment led to improved annotations for both Prokka and Pharokka, although the extent of this varied with the two datasets, translation tables, and annotation pipelines tested. As Pharokka-gv outperformed Prokka-gv on all metrics tested, only Pharokka-gv is discussed further, and the equivalent results for Prokka-gv can be found in Supplementary Results. The largest differences were observed for sequences predicted to use translation table 15, for which Pharokka-gv increased the median gene length (median of per genome medians) from 287 to 481 bp for UHGV sequences (67.8% increase) and from 318 to 550 bp for INPHARED sequences (72.9% increase; Figure 1A). This was also reflected in an increase of median coding capacity from 66.8% to 90.0% for UHGV, and from 69.0% to 89.8% for INPHARED (Figure 1B). Overall, these improved gene calls led to an increased gene length, and a reduction in both the number of predicted genes per kb and the number of genes that could not be assigned functional annotations (Supplementary Figure 2; Supplementary Table 2).

Figure 1. Re-annotating with predicted stop codon reassignment increases the quality of [...]

As the major capsid protein (MCP) is commonly used as a phylogenetic marker for bacteriophages, we investigated how often it could be identified with and without predicted stop codon reassignment 15. For those viruses we predicted to use translation table 15, annotation using the default translation table 11 identified the MCP in only 407/715 (56.9%) of the genomes. In contrast, using translation table 15 with Pharokka-gv, we could identify the MCP in 475/715 (66.4%).

When investigating the sequences for which translation table 4 was predicted to be optimal, a substantial increase was also observed for UHGV sequences, with Pharokka-gv increasing median gene length (median of per genome medians) from 350 to 518 bp (a 48.0% increase in length; Figure 1A), resulting in an increase of coding capacity from 78.0% to 90.4% (Figure 1B). However, the same was not observed for the 27 INPHARED genomes predicted to use translation table 4. Reannotation resulted in a modest increase in median gene length (median of per genome medians) from 573 to 588 bp (a 2.6% increase in length; Figure 1A). Median coding capacity was not increased, with both Pharokka and Pharokka-gv obtaining 89.1% (Figure 1B). As the median gene length and coding capacity for INPHARED sequences predicted to use translation table 4 are in line with expected values, their prediction may be a false positive. Reassuringly, the prediction of translation table 4 has not hindered the quality of annotations where it may be a false positive.

The analysis of viral (meta)genomes relies on accurate protein predictions, with predicted ORFs being used in common analyses, including (pro)phage prediction, functional annotation, and phylogenetic analyses. The clear differences in protein predictions with and without predicted stop codon reassignment will likely have downstream impacts upon these analyses. However, this phenomenon is not yet widely considered in viral (meta)genomics. We have demonstrated the impacts of stop codon reassignment in the functional annotation of phages, and provide tools for the automatic prediction and annotation of viral genomes that repurpose stop codons. Our analysis highlights the need for accurate viral ORF prediction, and further experimental validation to elucidate the mechanisms of stop codon reassignment.
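To make the reported metrics concrete, the following is a minimal Python sketch of why reading a reassigned stop codon as a stop truncates gene calls, and of how median gene length (median of per genome medians), coding density and percentage increase might be computed from gene coordinates. It is an illustration only and not the method used by Prodigal-gv, Pharokka-gv or Prokka-gv. All function names are hypothetical, the coding density definition (percentage of genome positions covered by at least one predicted gene) is an assumption, and the stop codon sets are those of NCBI translation tables 11, 15 and 4.

```python
from statistics import median

# Stop codons of the NCBI translation tables discussed in the text.
STOP_CODONS = {
    11: {"TAA", "TAG", "TGA"},  # standard bacterial, archaeal and plant plastid code
    15: {"TAA", "TGA"},         # TAG is read as glutamine
    4: {"TAA", "TAG"},          # TGA is read as tryptophan
}


def orf_length(seq, table):
    """Length in bp from the first codon up to and including the first in-frame stop."""
    stops = STOP_CODONS[table]
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return i + 3
    # No in-frame stop found: the reading frame runs to the last complete codon.
    return len(seq) - len(seq) % 3


def coding_density(genes, genome_length):
    """Percentage of positions covered by at least one gene.

    `genes` holds 1-based inclusive (start, end) pairs; overlaps are counted once
    (assumed definition, hypothetical helper).
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(genes):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / genome_length


def median_of_medians(genomes):
    """Median across genomes of each genome's median gene length in bp."""
    return median(
        median(end - start + 1 for start, end in genes) for genes in genomes
    )


def percent_increase(before, after):
    """Relative increase from `before` to `after`, as a percentage."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    toy_gene = "ATGAAATAGCCCGGGTAA"  # contains an in-frame TAG
    print(orf_length(toy_gene, 11), orf_length(toy_gene, 15))
    print(round(percent_increase(350, 518), 1))
```

In the toy sequence, the in-frame TAG ends the gene call early under translation table 11 but is read through under translation table 15, which mirrors the shorter genes and lower coding density seen when reassignment is ignored.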