Abstract
Since the mid-20th century, prokaryotic double-stranded DNA viruses producing tailed particles (“tailed phages”) have been grouped according to virion tail morphology. In the early 1980s, these viruses were classified into the families Myoviridae, Siphoviridae, and Podoviridae, later included in the order Caudovirales. However, recent massive sequencing of prokaryotic virus genomes revealed that caudovirads are extremely diverse. The official taxonomic framework does not adequately reflect caudovirad evolutionary relationships. Here, we reevaluate the classification of caudovirads using a particularly challenging group of viruses with large dsDNA genomes: SPO1-like viruses associated with the myovirid subfamily Spounavirinae. Our extensive genomic, proteomic, and phylogenetic analyses reveal that some of the currently established caudovirad taxa, especially at the family and subfamily rank, can no longer be supported. Spounavirins alone need to be elevated to family rank and divided into at least five major clades, a first step in an impending massive reorganization of caudovirad taxonomy.
Introduction
Prokaryotic virus taxonomy is the formal responsibility of the Bacterial and Archaeal Viruses Subcommittee of the International Committee on Taxonomy of Viruses (ICTV). In recent years, the Subcommittee has focused on classifying newly described double-stranded DNA viruses producing tailed particles (“tailed phages”) into species and genera included in existing families in the order Caudovirales [1–5]. At the species rank, a similarity threshold of 95% nucleotide sequence identity is used for classification, i.e. viruses are classified into the same species if they share >95% identity over the entire length of their genomes. Examination of wild populations of caudovirads infecting cyanobacteria demonstrated that such a 95% threshold robustly captures formally delineable species using population genetic metrics [6]. Extrapolation of such thresholds to viruses of the global surface oceans has resulted in population-scale ecological understanding of thousands of new virus “species” [7], which, however, do not have official status. At the genus rank, cohesive groups have been defined for viruses sharing significant genome similarity (≥60% nucleotide identity), gene synteny, and a core gene set. This framework has helped to relatively rapidly establish official low-rank taxonomic positions for newly isolated and sequenced viruses [1,2,5,8,9].
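The two identity thresholds quoted above can be illustrated with a minimal sketch. The function name `shared_rank` is hypothetical, and the sketch deliberately ignores the gene synteny and core gene set criteria that also enter genus demarcation, so it should be read as a simplification rather than as the Subcommittee's procedure.

```python
def shared_rank(nucleotide_identity_pct: float) -> str:
    """Return the lowest rank two genomes could share, judged on identity alone.

    Thresholds follow the text: >95% for species, >=60% for genus.
    Synteny and core-gene criteria are NOT modelled here (simplifying assumption).
    """
    if not 0.0 <= nucleotide_identity_pct <= 100.0:
        raise ValueError("identity must be a percentage between 0 and 100")
    if nucleotide_identity_pct > 95.0:
        return "species"
    if nucleotide_identity_pct >= 60.0:
        return "genus"
    return "above genus"
```

Note that exactly 95% identity falls on the genus side of the boundary, because the species criterion is stated as strictly greater than 95%.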
The currently available ranks in virus taxonomy (in ascending order species, genus, subfamily, family, and order) limit the description of the full diversity of prokaryotic viruses. This limitation is particularly acute in the case of the order Caudovirales, which represents the most abundant and diverse group of viruses in the environment [10–12]. Indeed, the diversity of caudovirads greatly surpasses that of other bacterial, archaeal and eukaryotic double-stranded (ds)DNA viruses. A recent analysis of the dsDNA virosphere using a bipartite network approach, whereby viral genomes are connected via shared gene families, demonstrated that the global network of dsDNA viruses consists of at least 19 modules, 11 of which correspond to caudovirads [13]. The eight remaining modules encompass one or more families of eukaryotic or archaeal viruses. Consequently, each of the caudovirad modules could be considered to represent a separate family. Despite this remarkable sequence diversity, all caudovirads are currently classified into three families—Myoviridae, Podoviridae, and Siphoviridae. These families were historically established on the morphological, not genomic, features of their members, forming an artificial classification ceiling. In this study, the Subcommittee explored the diversity of the order Caudovirales based on the example of a large group of Bacillus phage SPO1-related viruses (Myoviridae: Spounavirinae), which forms a discrete caudovirad module [13,14].
The type virus of this group, Bacillus phage SPO1, was isolated in 1964 from soil in Osaka, Japan, using Bacillus subtilis as a host [15]. Bacillus phage SPO1 has been extensively studied ever since. Morphologically, the particle possesses an icosahedral head 87 nm in diameter and a 140-nm long (18.6-nm wide) contractile tail of “stacked-disk” appearance that is terminated by a complex baseplate structure [16,17]. The packaged genome is a 145.7-kb long, terminally redundant (13.1-kb) DNA molecule with thymine completely replaced by 5-hydroxymethyluracil (HMU) [18–20]. This genome encodes at least 204 proteins and five tRNAs [21].
In 1995, Bacillus phage SPO1 was assigned to the species Bacillus phage SP8 (renamed Bacillus phage SPO1 in 1996 and Bacillus virus SPO1 in 2015), which in 1996 was included in a monospecific myovirid genus currently called Spo1virus [1,22,23]. The subfamily “Spounavirinae” was proposed in 2009 by Lavigne et al. to harbor Bacillus phage SPO1, Staphylococcus phage Twort, Staphylococcus phage K, Staphylococcus phage G1, Listeria phage P100, and Listeria phage A511 [4]. This subfamily became official in 2012 [24]. The unifying characteristics of members of this subfamily as described by Klumpp et al. are that “(a) the[ir] host organisms are bacteria of the phylum Firmicutes; (b) [they] are strictly virulent myovirids; (c) all… feature common morphological properties; (d) [their genomes] consist of a terminally redundant, non-permuted dsDNA molecule of 127–157 kb in size; and (e) [they] share considerable amino acid homology” [20]. The inclusion requirement of a strictly lytic lifestyle became controversial when it was observed that a few related viruses (Bacillus phages Bcp1, Bp8p-T, and Bp8p-C) can persist in host cultures without causing immediate lysis [25,26]. It remains unknown whether this persistence is due to virion entrapment inside bacillus spores or some other kind of semi-stable virus-host relationship.
By 2015, the subfamily Spounavirinae had been expanded to include five genera (Kayvirus, P100virus, Silviavirus, Spo1virus, and Twortvirus) and three unassigned or “floating” species (Enterococcus virus phiEC24C, Lactobacillus virus Lb338-1, and Lactobacillus virus LP65). It was already clear that Bacillus virus SPO1 represented one of two major lineages within the subfamily. Consequently, spounavirins were divided into the Bacillus phage SPO1-like viruses, which have modified DNA (HMU) and produce particles possessing generally shorter tails, and the Staphylococcus phage Twort-like viruses, which feature non-modified DNA and produce particles with longer tails. Both Bacillus phage SPO1-like and Staphylococcus phage Twort-like virus particles feature double-ringed baseplates and visible capsomeres as a morphologic hallmark [20,27]. A third group, represented by Bacillus phages sharing limited similarity with Bacillus phage SPO1 but related to phage Bastille, had already been proposed at that time but had not yet been officially recognized [28]. In 2016, additional genera of Bacillus phage SPO1-like viruses (henceforth “spouna-like viruses”) were established, but not all of these were immediately included in the subfamily Spounavirinae [1,2].
In this study, we reevaluated the classification of spounavirins and spouna-like viruses. To this end, a well-defined set of 93 viruses was analyzed using complementary DNA and protein sequence analysis tools and phylogenetic methods. Our results indicate that the subfamily Spounavirinae fails to adequately reflect the diversity of its current members, and we therefore outline a better fitting classification scheme.
Results
General overview
To determine the phylogenetic relationship between 93 known and alleged spounavirins, we employed genomic, proteomic and marker gene-based comparative strategies. Regardless of the phylogenetic approach applied, five separate, clear-cut clusters were identified. We believe that they clearly have a common origin and ought to come together under one caudoviral umbrella taxon. We propose to name this taxon “Herelleviridae,” in honor of the 100th anniversary of the discovery of prokaryotic viruses by Félix d’Hérelle (Table 1, Figs 1-3, S1 Table). The first cluster (here suggested to retain the name Spounavirinae) groups Bacillus-infecting viruses that are similar to Bacillus phage SPO1. The second cluster (“Bastillevirinae,” named after the type species Bacillus virus Bastille [28]) includes Bacillus-infecting viruses that have only limited similarity to Bacillus phage SPO1 and resemble Bacillus phage Bastille instead. The third cluster (“Brockvirinae,” named in honor of Thomas D. Brock [1926–], an American microbiologist and educator known for his discovery of hyperthermophiles, who worked on Streptococcus phages early in his career) comprises currently unclassified viruses of enterococci that are similar to Enterococcus phage ϕEF24C. The fourth cluster (“Twortvirinae,” named in honor of Frederick William Twort (1877–1950), the English bacteriologist who discovered prokaryotic viruses in 1915) gathers staphylococci-infecting viruses that are similar to Staphylococcus phage Twort, whereas the remaining cluster (“Jasinskavirinae,” named in honor of Stanislawa Jasińska-Lewandowska (1921–1998), a Polish scientist who was one of the first to study Listeria and their viruses) consists of viruses infecting Listeria that are similar to Listeria phage P100, the type isolate of the P100virus genus. The classification in five clusters left three viruses unassigned at this rank: Lactobacillus phage Lb338, Lactobacillus phage LP65, and Brochothrix phage A9.
Clustering was performed using nucleotide similarities (BLASTn, A) or translated nucleotide similarities (tBLASTx, B). Genomes were compared in a pairwise fashion using Gegenees, transformed into a distance matrix, clustered using R and visualized as trees using Itol. The trees were rooted at Brochothrix phage A9. Genera are delineated with colored squares and suggested subfamilies with colored circles.
Clustering was performed using the Phage Proteomics Tree approach (A) and proteomic distance (B). Distances were calculated pairwise between all sets of predicted proteomes, clustered with R and visualized using Itol. The trees were rooted at Brochothrix phage A9. Genera are delineated with colored squares and suggested subfamilies with colored circles.
Amino acid sequences were aligned with Clustal Omega and trees were generated using FastTree maximum likelihood with Shimodaira-Hasegawa tests. The scale bar represents the number of substitutions per site. The trees were rooted at Brochothrix phage A9. Genera are delineated with colored squares and suggested subfamilies with colored circles.
Suggested new classification of the 93 spounavirins and spouna-like viruses in the new caudoviral family “Herelleviridae”.
These robust clusters can be further subdivided into smaller clades that correspond well with the currently accepted genera. The evidence supporting this suggested taxonomic re-classification is presented in the following sections.
Genome-based analyses
BLASTn analysis revealed that the genomes of several viruses were similar enough to consider them strains of the same species (they shared >95% nucleotide identity, S1 Fig). The Staphylococcus viruses fell into four distinct, yet closely related groups corresponding to the established genera Twortvirus, Sep1virus, Silviavirus, and Kayvirus (S1 Fig). With the exception of Enterococcus phage EFDG1, all Enterococcus viruses clustered as a clade representing a new genus (here suggested to be named “Kochikohdavirus” after the place of origin of the type virus of the clade, Enterococcus phage ϕEF24C; [29,30]). The Bacillus viruses clustered into the established genera Spo1virus, Cp51virus, Bastillevirus, Agatevirus, B4virus, Bc431virus, Nit1virus, Tsarbombavirus, and Wphvirus, with three species remaining unassigned at the genus rank (Table 1). These results were also confirmed with VICTOR, a genome-BLAST distance phylogeny (GBDP) method (S2 Fig) and the Dice score (S3 Fig) [31], a tBLASTx-based measure that compares whole genome sequences at the amino acid level.
The patterns coalesced at a higher taxonomic level when the genomes were analyzed using tBLASTx (S4 Fig). The Enterococcus viruses clustered into a single group sharing 41% genome identity, whereas the Bacillus viruses fell into two major groups, a group combining the genera Spo1virus and Cp51virus, and the remainder. All Staphylococcus viruses clustered above ≈36% genome identity, whereas Listeria viruses grouped with more than 79% genome identity. Overall, all these genomes were related at the level of at least 15% genome identity. Lactobacillus and Brochothrix viruses remained genomic orphans, peripherally related to the remainder of the viruses in this assemblage.
Predicted proteome-based analyses
The virus proteomic tree shows four robust groupings mainly determined by the hosts that the viruses infect, corresponding largely with the suggested subfamilies (Fig 2). Viruses that infect Bacillus fell into two groups as described before, represented by the revised Spounavirinae subfamily and the suggested new subfamily “Bastillevirinae.” Similarly, the Listeria and Staphylococcus viruses formed their own clusters, “Jasinskavirinae” and “Twortvirinae”, respectively. This clustering suggests that the major Bacillus, Listeria, and Staphylococcus virus groups are represented, but that further representatives are required from the under-sampled groups. The suggested “Brockvirinae” subfamily is under-sampled, and the grouping observed in the tree is not as well-supported as the other clusters.
Among 1,296 singleton proteins and 2,070 protein clusters defined using the orthologous protein clusters (OPC) approach, we identified 12 clusters common to all viruses (Table 2, S2 Table). Classification of the viral proteins using prokaryotic virus orthologous groups (pVOGs) showed that 38 pVOGs were shared between all 93 virus genomes (Table 2, S3 Table). This finding was in stark contrast with the results from core genome analysis using Roary, which revealed only one core gene (the tail tube protein gene). Upon closer inspection of the gene annotations, we found that these analyses might have been confounded by the presence of introns and inteins in many of the core genes (S5–S6 Figs). Indeed, many genes of spounavirins and related viruses are invaded by mobile introns or inteins [32,33]. These gaps in coding sequences challenge gene prediction tools and introduce additional bias in similarity-based cluster algorithms.
Core genes with putative annotated functions identified in all 93 spounavirin and spouna-like virus genomes.
The pairwise comparison of the predicted proteome content of the viruses revealed a very low overall relatedness at the protein level (S7 Fig). The majority of viruses shared less than 10% of their proteins. However, at the suggested new subfamily rank, we observed obvious virus groups sharing their proteomes. The Enterococcus viruses (“Brockvirinae”) shared over 35% of their protein content. The members of the Bacillus virus genera Spo1virus and Cp51virus of the subfamily Spounavirinae (sensu stricto) had approximately 20% of their proteins in common, whereas the Bacillus virus genera Bastillevirus, B4virus, Bc431virus, Agatevirus, Nit1virus, Tsarbombavirus, and Wphvirus (“Bastillevirinae”) and the Staphylococcus virus genera Kayvirus, Silviavirus, and Twortvirus (“Twortvirinae”) shared over 25% and over 30% of their predicted proteomes, respectively.
Genomic fluidity is a measure of the dissimilarity of genomes evaluated at the gene level [34]. Accordingly, the genomic fluidity results followed those obtained using proteome content analysis (S8 Fig). Despite a high genomic fluidity for most of these viruses, the newly suggested subfamilies and genera were all supported.
The topology of the dendrogram obtained using the average amino acid identity (AAI) approach also supported the suggested new taxonomic scheme (S8 Fig). The AAI was greater than 35% within each subfamily and greater than 67% within each genus. The AAI of all viruses analyzed in this study was not lower than 22%. The members of the genus Wphvirus had the lowest AAI (76%) and the lowest AAI for a pair of proteomes (67% between Bacillus phage W.Ph. and Bacillus phage Eyuki) but surprisingly they had a mid-range genomic fluidity (0.15), suggesting that the protein sequences of wphviruses might have evolved rapidly.
The pangenome of the spounavirins and spouna-like viruses as calculated using Roary [35], consisting of 4,182 genes, was further analyzed by clustering the genomes based on the presence or absence of the accessory genes (S9 Fig). The obtained tree supported the current division of the viruses into approved genera and the suggested new subfamilies.
Many virus genomes are thought to be highly modular, with recombination and horizontal gene transfer potentially resulting in “mosaic genomes” [36,37]. By clustering the spounavirin and spouna-like virus genomes based solely on the gene order of their genomes, we investigated whether the gene synteny was preserved (S10 Fig). The results revealed that genomic rearrangements leave a measurable evolutionary signal in all lineages, since the genomic architecture analysis clustered all viruses according to the suggested taxa with the potential exception of Bacillus phage Moonbeam [38]. However, we did not observe the high modularity that may be expected with rampant mosaicism. The lack of rampant mosaicism supports the recent findings by Bolduc et al. that at most about 10% of reference virus genomes have a high degree of mosaicism [14]. Thus, while the gene order in viruses belonging to the newly suggested family “Herelleviridae” is not necessarily strictly conserved, we observed a clear evolutionary pattern that is consistent with the sequence-based approaches tested in this study.
Single protein phylogenies
The phylogenetic trees based on comparisons of the major capsid, tail sheath, and DnaB-like helicase proteins are presented in Fig 3. All nine phylograms based on conserved proteins (OPC) are depicted in the trees in S11 Fig. For nearly all single marker trees, the topologies supported the suggested taxonomic scheme. Generally, each taxon is represented as a separate branch on the dendrogram. Notable exceptions could be found in two trees based on hypothetical proteins (OPCs 10357 and 10386). The first (10357) placed the revised subfamily Spounavirinae as a subclade of “Bastillevirinae” and the second (10386) shuffled silviaviruses and kayviruses. This result may indicate that some degree of horizontal gene transfer occurs between groups that share common hosts.
Discussion
Using the conventional definition of “tailed phage” families, Myoviridae, Podoviridae, and Siphoviridae (order Caudovirales), researchers effectively classified caudovirads for decades [39,40]. However, the classification of these viruses, defined by a traditional morphology-based approach, has been contested with the advent of high-throughput sequencing. The steadily increasing number of available genomes and debates on the impact of horizontal gene transfer, which marked the late 1990s and early 2000s, resulted in a decade-long moratorium on the introduction of any new taxonomy for prokaryotic viruses [41]. The increasing discrepancy between the official taxonomic framework and the emerging high-volume genomics and metagenomics research left ≈90% of prokaryotic viruses known from genome sequences unclassified beyond the family rank, i.e. they were classified as orphan species in a family. Consequently, prokaryotic virus diversity was vastly underappreciated, and virus genome curation remained in disarray. Recently, rather than implementing a ‘repeal and replace’ strategy for prokaryotic virus taxonomy, the Committee has introduced a holistic system, involving virus particle morphology, overall DNA and protein sequence identities, and phylogeny, an approach used for classification of all other viruses [1,2].
Successful modern taxonomic approaches must scale up to accommodate the increasing pace of prokaryotic virus discovery and be effective across the 4,470 putative complete prokaryotic virus sequences currently deposited in GenBank and other databases [42]. These scaling requirements have remained problematic, and although there are more than 3,400 publicly available caudovirad genomes, only 873 have been officially classified by the ICTV by now [43]. The remaining genomes are provisionally stashed within “unclassified” bins attached to the order Caudovirales or its member families, either because they still need to be classified in a species/genus/subfamily, or because they may be unidentified isolates of already classified viruses.
The growth in the number of prokaryotic virus genome sequences now supports the application of a range of genomic analyses for robust taxonomic classification [44,45]. Meanwhile, phylogenomic approaches are yielding to network-based approaches to better reflect the evolution of viral genomes [9,14]. These network-based approaches help to organize the viral sequence space into statistically-resolvable “viral clusters.” These clusters are approximately equivalent to ICTV-recognized genera and provide a taxonomic description that better reflects evolutionary relationships. Given the high computational demand of network-based approaches, however, the development of centralized resources and authority-monitored cyberinfrastructure, such as the iVirus platform on Cyverse [46] or the Joint Genome Institute Integrated Microbial Genomes Virus resource IMG/VR [47], will have to assist the prokaryotic virus community with large dataset computation and classification.
Based on the results of this study, we suggest that the group of spounavirin and spouna-like viruses be removed from the family Myoviridae and be given a family rank. Hence, we propose establishing the suggested family “Herelleviridae”, in the order Caudovirales next to a smaller Myoviridae and the established Podoviridae and Siphoviridae families. The new family would contain five subfamilies: Spounavirinae (sensu stricto), “Bastillevirinae”, “Twortvirinae”, “Jasinskavirinae”, and “Brockvirinae”, each comprising the ICTV-established genera listed in Table 1 (with additional information in S1 Table).
This study represents the first example of a true taxonomic assessment from an ‘ensemble of methods’. The de facto taxon splitting suggested here results from the observed diversity of prokaryotic viruses. We are encouraged that the combination of genome BLAST analyses, virus proteomic trees, core protein clusters, genomic synteny (GOAT), and single gene phylogenies yields consistent and complementary results, showing the robustness of the suggested taxa. In addition, the suggested genera correspond well with the taxonomy of the hosts (Bacillus, Listeria, Staphylococcus, and Enterococcus) indicating broader microbiological consistency. Moreover, only approximately 3% of viruses are left as unassigned at the genus and subfamily rank at this time within this group. These unassigned viruses may represent clades at the genus and subfamily rank that are still under-sampled.
This work demonstrates the usefulness of genome-based classification at a higher taxonomic rank and its ability to accommodate the complex viral diversity. Substitution of the families Podoviridae, Myoviridae, and Siphoviridae with a set of new families which more faithfully reflect the true genetic relationships of the viruses would clarify the taxonomic situation. However, this change does not remove the historically established virus morphotypes observed in nature among caudovirads: myovirids forming particles with contractile tails, siphovirids forming particles with long non-contractile tails, and podovirids forming particles with short non-contractile ones. By disconnecting morphotype and family classification of caudovirads, taxonomically related clades can be grouped across morphotypes. This grouping includes the muviruses suggested to be classified in the family “Saltoviridae” [48] and potentially the broad set of Escherichia phage lambda-related viruses that are currently distributed among the families Siphoviridae and Podoviridae [49].
We believe that abolishment of the Podoviridae, Myoviridae and Siphoviridae families will soon be followed by the “upgrade” of existing viral taxonomy with additional taxon ranks required to accommodate the observed diversity in an orderly manner.
Materials and methods
Creation of the dataset
Genome sequences of known spounavirins and spouna-like viruses were retrieved from GenBank or (preferably) RefSeq databases based on literature data, ICTV, and taxonomic classifications provided by the National Center for Biotechnology Information (NCBI). Records representing genomes of candidate spounavirins and spouna-like viruses were retrieved by searching the same databases with the tBLASTx algorithm using terminase and major capsid proteins of several type virus isolates as a query (i.e., Bacillus phage SPO1, Staphylococcus phage Twort, Bacillus phage Bastille, Listeria phage A511, Enterococcus phage ϕEF24C, and Lactobacillus phage LP65) [50,51]. Sequences were manually curated and pre-clustered using CLANS (E-value cut-off 1e-10) to confirm their spounaviral affiliation [52]. This search yielded a set of 93 complete virus genomes, which were used in the following analyses (S1 Table).
The coding sequences in the genomes were re-annotated using PROKKA with the settings --kingdom Viruses, --E-value 1e-6 [53]. All genome sequences are available from NCBI (accession number information listed in S1 Table) or from GitHub (github.com/evelienadri/herelleviridae).
Genome-based analyses
Gegenees [54] was used in BLASTn and tBLASTx modes (fragment length 200 bp; step length 100 bp) to analyze virus genome nucleotide similarities. Pairwise identities between all genomes under study were determined using BLASTn and tBLASTx algorithms with default parameters [55]. Symmetrical identity scores (% SI) were calculated for each pairwise comparison using the formula
in which HL is defined as the hit length of the BLAST hit, HI is defined as the percentage hit identity, QL is defined as the query length, and SL is defined as the subject length. Symmetrical identity scores were converted into distances using the formula
The resulting distance matrix was hierarchically clustered (complete linkage) using the hclust function of R [56]. Trees were visualized using Itol [57].
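The steps in this subsection (symmetrical identity, conversion to distance, complete-linkage clustering) can be sketched in a few lines. The two formulas used here are assumptions chosen to be consistent with the variable definitions above, namely %SI = 2 × HL × HI / (QL + SL) and distance = 100 − %SI; they are not quoted from the authors. The clustering routine is a naive pure-Python stand-in for R's `hclust` with complete linkage, and all function names are hypothetical.

```python
from itertools import combinations


def symmetric_identity(hit_length, hit_identity_pct, query_length, subject_length):
    """ASSUMED form of the symmetrical identity: %SI = 2 * HL * HI / (QL + SL)."""
    return 2.0 * hit_length * hit_identity_pct / (query_length + subject_length)


def si_to_distance(si_pct):
    """ASSUMED conversion of %SI into a distance: 100 - %SI."""
    return 100.0 - si_pct


def complete_linkage(names, dist):
    """Naive agglomerative clustering with complete linkage.

    `dist` maps frozenset({a, b}) to the distance between genomes a and b.
    Returns the merges in order as (cluster_a, cluster_b, height) tuples.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Complete linkage: the largest member-to-member distance.
            height = max(dist[frozenset((x, y))] for x in a for y in b)
            if best is None or height < best[2]:
                best = (a, b, height)
        a, b, height = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, height))
    return merges
```

With two 1,000-bp genomes sharing a 900-bp hit at 90% identity, this assumed formula gives %SI = 81 and a distance of 19, so such a pair would merge well before any pair with only short, low-identity hits.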
All pairwise comparisons of the nucleotide sequences using VICTOR, a Genome-BLAST Distance Phylogeny (GBDP) method, were conducted under settings recommended for prokaryotic viruses [58,59]. The resulting intergenomic distances (including 100 replicates each) were used to infer a balanced minimum evolution tree with branch support via FASTME including subtree pruning and regrafting (SPR) post-processing [60] for each of the formulas D0, D4, and D6, respectively. Trees were visualized with FigTree [61]. Taxon demarcations at the species, genus and family rank were estimated with the OPTSIL program [62], the recommended clustering thresholds [59], and an F value (fraction of links required for cluster fusion) of 0.5 [58].
Proteomic tree
The Phage Proteomic Tree was constructed as described previously [63] and detailed at https://github.com/linsalrob/PhageProteomicTree/tree/master/spounavirus. Briefly, the protein sequences were extracted and clustered using BLASTp. These clusters were refined by Smith-Waterman alignment using CLUSTALW version 2 [64]. Alignments were scored using PROTDIST from the PHYLIP package [65]. Alignment scores were averaged and weighted as described previously [63] resulting in the final tree.
Core protein clusters
Orthologous proteins were clustered using GET_HOMOLOGUES software, which utilizes several independent clustering methods [66]. To capture as many evolutionary relationships as possible, a greedy COGtriangles algorithm was applied with a 50% sequence identity threshold, 50% coverage threshold, and an E-value cut-off equal to 1e-10 [67]. The results were converted into an orthologue matrix with the “compare_clusters” script (part of the GET_HOMOLOGUES suite) [65].
The orthologous protein clusters (OPCs) defined above were used to compute the genomic fluidity for each pair of genomes. For two genomes i and j:
$$\varphi_{ij} = \frac{U_i + U_j}{M_i + M_j}$$
with $U_i$ being the number of genes of i not found in j and $M_i$ being the number of genes in i [34]. The resulting distance matrix was hierarchically clustered (complete linkage) using the hclust function of R [56]. Trees were visualized using Itol [57].
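As a small illustration, pairwise genomic fluidity can be computed directly from the sets of orthologous clusters found in each genome, using the standard definition (U_i + U_j) / (M_i + M_j). The sketch below assumes that each genome is summarized as a set of cluster identifiers, so paralogs within a cluster count once; the function name is hypothetical.

```python
def genomic_fluidity(clusters_i, clusters_j):
    """Genomic fluidity of one genome pair: (U_i + U_j) / (M_i + M_j).

    Inputs are iterables of orthologous-cluster identifiers (ASSUMPTION:
    one identifier per gene family, so paralogs are collapsed).
    """
    genes_i, genes_j = set(clusters_i), set(clusters_j)
    if not genes_i and not genes_j:
        raise ValueError("both genomes are empty")
    unique_i = len(genes_i - genes_j)  # U_i: families of i not found in j
    unique_j = len(genes_j - genes_i)  # U_j: families of j not found in i
    return (unique_i + unique_j) / (len(genes_i) + len(genes_j))
```

Identical gene sets give a fluidity of 0 and fully disjoint sets give 1, which is why the value can be used directly as a distance.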
Multiple alignments were generated for each OPC using Clustal Omega [68]. For each cluster, the amino acid identity between all protein pairs inside the cluster was determined from the multiple alignment. For all genome pairs, the AAI [69] was then computed and transformed into distance using the formula:
The resulting distance matrix was clustered and visualized as described above.
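A minimal sketch of the AAI step follows. It makes two assumptions that are not confirmed by the text: that the AAI of a genome pair is the plain mean of the per-cluster percentage identities over the clusters the two genomes share, and that the transformation into a distance is (100 − AAI) / 100. Function names are hypothetical.

```python
def average_amino_acid_identity(cluster_identities):
    """Mean % identity over shared clusters (ASSUMED definition of AAI).

    `cluster_identities` maps a cluster identifier to the % identity
    between the two genomes' proteins in that cluster.
    """
    if not cluster_identities:
        raise ValueError("the genome pair shares no clusters")
    values = list(cluster_identities.values())
    return sum(values) / len(values)


def aai_to_distance(aai_pct):
    """ASSUMED transformation of AAI (in %) into a distance in [0, 1]."""
    return (100.0 - aai_pct) / 100.0
```

Under these assumptions, a pair at the 22% AAI floor reported in the Results would sit at a distance of 0.78.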
OPCs and multiple alignments for each cluster were used to determine a distance similar to the distance used to generate the Phage Proteomic Tree. To estimate protein distances, in this case, the dist function of the seqinR package [70] was preferred to PROTDIST of the PHYLIP package [65] as the resulting distances are between 0 and 1. Proteomic distances were then computed using the same formula as for the Phage Proteomic Tree. The results were clustered and visualized as described above.
The Dice score is based on reciprocal BLAST searches between all pairs of genomes A and B [31]. The total summed bitscore of all tBLASTx hits with ≥30% identity, alignment length ≥30 amino acids, and E-value ≤0.01 was converted to a distance $D_{AB}$ as follows:
$$D_{AB} = 1 - \frac{S_{AB} + S_{BA}}{S_{AA} + S_{BB}}$$
in which $S_{AB}$ and $S_{BA}$ represent the summed bitscores between tBLASTx searches of A versus B, and B versus A, respectively, while $S_{AA}$ and $S_{BB}$ represent the summed tBLASTx bitscores of the self-queries of A and B, respectively. The resulting distance matrix was clustered with BionJ [71].
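The hit filtering and the conversion to a distance can be sketched as below. The filter thresholds are those stated in the text; the distance is taken to have the form 1 − (S_AB + S_BA) / (S_AA + S_BB), which is an assumption consistent with the four quantities defined above. The tuple layout of a hit and the function names are hypothetical.

```python
def summed_bitscore(hits, min_identity=30.0, min_length=30, max_evalue=0.01):
    """Sum bitscores of tBLASTx hits that pass the stated filters.

    Each hit is a (percent_identity, alignment_length_aa, evalue, bitscore)
    tuple (HYPOTHETICAL layout chosen for this sketch).
    """
    return sum(
        bitscore
        for identity, length, evalue, bitscore in hits
        if identity >= min_identity and length >= min_length and evalue <= max_evalue
    )


def dice_distance(s_ab, s_ba, s_aa, s_bb):
    """ASSUMED form: D_AB = 1 - (S_AB + S_BA) / (S_AA + S_BB)."""
    if s_aa + s_bb <= 0:
        raise ValueError("self-query bitscores must be positive")
    return 1.0 - (s_ab + s_ba) / (s_aa + s_bb)
```

A genome compared with itself yields a distance of 0, and two genomes without any qualifying hits yield a distance of 1.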
To investigate a genomic synteny-based classification signal, we developed a gene order-based metric built on dynamic programming, the Gene Order Alignment Tool (GOAT, Schuller et al.: Python scripts are available on request, manuscript in preparation). GOAT first identified protein-coding genes in the 93 spounavirin and spouna-like virus genomes using Prodigal V2.6.3 in anonymous mode [72], and assigned them to the latest pVOGs [73]. pVOG alignments (9,518) were downloaded (http://dmk-brain.ecn.uiowa.edu/pVOGs) and converted to profile hidden Markov models (HMMs) using HMMbuild (HMMer 3.1b2, [74]). Proteins were assigned to pVOGs using HMMsearch (E-value <1e-2) and used to generate a synteny profile of every genome. GOAT accounted for gene replacements and distant homology by using an all-vs-all similarity matrix between pVOG pairs based on HMM-HMM similarity (HH-suite 2.0.16) [75]. Distant HHsearch similarity scores between protein families were calculated as the average of reciprocal hits and used as substitution scores in the gene order alignment. The GOAT algorithm identified the optimal gene order alignment score between two virus genomes by implementing semi-global dynamic programming alignment based only on the order of pVOGs identified on every virus genome. To account for virus genomes being cut at arbitrary positions during sequence assembly, GOAT transmutes the gene order at all possible positions and in both sense and antisense directions in search of the optimal alignment score. The optimal GOAT alignment score $G_{AB}$ between every pair of virus genomes A and B was converted to a distance $D_{AB}$ as follows:
$$D_{AB} = 1 - \frac{G_{AB} + G_{BA}}{G_{AA} + G_{BB}}$$
in which $G_{AB}$ and $G_{BA}$ represent the optimal GOAT score between A and B, and B and A, respectively, while $G_{AA}$ and $G_{BB}$ represent the GOAT scores of the self-alignments of A and B, respectively. This pairwise distance matrix was clustered with BionJ [71].
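The general idea of a gene order alignment can be illustrated with a toy sketch. This is not the authors' GOAT code, which is available only on request. It assumes that "all possible positions" means every rotation of the gene order in both orientations, replaces the HHsearch-derived substitution scores with a simple match/mismatch score, uses a hypothetical gap penalty, and assumes the distance 1 − (G_AB + G_BA) / (G_AA + G_BB).

```python
def match_score(pvog_a, pvog_b):
    """Toy substitution score (ASSUMPTION; the paper uses HHsearch similarities)."""
    return 1.0 if pvog_a == pvog_b else -1.0


def semi_global_score(order_a, order_b, score=match_score, gap=-1.0):
    """Best semi-global alignment score: end gaps are free, internal gaps cost `gap`."""
    cols = len(order_b)
    prev = [0.0] * (cols + 1)      # row 0: skipping a prefix of order_b is free
    best_last_col = prev[cols]
    for gene_a in order_a:
        curr = [0.0] * (cols + 1)  # column 0: skipping a prefix of order_a is free
        for j in range(1, cols + 1):
            curr[j] = max(
                prev[j - 1] + score(gene_a, order_b[j - 1]),  # align two genes
                prev[j] + gap,                                # gap in order_b
                curr[j - 1] + gap,                            # gap in order_a
            )
        best_last_col = max(best_last_col, curr[cols])
        prev = curr
    # Free trailing gaps: the alignment may end anywhere in the last row or column.
    return max(max(prev), best_last_col)


def rotations_both_strands(order):
    """Yield every rotation of the gene order, forward and reversed."""
    for oriented in (list(order), list(reversed(order))):
        for start in range(max(len(oriented), 1)):
            yield oriented[start:] + oriented[:start]


def goat_like_score(order_a, order_b, score=match_score, gap=-1.0):
    """Best score of order_a against any rotation/orientation of order_b."""
    return max(
        semi_global_score(order_a, rotated, score, gap)
        for rotated in rotations_both_strands(order_b)
    )


def goat_like_distance(order_a, order_b, score=match_score, gap=-1.0):
    """ASSUMED form: D_AB = 1 - (G_AB + G_BA) / (G_AA + G_BB)."""
    g_ab = goat_like_score(order_a, order_b, score, gap)
    g_ba = goat_like_score(order_b, order_a, score, gap)
    g_aa = goat_like_score(order_a, order_a, score, gap)
    g_bb = goat_like_score(order_b, order_b, score, gap)
    if g_aa + g_bb <= 0:
        raise ValueError("self-alignment scores must be positive")
    return 1.0 - (g_ab + g_ba) / (g_aa + g_bb)
```

Because every rotation is tried, two genomes with the same gene order that were linearized at different positions, or deposited in opposite orientations, come out at distance 0 in this sketch.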
Prokka-reannotated genomes were used to create pan-, core-, and accessory genomes of all selected spounavirins and spouna-like viruses [53]. The annotations were analyzed using Roary [35] with a 50% length BLASTp identity threshold for homologous genes. Roary functions as follows: CD-HIT [76] is used to pre-cluster protein sequences, and an all-vs-all comparison of protein sequences is performed with BLASTp to identify orthologs and paralogs within the genomes. MCL [77] is then used to cluster the genomes based on the presence and absence of the accessory genes. The resulting tree file was visualized using FigTree v1.4.3 [61]. The tree was rooted on Brochothrix phage A9. The gene presence-absence output table from Roary was then imported into R, and pairwise shared gene contents were calculated for each combination of genomes using a custom R script (available from github.com/evelienadri/herelleviridae/tree/master).
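The pairwise shared gene content calculation can be illustrated with a short Python sketch. It is not the custom R script referenced above: the input layout (one set of gene cluster names per genome), the function name, and the choice to express shared genes as a percentage of the first genome's gene clusters are assumptions made for illustration.

```python
from itertools import combinations


def shared_gene_content(presence):
    """Percentage of gene clusters each genome shares with every other genome.

    `presence` maps a genome name to the set of gene clusters it carries
    (hypothetical layout derived from a presence-absence table).
    result[(a, b)] is the percentage of a's clusters also found in b.
    """
    result = {}
    for a, b in combinations(sorted(presence), 2):
        shared = len(presence[a] & presence[b])
        # Guard against genomes without any annotated gene cluster.
        result[(a, b)] = 100.0 * shared / len(presence[a]) if presence[a] else 0.0
        result[(b, a)] = 100.0 * shared / len(presence[b]) if presence[b] else 0.0
    return result


# Toy presence-absence data for made-up genomes (not data from this study).
toy_presence = {
    "phage_A": {"g1", "g2", "g3", "g4"},
    "phage_B": {"g1", "g2", "g5"},
}
toy_shared = shared_gene_content(toy_presence)
```

Because genomes differ in gene number, this percentage is asymmetric: the same two shared clusters make up a different fraction of each genome, so both directions are stored.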
Single gene phylogenies
Based on the OPC and pVOG analyses, we chose nine well-annotated protein clusters present in all 93 spounavirins and spouna-like viruses. Selected clusters included DNA helicases, major capsid proteins, tail sheath proteins, two different groups of baseplate proteins, and four clusters with no known function. The members of these clusters were aligned using Clustal Omega with default parameters [47]. Resulting alignments were analyzed with ProtTest 3.4 [59] to determine a suitable protein evolution model (only variations of models compatible with downstream software, such as JTT and WAG, were considered). Estimated models were used to generate phylograms with FastTree 2.1.7 [60]. The program implements the approximately maximum-likelihood method with Shimodaira-Hasegawa tests to generate the tree and calculate support of the splits. This approach is much faster than “traditional” maximum-likelihood methods with negligible accuracy loss [59–61].
Supporting information
S1 Fig. Heatmap of the blastn-based nucleotide similarities between pairs of genomes as calculated with Gegenees at default parameters.
S2 Fig. Genome-BLAST Distance Phylogeny as calculated using VICTOR.
S3 Fig. Heatmap of the DICE coefficient calculated between each pair of genomes.
S4 Fig. Heatmap of the tblastx-based nucleotide similarities between pairs of genomes as calculated with Gegenees at default parameters.
S5 Fig. Heatmap of the pairwise comparison of all genomes visualized as percentage of shared orthologous proteins (OPCs) as calculated on original GenBank files.
S6 Fig. Heatmap of the pairwise comparison of all genomes visualized as percentage of shared orthologous proteins (OPCs) as calculated on reannotated genomes.
S7 Fig. Heatmap of the pairwise comparison of all genomes visualized as percentage of shared proteins as calculated with Roary on reannotated genome files.
S8 Fig. Clustering trees of genomic fluidity and amino acid identity calculated pairwise between all genomes using orthologous protein clusters.
S9 Fig. Accessory genome clustering tree, calculated based on the presence and absence of accessory genes in each genome.
S10 Fig. Heatmap and clustering tree calculated by the Gene Order Alignment Tool and visualized as a distance matrix between all genome pairs.
S11 Fig. Maximum Likelihood trees of single gene phylogenies using protein clusters present in all 93 genomes.
S1 Table. Overview of the 93 phage genomes used in this study.
S2 Table. Complete list of all orthologous proteins identified in the set of 93 spounavirin and spouna-like virus genomes.
S3 Table. Complete list of prokaryotic virus orthologous groups identified in the set of 93 spounavirin and spouna-like virus genomes.
Competing interests
All authors, except for MBPS, are members of the Bacterial and Archaeal Viruses Subcommittee of the International Committee on Taxonomy of Viruses (ICTV). The authors declare no conflicts of interest.
Acknowledgments
We thank Laura Bollinger, Integrated Research Facility at Fort Detrick, for technical writing services. EMA would like to thank PH Nel for assistance with R scripting. JB was supported by the National Science Centre, Poland (SONATA 12 grant number 2016/23/D/NZ2/00435). MBPS and BED were supported by the Netherlands Organization for Scientific Research (NWO) Vidi grant 864.14.004. RAE was supported by grants DUE-132809 and MCB-1330800 from the US National Science Foundation. HMO was supported by University of Helsinki funding for Instruct research infrastructure (Virus and Macromolecular Complex Production, ICVIR). AG was supported by a Chargé de Recherches fellowship from the National Fund for Scientific Research (FNRS, Belgium). FE was supported by the EU Horizon 2020 Framework Programme for Research and Innovation (‘Virus-X’, project no. 685778). MBS would like to acknowledge the Gordon and Betty Moore Foundation Investigator Award (GBMF#3790) for funding.
Footnotes
* evelien.adriaenssens{at}gmail.com (EMA)
This paper is dedicated to Hans-Wolfgang Ackermann, a pioneer of prokaryotic virus electron microscopy and taxonomy, who died on February 12th, 2017, at the age of 80. He was involved in the early stages of this study, and his input is dearly missed.