assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing

ABSTRACT

critical given the modern pace of viral discovery in which, as of November 2018, hundreds of thousands of metagenome-derived viral genomes and large genome fragments (735,112 at IMG/VR 25 , with a total of 110,384 'species' or viral OTUs) now dwarf the 34,091 available from prokaryotic virus sequences in the NCBI GenBank database 26 .Together with the recently proposed 'minimum information about uncultivated virus genomes' (MIUViGs) community guidelines 27 , evaluation of approaches to establish a scalable, genome-based viral taxonomy is needed to enable a universal classification framework.The ability to classify thousands of microbial UViGs is invaluable for foundational hypothesis testing anywhere microbes, and their viruses, might matter.With microbiome and phage therapy research exploding, this includes ecosystems such as industrial bioreactors and the human gut and lung, as well as the oceans, the deep-sea floor, and soils.
Multiple genome-based strategies have been proposed to develop such a unified taxonomic framework for viruses infecting bacterial 15,[28][29][30][31][32][33] , archaeal 34 and eukaryotic 35 hosts.For bacterial viruses ("phages"), an early approach to a universal taxonomy targeted phage relationships only by using complete genome pairwise protein sequence comparisons in a phylogenetic framework (the "phage proteomic tree") and was broadly concordant with ICTV-endorsed virus groupings of the time 15 .Such efforts were not widely adopted, presumably because (i) the demand was limited (few metagenomics studies existed), and (ii) the paradigm was that "rampant mosaicism" would blur taxonomic boundaries and violate the assumptions of the underlying phylogenetic algorithms used in the analyses 17 .Other efforts sought to establish percent of genes shared and percent identity of shared genes cut-offs to define genera and sub-family affiliations 36,37 , but lacked taxonomic resolution for several virus groups.This lack of resolution was due to the likelihood that the mode and tempo of prokaryotic virus evolution could vary significantly across the viral sequence space 22 .Building upon a prokaryotic classification algorithm, the Genome Blast Distance Phylogeny (GBDP) 38 , a freely accessible online tool (VICTOR) now provides for classification of phage genomes via combined phylogenetic and clustering methods from nucleotide and protein sequences 31 .Although a key advance, this method suffers from limited scalability (100 genomes limit) and taxonomic assignment challenges for viruses that lack closely related reference genomes.
Alternatively, several groups reasoned that the highly variable evolutionary rates across phage sequence space could be examined through gene sharing networks 29,30,39 to determine whether a meaningful structure, and therefore taxonomic signal, occurs in this space.These networks, based on shared protein clusters (PCs) between viral genomes, were largely concordant with ICTV-endorsed taxa independent of whether monopartite 29 (a single node type, i.e., viral genomes) or bipartite networks 34,39 (two node types, i.e., viral genomes and genes) were used.Given these successes, we revisited the monopartite gene sharing network approach to establish an iVirus 40 app (vConTACT v1.0, hereafter v1.0) to automate a network-based classification pipeline for prokaryotic virus genomes.Performance tests indicated that the network analytics used by v1.0 produced viral clusters ('VCs') that were ~75% concordant with recently established ICTV prokaryotic viral genera, even with seven times more genomes now available 30 .The capacity to incorporate these genomes and accuracy of the network-based analytics have resulted in viral taxonomy applications across large-scale studies of ocean 41,42 , freshwater 43 and soil 44 , and studies of single-virus amplified genomes (vSAGs) 45,46 -all environments where the viruses observed were largely so novel as to not be classifiable outside of these gene-sharing network approaches.These v1.0 advances were an important step forward, but they could not be used for automated tentative taxonomic assignments.This is because v1.0 created artifactual VCs of both undersampled genomes and highly overlapped regions of viral sequence space 30 , and also lacked several key, community-desired features including per-VC confidence metrics, a metric for establishing hierarchical taxonomy, and scalability.
Here we introduce and evaluate vConTACT v2.0 (hereafter v2.0), which significantly updates the network analytics and feature set of the original program.We apply this program to (i) establish a centralized, 'living' taxonomic reference network as a foundational community resource and (ii) demonstrate that v2.0 is robust and scalable to modern datasets.

vConTACT v2.0 key features and updates
The underlying goal of vConTACT is to automatically assign viral genomes into relevant established or tentatively new taxa, with performance assessed relative to ICTV-assigned, manually-curated taxa (see Fig. 1 for conceptual overview).However, in the current ICTV taxonomy for prokaryotic viruses, taxonomic classifications above the genus level are only sporadically available.For example, of the 2,304 prokaryotic virus genomes available at RefSeq, 84.2% are unclassified at the subfamily level, and 61.6% are unclassified at the order level with virtually all of the remaining 38% lumped into a single "Caudovirales" order.Moreover, among the Caudovirales, the three phenotypically recognized and dominant bacterial virus family level designations -Podoviridae, Myoviridae and Siphoviridae -are being called into question by genome-based taxonomy methods [47][48][49] and are thus in flux.Therefore, we focused specifically on assigning viruses at the genus level, as it is the most 'stable' level and constitutes the principal taxon on which molecular classification is based on in the ICTV taxonomy.
In a network-based genome taxonomy framework (Fig. 1a), related genomes emerge as a group of nodes strongly connected through multiple edges, here termed a Viral Cluster, or 'VC'.In a taxonomic context and based on the clustering of viral reference genomes, we have previously demonstrated that the network parameters can be tuned such that the VCs best represent genus-level grouping of viral genomes 30 .In the original v1.0, ~75% of VCs corresponded to established ICTV genera 30 ('concordant VCs'), while the remaining mismatched VCs were termed 'discordant VCs'.These mismatches in the discordant VCs can occur from any of three scenarios (Fig. 1b): (i) VCs that wrongly clustered genomes with no close relatives (termed 'outlier genomes' from 'undersampled VCs'), (ii) a given VC that contained multiple ICTV genera represented by virus genomes that shared many genes and/or modules with other VCs (termed 'overlapping VCs'), and (iii) VCs that encompassed multiple ICTV genera sharing many genes and/or gene modules across genomes within the VC, and within subsets of the genomes in the VC (termed 'structured VCs').While a significant advance, v1.0 struggled to adequately differentiate between these issues.Further, v1.0 lacked several key features to enable broader adoption and utility as described above.
To address these issues in v2.0, we evaluated and ultimately implemented a new clustering algorithm, established confidence scores and distance-based taxon separation for hierarchical taxonomy, and optimized and evaluated scalability and robustness to a large-scale viral metagenomic dataset.Briefly, after the MCL-clustered protein clusters are generated, we optimized the protein-cluster-based genesharing information to establish an automated two-step process whereby VCs are defined using ClusterONE 50 (CL1), instead of MCL used in v1.0, and then subdivided using hierarchical clustering to disentangle problematic regions of the networks (Fig. 1b, see Online Methods).This approach considers edge weight (i.e., degree of connection between genomes) to identify outlier genomes that are weakly connected with members of their VC compared to neighbour genomes, detect and separate genomes that 'bridge' overlapping VCs, and break down structured VCs into concordant VCs through distance-based hierarchical clustering (Fig. 1b).
Additionally, v2.0 incorporates confidence scores for each VC to help differentiate between meaningful taxonomic assignments and those that might be artifacts.Briefly, each VC receives two types of confidence scores: a topology-based score (value range 0-1), which aggregates information about network topological properties, and a taxonomy-based score (value range 0-1), which estimates the likelihood of predicted VCs to be equivalent to a single ICTV genus (Online Methods).Higher values indicate either more confident linkages within the VC or better taxonomic agreement for the topology and taxonomy-based scores, respectively, and the taxonomy-based score is used to automatically optimize the hierarchical clustering of structured VCs into ICTV-concordant 'subclusters'.
Finally, although v2.0 is presented here as a monopartite (one type of node) network tool, it produces the necessary output to also be visualized as a bipartite network (Supplementary Fig. 1).In these bipartite visualizations, two types of nodes are used to display genomes and their connecting, shared protein clusters (PCs).This information about which PCs link a given set a viruses together are also provided (Supplementary Table 1; see Online Methods), as it enables researchers to identify specific core virus group genes that may be of value for establishing novel gene markers and other additional downstream analyses.

General performance comparison of vConTACT versions 1.0 and 2.0
To assess clustering performance of v1.0 and v2.0, we quantified ICTV correspondence for the 940 archaeal and bacterial virus genomes that had ICTV genus-level classification (accessed January 2018, see Online Methods and Supplementary Table 2).Clustering performance was evaluated through a composite performance score of Accuracy (Acc) and Separation (Sep).Both Acc and Sep are aggregate measures themselves (see Online Methods), which indicate clustering precision, and how resulting clusters (or VCs) correspond to a single ICTV genus, respectively (Fig. 2a).Each of these metrics has a value range from 0 to 1 with 1 indicating perfect clustering accuracy and/or coverage.
Compared to the original v1.0's performance (which used MCL at an inflation factor, or IF, of 1.4, see Online Methods), v2.0's CL1, combined with hierarchical clustering, resulted in an overall performance improvement of 28.8% (Fig. 2a).This increase indicated the overall improved ability of the tool to correctly group viruses into their appropriate VC, and how each VC corresponded to its ICTV genus counterpart.Though further feature enhancements also advanced v2.0, we wondered which aspect of our changes most improved performance.To assess this, we further optimized v1.0's MCL-based VC clustering and found that, at an IF of 7, we could achieve nearly equivalent performance (Fig. 2a, Supplementary Table 3) and more VCs predicted by the optimized MCL-based configuration as it organized the 940 viral genomes into 180 VCs, whereas v2.0's CL1 identified 157 VCs.However, higher values in Sep for CL1 indicate better performance for assigning single genera into single VCs, even though MCL at its optimal IF value (i.e., 7) generated more VCs (Supplementary Table 1).Thus, although more VCs were assigned to ICTV genera by the optimized MCL configuration, they were largely discordant VCs of either lumped or split ICTV genera, or both; whereas this behavior was ~50% reduced using CL1 (see Supplementary Fig. 2a and b).Among these 22 lumped or split VCs from the optimized MCL configuration, the virus genomes shared very few proteins (average = 17% range: 1-30%; Supplementary Fig. 1b) similarities, which modern cut-offs would suggest should have been separated as separate genera, here outliers in the network.To better resolve these issues, we added a postprocessing, Euclidean distance-based hierarchical clustering step to split mismatched VCs in v2.0.This step accurately and automatically classified 36 additional genera from the problematic structured VCs (Supplementary Table 2), which increased v2.0's Sep value by 7%.Together, these findings suggested that both upgrading the clustering algorithm and adding hierarchical clustering were critical to improve automatic VC assignments.

Performance assessment of vConTACT v2.0 for complex genomic relationships
Next, we assessed how v2.0 specifically handled areas of the reference network that represent the three broad scenarios that yield discordant VCs (Fig. 1b).First, 55% of ICTV genera are undersampled (Supplementary Table 2), which in a gene-sharing network manifests as weakly connected, small VCs prone to artifactual clustering (Fig. 1b, top row) due to outlier genomes only weakly connected to any given VC.In v1.0, undersampled VCs accounted for 64% (28/44) of all discordant VCs, and could not be resolved by increasing IF values (Fig. 2b and d and Supplementary Table 2).In contrast, v2.0 automatically and accurately handled these same 28 undersampled VCs (comprising 60 genomes) by splitting the 37 problematic genera into 22 outliers (i.e., genera with only one member) and correctly placing the remaining 38 genomes from 15 genera into 15 now concordant VCs (Fig. 2c and d and Supplementary Table 2).Thus, v1.0 performed poorly on under-sampled VCs, whereas v2.0 was able to automatically resolve all under-sampled VCs in this reference network into their appropriate ICTV genera.
Second, we evaluated the ability of v2.0 to handle overlapping VCs (Fig. 1b), which are those VCs that share a greater fraction of genes with members of other VCs than typically expected, presumably due to gene exchange that could erode structure of the network.While overlapping VCs cannot be automatically identified in v1.0, we can detect them in v2.0 through a 'match coefficient' that measures the connection within-and between-other VCs (see Online Methods).Sensitivity analyses established a cluster overlap value of 0.8 as diagnostic (Online Methods).This approach identified nine overlapping VCs (ICTV-classified genera only) containing 30 viruses across 11 ICTV genera.These included viruses with known mosaic genomes 48 (e.g., lambdoid or mu-like phages of the P22virus, Lambdavirus, N15virus, and Bcepmuvirus genera), recombinogenic temperate phages 51,52 (i.e., Mycobacterium phages of the Bignuzvirus, Phayoncevirus, and Fishburnevirus genera and Gordonia phages of the genus Wizardvirus), and three newly-established genera (i.e., Cd119virus, P100virus and archaeal Alphapleolipovirus), all bearing low topology-based confidence scores (averages of 0.32 for these VCs versus 0.52 for concordant VCs; P-value = 6.12e-09,Mann-Whitney U test) (Supplementary Fig. 3a).
Overlapping VCs are also linked to high horizontal gene flow, since most viruses in these VCs were classified as having high gene content variation (HGCF, Fig. 2e, Supplementary Fig. 3b) as assigned by a recently proposed framework of phage evolutionary lifestyles 22 .Though unresolvable in v1.0, v2.0 could assign eight of the 11 ICTV genera (24 viruses) into 8 ICTV-concordant VCs (Supplementary Table 2).The remaining 3 ICTV genera, all comprised of Mycobacterium phages 53 (6 genomes), could not be resolved (Supplementary Table 2), and may not be amenable to automated taxonomy.While these highly recombinogenic genomes represent a minor portion of the viruses classified by ICTV, their prevalence across environments remains to be determined.Third, structured VCs (Fig. 1b, bottom row) contained genomes that our gene sharing networks placed into a single VC due to many shared genes and/or gene modules across all the member genomes, but distributed into several ICTV genera due to subsets of the genomes also sharing additional genes (see Supplementary Note 1).While we showed in v1.0 that these structured VCs could be decomposed through hierarchical clustering 27 , in v2.0, we formalized an optimized, quantitative hierarchical decomposition distance measure for this process (Online Methods and Supplementary Fig. 4).In the v2.0 network, 23 of the 31 discordant VCs (74%) were structured VCs, spanning 86 genera (Fig. 3a,b and Supplementary Table 2).The automatic v2.0 approach resolved 30% (26 of 86) of these ICTV genera from 6 of the 23 structured VCs (Fig. 3c).
Given such strong performance, even with challenging regions of viral sequence space, we suggest that this gene sharing network already offers significant new taxonomic insights.As described earlier, only 41% of the 2,304 reference virus genomes are classified by ICTV at the genus rank, which leaves 1,364 reference viruses that are currently not assigned to a genus.In our networks, these 1,364 reference viruses organized into 404 well-supported VCs (Supplementary Table 2).Of these 1364, 544 were assorted into 104 VCs with genomes from known ICTV taxa, whereas 820 formed 200 separate VCs.We propose that the 820 can be classified and the 200 VCs represent bona fide novel virus genera, to be evaluated as formal ICTV taxa.If adopted, this would immediately double the number of known prokaryotic viral genera, as there are only 264 currently recognized.Beyond providing a starting point for identifying areas of viral sequence space needing revision, the manual curation process itself would provide feedback for improving v2.0.
Hierarchical decomposition of structured VCs into subclusters indicated that the gene content-based distance correctly recapitulated the ICTV taxonomy, but that the cut-offs used to define subclusters are different from the ones currently used to delineate established genera (Fig. 3c and Supplementary Fig. 4).
It is long thought that universal cut-offs may not be appropriate across all of viral sequence space with the result that years of manual curation by experts has resulted in specialized demarcation cut-offs across viral sequence space 54 .However, just has recently been advanced for microbes 55 any steps towards this, such as distances in well-sampled and well-resolved portions of the reference networks, will be invaluable for automating virus taxonomy.The v2.0 VCs and subclusters provided here provide a case study and a reference baseline for working with the ICTV to translate such network-derived cut-offs into systematic taxonomic demarcation criteria.
Finally, there will be some taxon assignments that are not amenable to being resolved by genesharing networks.For example, cases where genera are defined based on phenotypic or evolutionary evidence, e.g., archaeal fuselloviruses 56 (VC42) and bacterial microviruses 57 (VCs 30 and 49), a genesharing network approach will not be appropriate (see Fig. 3c and Supplementary Table 2).However, an automated vConTACT-based approach can help systematically identify such problematic taxa and drastically speed up these critical revisions to our taxonomy as new data become available.

vConTACT v2.0 is scalable to modern virome datasets
Beyond accurate classification, a major bottleneck regarding automated taxonomic assignments is the ability to scalably and robustly integrate large sets of newly discovered virus genomes.To evaluate this, we added 15,280 curated viral genomes and large genome fragments (≥ 10 kb) from the Global Ocean Virome (GOV) dataset 41 to our reference network in successive 10% increments (i.e., 0%, 10%, …, 100% of the total dataset), totalling 16,960 sequences (Fig. 4a).Through this process, we evaluated 2 types of network changes to assess robustness.First, we checked whether the incremental addition of GOV data to the network would lead to changes in node connections, as estimated by the 'change centrality' metrics (CC, values range from 0-1 with 0 indicating no change and 1 indicating complete change, see Fig. 4b).
Second, we evaluated the concordance between v2.0 clustering and ICTV genera using the same performance metrics as above (Sn, Acc & PPV, see Fig. 4c).As seen in Fig. 4b, a large fraction of the data initially experiences a moderate change (CC = 0.4), but the whole dataset eventually stabilized, as CC values for most of the data ranged from 0 to 0.1.A similar trend was observed for accuracy (Acc, Fig. 4c).This indicated that, not only can v2.0 scalably handle thousands of new input sequences, but our original reference network clustering is robust to large-scale data additions such that the underlying reference network remains consistent with established ICTV genera.
Outside of better sampled genera or VCs, we also examined whether GOV data may partially resolve ICTV outlier and singleton genomes as a proxy for assessing the taxonomic ramifications of so much new data.We reasoned that more data might create new connections to singletons such that outliers may become better connected to new or existing VCs.We found that, of 38 single-member VCs of singleton and outlier genomes (Supplementary Fig. 6), three Mycobacterium phage VCs were improved, while two other Mycobacterium virus genomes were merged into larger heterogeneous VCs now composed of six ICTV genera, which did not constitute an improvement.This implies that even though the GOV data derive from an environment very different to that of most of the reference taxa (oceans vs soils and humans), these environmental data could help improve some taxon assignments of isolate genomes.
Separately, looking at the overall taxonomic impact, we observed that 919 new VCs were created with the full GOV dataset (15, 280 total contigs).Given the strong concordance of the 'known' VCs to ICTV genera, we posit that these new VCs represent 919 new viral genera that are not represented among the 264 ICTV genera already known from RefSeq genomes.If true, this demonstrates the power of integrating viral sequence datasets from under-explored environments to better map viral sequence space.
While enticing, we agree with the recent consensus statement that any taxonomic reference network be constrained to complete genomes 24 , and that large genome fragments commonly derived from metagenome-based studies be utilized in a relevant manner to address questions specific to that study.

Community availability and future needs
The utility of v2.0 depends upon its expert evaluation and community availability.In close discussion with members of the ICTV Bacterial and Archaeal Viruses Subcommittee, we made the resulting optimized tool available in two ways.First, the source code is available through Bitbucket (https://bitbucket.org/MAVERICLab/vcontact2)as a downloadable python package.V2.0 is a highly scalable tool, with memory and CPU requirements concomitant with the amount of sequences processed.
The largest memory requirements are by the all-versus-all protein comparisons used to build the protein clusters.Overall, there is a strong linear (R 2 = 0.99, see Supplementary Fig. 7) correlation between number of sequences and runtimes.For example, running the full virus dataset RefSeq with Diamond (for the protein comparisons) would take ~10 minutes on a regular laptop, while a GOV-sized dataset would run for several hours.Second, for ease of use, v2.0 is also available as an app through iVirus 40 , the viral ecology apps and data resource embedded in the CyVerse Cyberinfrastructure, with detailed usage protocols available through Protocol Exchange (https://www.nature.com/protocolexchange/)and protocols.io(https://www.protocols.io/).Finally, the curated reference network is available at each of these sites, and will be updated approximately bi-yearly with as complete genomes become available and resources exist to support this effort.Although v2.0 performance metrics are strong and provide a critically needed, systematic reference viral taxonomic network, limitations remain.First, the complete reference network needs to be rebuilt each time new data are added.Avoiding this reconstruction step will require the development of approximation methods and/or a placement algorithm (akin to PPlacer for 16S phylogenies 58 ) to incorporate new data.Second, CL1-based VC generation may require manual parameter optimization if datasets of an extreme, and probably unlikely, nature are encountered.Such datasets would have to be dominated by overlapping genomes (i.e., those that share a large proportion of genes), and though such genomes exist (e.g., the HGCF Mycobacterium phages), they are currently rare among cultured isolates and their frequency in nature remains unknown.Notably, at least for the oceans, such highly recombinogenic viruses are likely uncommon as the incremental addition of GOV data resulted in a stable network and separate analyses of a more recent version of this GOV dataset found ocean viral populations to be quite structured 59 .However, in case such datasets dominated by overlapping genomes are encountered, we have added an auto-optimization option for determining the optimal distance for hierarchical decomposition of structured VCs in v2.0.Third, solving other future needs will require improved understanding of the broader viral sequence landscape and evolutionary processes.For example, although v2.0 handles reference prokaryotic virus genomes (including ssDNA or dsDNA phages) and large GOV genome fragments, this framework has not been designed, tested or validated for eukaryotic viruses.These viruses will require new solutions as they have broader genomic configurations, such as genome segmentation, overlapping genes, ambisense transcriptional gene configurations, that pose unique computational challenges 35,60 .However, even these viruses can be classified using genome-based metrics (e.g., hidden Markov models protein profiles and genomic organization models) and network analytics, at least at the family-level 35 .Separately, shorter complete prokaryotic virus genomes and small fragments of larger genomes (e.g., ≤ 3 PCs or ≤ 5 genes) are of low statistical power in gene-sharing networks, and will require new solutions to establish higher confidence VCs or remain taxonomically inaccessible via these approaches.Finally, genomes identified as singletons, outliers or overlapping are currently excluded from the gene-sharing network, which leaves a large fraction of viral sequence space unclassified.Although singletons and outliers can be resolved by the addition of new data, overlapping VCs can remain challenging to resolve, particularly for the HGCF phages 22 that are highly recombinogenic.Such mosaic virus genomes are challenging for viral taxonomy.However, they are identifiable in the networks and, at least to date, represent a small fraction of known viral sequence space.Given increased gene flow among temperate phages 22 , it will be valuable to explore viral sequence space in environments where temperate phages are thought to predominate (e.g., soils 61 , human gut 62 ).
In spite of these limitations, vConTACT v2.0 already provides upgrades in performance and its feature set (per-VC confidence scores, user-desired outputs and usability) such that it offers a scalable, robust, systematic and automated means to classify large swaths of bacterial and archaeal virus sequence space.Its limitations are largely surmountable through future research, and coordinated efforts with the research community and the ICTV will only make these gene-sharing network approaches more scalable and broadly applicable.Assuming broad acceptance of these efforts, and parallel efforts with eukaryotic viruses 35 , we might finally have the foundation needed to realize the consensus statement goals 24,27 of establishing a genome-based viral taxonomy to better capture the broader viral sequence landscape emerging from environmental surveys.2).In parallel, the ICTV taxonomy (ICTV Master Species List v1.3, as of February, 2018) was retrieved from the ICTV homepage (https://talk.ictvonline.org/files/master-species-lists/).ICTV-classifications were available for a subset of genomes at each taxonomic rank, and the final dataset included: 884 viruses from two orders, 974 viruses from 23 families, 363 viruses from 28 subfamilies, and 940 viruses from 264 genera.To maintain hierarchical ranks of taxonomy, we manually incorporated 2016 and 2017 ICTV updates 49,64,65 to NCBI taxonomy when ICTV taxonomy was absent.

Generation of viral protein clusters.
Both version 1 and 2 of vConTACT share an identical protein clustering initial step, in which viral proteins are grouped in protein clusters (PCs) through MCL, followed by the formation of viral clusters (VCs) using either MCL (version 1) or ClusterOne (version 2).
First, a total of 231,166 protein sequences were extracted from the 2,304 viral genomes (above).Second, to group protein sequences into homologous protein clusters (PCs) 30 , all proteins were subjected to all-versus-all BLASTP 66 searches (default parameters, cut-offs of 1E -5 on e-value and 50 on bit score).Third, PCs were generated by applying MCL (inflation factor of 2.0), and resulted into all the proteins being organized into 25,513 PCs, with a fraction of proteins (26,625 or 11.5%) as singletons (i.e.isolated protein with no relatives).
Calculating genome similarity between viruses.The resulting output was parsed in the form of a matrix comprised of genomes, PCs and singleton proteins (i.e., 2,304 × 52,138 matrix) (Supplementary Table 1).We then determined the similarities between genomes by calculating a one-tailed P value of observing at least c PCs in common between each pair of genomes, based on the following hypergeometric equation as per Lima-Mendez et al 29 : in which c is the number of PCs in common; a and b are the numbers of PCs and singletons in genomes A and B, respectively; and n is the total number of PCs and singletons in the dataset.The hypergeometric formula calculates the probability of sharing a number of common PCs between two genomes at or above the number (c) under the null hypothesis that the observed result is likely to occur by chance.A score of similarity between genomes was obtained by taking the negative logarithm (base 10) of the hypergeometric P-value multiplied by the total number of pairwise genome comparisons (i.e., (2,304 × 2,303)/2).Genome pairs with a similarity score ≥1 were previously shown to be significantly similar through permutation test where PCs and singleton proteins with genome pairs having a similarity score below the given threshold (negative control) were randomly rearranged.None of the genome pairs in this negative control produced similarity score >1, indicating values above this threshold did not occur by chance 30 .
determine the highest performance on our genome data set (above)To identify the best parameter combination, we used the geometric mean value of prediction accuracy (Acc) and clustering-wise separation (Sep, see next section), as previously described 68 .The final, optimized CL1 parameters were a minimum density of 0.3, a node penalty of 2.0, a haircut of 0.65, and an overlap of 0.8, which resulted in 280 VCs (Supplementary Table 2).
Next, to further decompose 'discordant VCs', we added as a post-clustering step in v2.0, which allows additional hierarchical separation of such VCs into sub-clusters using the unweighted pair group method with arithmetic mean (UPGMA) with pairwise Euclidean distances (implemented in Scipy).To determine the optimal distance for sub-clustering of VCs, we assessed the distances of sub-clusters across all the VCs in the network.We tested the effect of these distances (ranging from 1 to 20 in 0.5 increments) and picked as optimal distance the one which maximized the composite score by multiplying the prediction accuracy (Acc) and clustering-wise separation (Sep) at the ICTV genus rank (see next section).A distance of 9.0 yielded the highest composite score of Acc and Sep (Supplementary Fig. 4).
Notably, vConTACT v2.0 was designed to help users optimize these parameters for grouping of genomes/contigs into VCs and distance for post-decomposition of VCs into sub-clusters.This tool automatically evaluates the robustness of each VCs and sub-clusters, , based on the external performance evaluation statistics (below).
Performance comparison between vConTACT v1.0 and v2.0.Six external quality metrics were used to compare clustering performance between MCL and CL1 68 (Fig. 2a) .Specifically, the performance of v1.0 (MCL) and v2.0 (CL1 alone and CL1 + hierarchical sub-clustering) were evaluated based on : (i) cluster-wise sensitivity, Sn (ii) positive predictive value, PPV (iii) geometric mean of Sn and PPV, Acc (iv) cluster-wise separation, Sep cl (v) complex (ICTV taxon)-wise separation Sep co , and (vi) geometric mean of Sep cl and Sep co , Sep.As an internal parameter, we computed the intra-and inter-cluster proteome similarities (fraction of shared genes between genome that are within the same VCs and different VCs, respectively).For vConTACT v1.0, we only included clustering results which had been determined to yield the highest clustering accuracy value (i.e., inflation factor of 7.0), and this configuration was used for comparison to v2.0's clustering.Therefore, testing each parameter combination (6 performance metrics, for one taxon rank, for 10 clustering results, all cross-compared; i.e., 6 x 1 x 45) resulted in 270 comparisons.
To generate six external measures, we first built a contingency table T, in which row i corresponds to the i th annotated reference complex (i.e., ICTV-recognized order, family, subfamily, or genus), and column j corresponds to the j th predicted complex (i.e., sub-/clusters).The value of a cell T ij denotes the number of member viruses in common between the i th reference complex and j th predicted complex.

Sensitivity:
The sensitivity can be defined as the fraction of member viruses of complex i which are found in sub-/cluster j.
In the formula above, N i is the number of member viruses of complex i.We then calculated the coverage of complex i by its best-matching cluster , as the maximal fraction of member viruses of complex i assigned to the same sub-/cluster by the formula below: The clustering-wise sensitivity was computed as the weighted average of over all complexes.
Higher Sn values indicate a better coverage of the member viruses in the real complexes as: Positive predictive value: The positive predictive value (PPV) indicates the proportion of member viruses of the sub-/cluster j which belong to complex i, relative to the total number of member viruses of the sub-/cluster assigned to all complexes by: where . is the marginal sum of a column j.We calculated the maximal fraction of member viruses of sub-/cluster j found in the same annotated complex , as the prediction reliability of sub-/cluster j to belong to its best-matching complex as: The clustering-wise PPV was then computed as the weighted average of over all sub/clusters by: Higher PPV values indicate that the predicted sub-/clusters are likely to be true positives.
Accuracy: As a summary metric, the Acc can be obtained by computing the geometrical mean of the Sn and PPV values as: Complex-and Cluster-wise separations: With the same contingency table used for Sn, PPV, and Acc, we calculated the relative frequencies with respect to the marginal sums for each row ( , ) and each column ( , ), respectively: , = , ∑ , ⁄ Then the separation is computed as the product of column-wise and row-wise frequencies as:.
The separation values range from 0 to 1, with 1 indicating a perfect correspondence between complex j and sub-/cluster i (i.e., the cluster contains all the members of the complex and only them).Additionally, the separation penalizes the case when member viruses of a given complex are split into multiple sub-/clusters.The complex-wise and cluster-wise values are calculated as the average of over all complexes, and of over all sub-/cluster, respectively: To estimate these separation results as a whole, the geometric mean (clustering-wise separation; Sep) of Sep co and Sep cl was computed: High clustering-wise separation values indicate a bidirectional correspondence between a sub-/cluster and each ICTV taxon: a score of 1.0 indicates that a cluster corresponds perfectly to each taxon.For overall comparison, we used a composite score 50 , calculated by multiplying Acc by Sep.
As an internal measure, the fraction of PCs 30 between two genomes (i.e., proteome similarity) was computed by using the geometric index (G).The proteome similarity was estimated as: in which N(A) and N(B) indicate the number of PCs in the genomes of A and B, respectively.A total of 400,234 pairs of genomes with >1% proteome similarity are shown in Supplementary Table 4.
Clustering-based confidence score.To generate confidence scores for each viral cluster prediction, we used three previously described confidence scoring methods 69,70 , with some modifications.Two of them exploit the network topology properties by assessing the weight of cluster quality and the probability of cluster quality.We then combined these two values as an aggregate topology-based confidence score per VC.For the first scoring method, we computed the quality ( ) of sub-cluster ( ) as: in which and are the total weight of edges that lie within sub-cluster and across others, respectively.For the second method, we evaluated the P-value of a one-sided Mann-Whitney U test for in-weights and out-weights of sub-clusters.The rationale behind this test is that sub-clusters with a lower P-value contains significantly higher in-weights than out-weights, thus indicative that a formed subcluster is valid, and not a random fluctuation.These two independent values, weight of cluster quality and the probability of cluster quality are then multiplied to derive a topology-based confidence score for each cluster.Along with this confidence score, we quantified the likelihood that each sub-cluster corresponds to an ICTV-approved genus (or equivalent) by using distance threshold that are specified at the ICTV genus rank, which we refer to as "taxon predictive score".This score can be calculated as: Specifically, for a sub-cluster ( ) having the genus-level assignment, vConTACT v2.0 automatically measures the maximum distance between taxonomically-known member viruses and calculate the scores by dividing the sum of links having less than the given maximum distance threshold between nodes ( and ) by the total number of links ( ) between all nodes.For a sub-cluster that does not have the genuslevel assignment, v2.0 uses Euclidean distance of 9.0 that can maximize the prediction accuracy and clustering-wise separation (see above) as distance threshold.
Measuring effect of GOV on network structural changes.GOV contigs (14,656 sequences) were added in 10% increments (randomly selected at each iteration) to NCBI Viral RefSeq and processed using vConTACT v2.0 with one difference -Diamond 71 instead of BLASTp was used to construct the allversus-all protein comparison underlying the PC generation.For running this large number of sequences, high-memory computer nodes from the Ohio State supercomputer Center 72 were used.Once generated, vConTACT v2.0 networks were post-processed using a combination of the Scipy 73 , Numpy, Pandas 74 and Scikit-learn 75 python 3.6 packages.Networks were rendered using iGraph 76 .The method to calculate change centrality was calculated as described previously 77 .CCs were calculated in a successive way, in which each addition was compared to Viral RefSeq 85 independently of other additions (0% versus 10%, 0% vs 20%, […], 0% vs 100%).
Data sets.Full-length viral genomes were obtained from the National Center for Biotechnology Information (NCBI) viral reference dataset26,63 ('ViralRefSeq', version 85, as of January, 2018), downloaded from NCBI's viral genome page (https://www.ncbi.nlm.nih.gov/genome/viruses/) and eukaryotic viruses were removed.The resulting file contained a total of 2,304 RefSeq viral genomes including 2,213 bacterial viruses and 91 archaeal viruses (Supplementary Table

Figure 1 .
Figure 1.Virus genome classification visualized as networks.(a)Left side panel: matrix of shared protein clusters (PCs, grey blocks) between a set of virus genomes can be visualized as a network of interconnected nodes, as shown on the right-side of the panel.Each node in this sample 6-node network represents a virus genome that may be connected to other nodes through edges.The edge value represents the strength of connectivity between nodes.If a set of nodes have considerably higher edge weights than the rest of the network they are linked to, these are grouped together to form a viral cluster, or 'VC'.(b) Each row depicts a node clustering scenario in which vConTACT v2.0 has improved upon.On the left side, each scenario is first depicted as a genome-PC matrix highlighting how shared protein clusters between certain genomes may induce erroneous virus groupings due to outlier genomes, overlapping viral groups or VCs containing multiple viral groups.On the right side of the matrices, the topology of each clustering scenario is depicted as small networks of nodes (color-coded according to the ICTV genera colors next to the matrices), and shows how vConTACT version 1 and 2 handled clustering of problematic genomes and/or VCs.(c) Heatmap key corresponding to the various values related to edge weight in (a) and (b), which serve to connect the nodes in the networks and shows how closely related each connected node is to other nodes based on the number of common PCs between genomes.Figure 2. Performance evaluation of vConTACT versions 1.0 and 2.0 on prokaryotic virus genomes.(a) The same colors denote individual performance metrics for the ICTV genera (G) including 940 viral genomes , which are achieved by the Markov clustering (MCL) algorithm at each inflation factor (v1.0)as well as ClusterONE (CL1) and CL1 followed by distance-based hierarchical clustering (CL1 + H) (v2.0), respectively.For more objective comparisons, MCL at an inflation factor of 7.0 followed by hierarchical clustering (IF 7.0 + H) with the same distance (i.e., 9.0) used for v2.0 was included.The total height and number of each bar indicate the composite score for overall performance comparison (Fig.2a).For details, see Online Methods.(b, c) Gene-sharing networks were built using 2,304 archaeal and bacterial virus genomes retrieved from Viral RefSeq v85.Viral clusters (VCs) were obtained by vConTACT v1.0(b) and v2.0 (c) that used MCL with inflation factor (IF) of 7.0 and CL1, respectively (see Online Methods).For both networks, genomes (nodes) are color-coded according to their taxonomic assignments.For example, genomes (only members of the ICTV-recognized genus) that are classified in VCs containing a single ICTV genus are colored in cyan, while genomes found in VCs containing more than two genera are colored in pink.Genomes without ICTV genus affiliation are in grey.Nodes with bold borders indicate those that were correctly identified either as outlier, overlap genomes or separate VCs through v1.0 (b), compared to v2.0 (c).Genomes whose taxonomic assignments and/or annotation are incomplete are colored in yellow, identified through v2.0 (c).For details, see Supplementary Figs3 and 5and Supplementary Table2.(d) Box plots of the percentage of shared protein clusters (PCs) between member viruses within 28 v1.0-generated undersampled VCs having ≥2 genera before (pink), and after (cyan) removal of outlier and/or separation into individual clusters by v2.0.(e) Pie charts depicting the number of overlapping genomes that belong to the high (HGCF) or low (LGCF) gene content flux evolutionary modes or mixed and lytic or temperate phages.Data on the lifestyle and evolutionary modes of 74 viruses were collected from Mavrich and Hatfull22 .For details, see Supplementary Fig.3.Figure3.Application of the hierarchical decomposition to discordant VCs.(a) Distribution of all 31 discordant VCs across the archaeal and bacterial virus gene sharing network, where genomes (nodes) of the given VCs are highlighted in pink and others in grey.(b) Box plots show the fraction (%) of protein clusters (PCs) that were shared within an ICTV genus (i.e., intra-genus proteome similarity) and between multiple genera (i.e., inter-genera similarity) found in each discordant VC including structured clusters whose member genera have similar inter-genera and intra-genus similarities (black dot).(c) Left, A full link dendrogram is represented.Note that the Euclidean distance of nine yielded the highest composite score of accuracy (Acc) and clustering-wise separation (Sep) for sub-clusters from all v2.0-generatedVCs, which was used to split the discordant clusters (Online Methods and Supplementary Fig.4).

Figure 2 . 3 . 3 .
Figure 1.Virus genome classification visualized as networks.(a)Left side panel: matrix of shared protein clusters (PCs, grey blocks) between a set of virus genomes can be visualized as a network of interconnected nodes, as shown on the right-side of the panel.Each node in this sample 6-node network represents a virus genome that may be connected to other nodes through edges.The edge value represents the strength of connectivity between nodes.If a set of nodes have considerably higher edge weights than the rest of the network they are linked to, these are grouped together to form a viral cluster, or 'VC'.(b) Each row depicts a node clustering scenario in which vConTACT v2.0 has improved upon.On the left side, each scenario is first depicted as a genome-PC matrix highlighting how shared protein clusters between certain genomes may induce erroneous virus groupings due to outlier genomes, overlapping viral groups or VCs containing multiple viral groups.On the right side of the matrices, the topology of each clustering scenario is depicted as small networks of nodes (color-coded according to the ICTV genera colors next to the matrices), and shows how vConTACT version 1 and 2 handled clustering of problematic genomes and/or VCs.(c) Heatmap key corresponding to the various values related to edge weight in (a) and (b), which serve to connect the nodes in the networks and shows how closely related each connected node is to other nodes based on the number of common PCs between genomes.Figure 2. Performance evaluation of vConTACT versions 1.0 and 2.0 on prokaryotic virus genomes.(a) The same colors denote individual performance metrics for the ICTV genera (G) including 940 viral genomes , which are achieved by the Markov clustering (MCL) algorithm at each inflation factor (v1.0)as well as ClusterONE (CL1) and CL1 followed by distance-based hierarchical clustering (CL1 + H) (v2.0), respectively.For more objective comparisons, MCL at an inflation factor of 7.0 followed by hierarchical clustering (IF 7.0 + H) with the same distance (i.e., 9.0) used for v2.0 was included.The total height and number of each bar indicate the composite score for overall performance comparison (Fig.2a).For details, see Online Methods.(b, c) Gene-sharing networks were built using 2,304 archaeal and bacterial virus genomes retrieved from Viral RefSeq v85.Viral clusters (VCs) were obtained by vConTACT v1.0(b) and v2.0 (c) that used MCL with inflation factor (IF) of 7.0 and CL1, respectively (see Online Methods).For both networks, genomes (nodes) are color-coded according to their taxonomic assignments.For example, genomes (only members of the ICTV-recognized genus) that are classified in VCs containing a single ICTV genus are colored in cyan, while genomes found in VCs containing more than two genera are colored in pink.Genomes without ICTV genus affiliation are in grey.Nodes with bold borders indicate those that were correctly identified either as outlier, overlap genomes or separate VCs through v1.0 (b), compared to v2.0 (c).Genomes whose taxonomic assignments and/or annotation are incomplete are colored in yellow, identified through v2.0 (c).For details, see Supplementary Figs3 and 5and Supplementary Table2.(d) Box plots of the percentage of shared protein clusters (PCs) between member viruses within 28 v1.0-generated undersampled VCs having ≥2 genera before (pink), and after (cyan) removal of outlier and/or separation into individual clusters by v2.0.(e) Pie charts depicting the number of overlapping genomes that belong to the high (HGCF) or low (LGCF) gene content flux evolutionary modes or mixed and lytic or temperate phages.Data on the lifestyle and evolutionary modes of 74 viruses were collected from Mavrich and Hatfull22 .For details, see Supplementary Fig.3.Figure3.Application of the hierarchical decomposition to discordant VCs.(a) Distribution of all 31 discordant VCs across the archaeal and bacterial virus gene sharing network, where genomes (nodes) of the given VCs are highlighted in pink and others in grey.(b) Box plots show the fraction (%) of protein clusters (PCs) that were shared within an ICTV genus (i.e., intra-genus proteome similarity) and between multiple genera (i.e., inter-genera similarity) found in each discordant VC including structured clusters whose member genera have similar inter-genera and intra-genus similarities (black dot).(c) Left, A full link dendrogram is represented.Note that the Euclidean distance of nine yielded the highest composite score of accuracy (Acc) and clustering-wise separation (Sep) for sub-clusters from all v2.0-generatedVCs, which was used to split the discordant clusters (Online Methods and Supplementary Fig.4).