Egoviruses: distant relatives of poxviruses abundant in the gut microbiomes of humans and animals worldwide

Large and giant double-stranded DNA viruses within the phylum Nucleocytoviricota are diverse and prevalent in the environment, where they substantially affect the ecology and evolution of eukaryotes 1–4. Until now, these viruses were only sporadically found in the digestive system of vertebrates 5–7. Here, we present the identification and genomic characterization of a proposed third order of viruses within the class Pokkesviricetes, which currently consists of poxviruses and asfuviruses 8. Members of this newly identified order, which we provisionally named “Egovirales”, are commonly found in the digestive system of vertebrates worldwide and are abundant in >10% of livestock animals, >2% of humans, and wild animals. Egoviruses have linear genomes up to 467 kbp in length and likely form multilayered icosahedral capsids, similar to those of asfuviruses. However, phylogenetic analysis of conserved viral genes indicates that egoviruses are the sister group of poxviruses, with implications for capsid evolution. The diversity of egoviruses already far exceeds that of all known poxviruses and animal-associated asfuviruses. Phylogenetic analyses and patterns of virus distribution across vertebrates suggest that egoviruses can be either specialists or generalists associated with a single or multiple vertebrate species, respectively. Notably, one egovirus clade is human-specific, evolutionarily constrained, and spread across continents, demonstrating a long-term association between Egovirales and the human population on the global scale. Egoviruses not only expand the ecological and evolutionary scope of Pokkesviricetes, but also represent the only diverse, widespread, and abundant group of double-stranded DNA viruses infecting eukaryotic cells in the digestive system of vertebrates.


Introduction:
Pokkesviricetes is a class of eukaryotic viruses with large, linear double-stranded DNA genomes within the phylum Nucleocytoviricota. The class is divided into two orders, Chitovirales (poxviruses) and Asfuvirales 8. Poxviruses infect a broad diversity of animals including humans 9,10. They include the now eradicated smallpox virus, monkeypox viruses that continue to sporadically infect humans, molluscum contagiosum virus causing benign tumors in immunocompromised individuals, as well as vaccinia virus, a classic model of molecular virology 11–14. Asfuviruses infect hosts ranging from unicellular eukaryotes to animals 15 and include the African swine fever virus, which causes deadly outbreaks in pigs and boars 16, as well as a wider range of viruses mostly characterized in co-culture with free-living amoeba 17,18 or directly from the environment (e.g., within plankton) using genome-resolved metagenomics 7,19,20. Aside from the few viruses of global concern, as declared by the World Health Organization, the class Pokkesviricetes is of notable relevance to our understanding of the diversity, ecology, and evolution of eukaryotic DNA viruses 21,22.
Decades of investigations using cultivation, PCR surveys, serology, and metagenomics have indicated that viruses of the phylum Nucleocytoviricota are mainly transient and thus rarely abundant in the digestive system of humans and other animals (the gut microbiome) 5–7. Until now, not a single Nucleocytoviricota clade had been found to be consistently associated with or prevalent in the gut microbiome, in sharp contrast with the broad distribution and remarkable diversity and abundance of these viruses in the environment. The scarcity of eukaryotic double-stranded DNA viruses in the gut 5–7 could in part be explained by the fact that this environment is overwhelmingly dominated by bacteria and bacteriophages, with the eukaryotic component of the normal gut microbiome being far less diverse 23–26. Overall, it has been widely assumed that Nucleocytoviricota viruses are not prevalent in the gut microbiome due to the limited number of eukaryotic cells they would be able to infect in these ecosystems.
Here, we explored the global diversity and distribution of Pokkesviricetes by surveying publicly available metagenomic assemblies from a broad range of environmental samples and animal-associated microbiomes. We discovered a third Pokkesviricetes order (proposed "Egovirales") comprising numerous viruses that appear to be almost exclusively found in the gut microbiomes of humans, livestock, and wild animals. Notably, despite the dominance of bacteria in this environment, we found these viruses to be abundant in more than 2% of the human population. Egoviruses have large, complex genomes and represent some of the most abundant eukaryotic DNA viruses naturally occurring in vertebrates.

Results and discussion
"Egovirales": a third Pokkesviricetes order DNA-dependent RNA polymerases are evolutionarily informative gene markers present in most viruses of the phylum Nucleocytoviricota, including all Pokkesviricetes 8,27 .We used a comprehensive hidden Markov model (HMM) for the largest RNA polymerase subunit A (RNApolA) 27,28 to search 85,123 metagenomic assemblies covering a wide range of environmental samples as well as thousands of samples from the digestive system of humans (e.g., [29][30][31][32][33][34][35] ), livestock and wild animals (e.g., [36][37][38][39][40][41][42][43][44] ) (Table S1).This search identified 1,969,342 RNApolA genes, which we linked to main cellular and viral taxonomic lineages using a custom reference database (see Methods).Among the Pokkesviricetes, we identified a single poxvirus most closely related to Eptesipox occurring in the gut microbiome of a marsupial carnivore from Australia 36 as well as 240 asfuviruses occurring only in the environment, predominantly in aquatic ecosystems (Table S2).These results are in line with our current knowledge on the global niche partitioning of Pokkesviricetes and contribute, alongside previous metagenomic surveys 7,20 , to the rapid expansion of the known diversity of asfuviruses in the environment.In addition to poxviruses and asfuviruses, we identified a clade of previously undescribed DNA viruses related to these two viral orders that were prevalent in the gut microbiome and which we provisionally named egoviruses.We characterized 13 complete, nonredundant egovirus genomes (average nucleotide identity <98%) from the gut microbiomes of humans, buffalos, goats, chicken, deer, and pigs.These linear genomes are 160-360 kbp in length and contain 55 bp inverted terminal repeats, a feature characteristic of Pokkesviricetes [45][46][47] .The egovirus genomes encompass all eight Nucleocytoviricota hallmark genes 8,27 (Table S3), with the DNA polymerase B coding region often split into two separate open reading 
frames.We built an HMM for the Egovirales double jelly-roll major capsid protein (DJR-MCP) to complement the RNApolA survey and expand the egovirus signal across metagenomes.Overall, we identified genes corresponding to the RNApolA and/or DJR-MCP of Egovirales in 864 metagenomic assemblies, mainly from gut microbiomes of a wide diversity of vertebrates (Table S3).Using these two marker genes as guidance, we recovered a total of 224 metagenomeassembled genomes (MAGs) when including the 13 complete genomes.Egovirales MAGs display an average completion of 81% (minimal quality score of 50%; see Methods) and are up to 467 kbp in length (average of 229 kbp).Phylogenetic analysis of manually curated hallmark genes confidently identified the egovirus MAGs as a distinct major clade within the class Pokkesviricetes that is most closely related to poxviruses (Figures 1 and  S1).Based on these findings, egoviruses should become a third order within Pokkesviricetes that we propose to name "Egovirales".Egovirales is the only clade within the phylum Nucleocytoviricota found to be consistently present in the gut microbiome.

A distinct functional layout of egovirus genomes
The 13 complete Egovirales genomes range from 21% to 42% in GC-content and encompass between 120 and 294 predicted genes (Table S3). We built a genomic database covering Egovirales, Asfuvirales, Chitovirales as well as representatives of the class Megaviricetes 8, and explored clusters of homologous genes based on amino acid sequence similarity to characterize the gene pool of egoviruses (see Methods). We identified 58 core genes occurring in at least 10 of the 13 complete Egovirales genomes (>75% occurrence), with the majority being present in single copy (Figure 1 and Table S4). Aside from the Nucleocytoviricota hallmark genes, most of the functionally annotated core genes are implicated in genome replication, transcription and associated processes such as DNA repair, precursor synthesis and transcript maturation (various DNA and RNA helicases, endonucleases, DNA topoisomerase type II, thymidine kinase, RuvC-like Holliday junction resolvase, mRNA capping enzyme), transcription regulation, protein degradation (multiple proteases including a metacaspase), and post-translational modification of proteins (ubiquitin-protein ligase and deubiquitinylases). These core genes represent key functionalities of the viruses in the phylum Nucleocytoviricota that in particular are represented in most members of the orders Asfuvirales, Chitovirales and the class Megaviricetes 7,20,28 (Table S4). Apart from the genes that could be functionally annotated, 18 core genes could not be assigned any function through sequence comparisons. However, structural modeling with ColabFold 48 followed by DALI searches against the protein structure database (PDB) allowed functional predictions for three additional core proteins. Notably, one of these was found to represent a histone doublet, likely implicated in genome condensation into nucleosome-like structures. Histones, including histone doublets, have been previously described in some members of the Nucleocytoviricota, in particular, marseilleviruses 49, but not in members of Pokkesviricetes. The remaining core genes are either exclusive to Egovirales or extremely divergent from putative orthologs in other clades of the phylum Nucleocytoviricota (Table S4). Overall, the 58 Egovirales core genes recapitulate the core functionalities of Nucleocytoviricota (especially regarding DNA replication and transcription) and Pokkesviricetes in particular, while also including uncharacterized functions that could reflect a unique lifestyle adapted to the conditions of vertebrate digestive systems.
Analysis of the virion morphogenesis module sheds light on the key features of the egovirus virions. Egoviruses encode the major components of the virion morphogenetic module conserved across Nucleocytoviricota, including poxviruses and asfuviruses (Figure 1). The core components of this module are the double jelly-roll (DJR) major capsid protein 50 (MCP; Figure 2), the A32-like genome packaging ATPase, and the Ulp1-like capsid maturation protease. The DJR-MCP characteristic of Nucleocytoviricota was particularly challenging to identify in Egovirales due to its extensive protein sequence divergence compared to the closest reference proteins. Our initial searches using a dedicated HMM, which successfully recovered DJR-MCPs from all previously described families of the phylum Nucleocytoviricota 27,28, systematically failed to recognize this protein in egoviruses. Eventually, HHpred 51 profile-profile comparisons yielded a partial match to the DJR-MCP of African swine fever virus, and structural modeling with ColabFold 48 confirmed that the predicted egovirus MCP has the DJR fold (Figure 2). The Egovirales DJR-MCPs are unusually large (average of 806 amino acids among the complete Egovirales genomes) and the region facing the exterior of the capsid is highly diversified among egoviruses (Figure S2). These trends echo the previous observations on the large MCPs of faustoviruses from the order Asfuvirales 17. However, Egovirales and Chitovirales are firmly established as sister clades in the DJR-MCP phylogeny (Figure S1), despite the substantial protein length differences between the two orders (Figure 2). In addition, egoviruses encode homologs of the major virion proteins exclusive to asfuviruses, namely, structural polyproteins pp220 and pp62. In African swine fever virus, both polyproteins are processed into multiple structural proteins, which play important roles in the formation of multilayered capsids. One of the major proteolytic products of pp220, p150, that is conserved in egoviruses, has an α-helical fold and forms an internal icosahedral shell located inside of a larger, external icosahedral capsid formed from the DJR-MCP 52. So far, the extent of protein sequence diversification among egoviruses prevented the prediction of the high-confidence 3D structure for the p150 homolog. Structure-based searches allowed identification of the homologs of asfuviral core shell proteins p15 and p35, two of the three proteolytic products of pp62 polyprotein. The p15 and p35 bridge the viral nucleoprotein complex and the internal capsid to the lipid membrane sandwiched between the inner and outer icosahedral shells 53,54. Collectively, the presence of these three structural proteins suggests that egoviruses form icosahedral multilayered capsids that are similar to those of Asfuvirales 52, but differ from the brick-shaped capsids of poxviruses 55.

Egoviruses are prevalent in the gut microbiome of humans and other vertebrates
We used all the RNApolA and/or DJR-MCP genes of Egovirales found among the 864 metagenomic assemblies to perform a global survey of the occurrence and diversity of egoviruses (Table S3). With the RNApolA phylogeny, only 9 egovirus-like RNApolA genes were retrieved from environmental samples, including an aquifer 56, a cold seep 57, and marine sediments 58 (Figure 3). These formed four small clades in the tree, two of which (from marine sediments) were the most basal branches for Egovirales. In contrast, all the remaining RNApolA genes (>97% of the data) originated from gut microbiomes representing the bulk of Egovirales diversity. The DJR-MCP phylogeny recapitulated these trends, and in this case, >99% of the data originated from gut microbiomes (Figure S3). Among the 864 Egovirales-positive samples, egoviruses were detected in humans of various ages and geographic locations (n=217; positivity rate of 2.2%), with a notable increase of signal among hunter-gatherers from Tanzania (the Hadza; positivity rate of 5.1%) (Figure 4). The extensive sequencing efforts in the study of Hadza populations 29 could in part explain this enhanced signal. Nevertheless, some of the human-associated clades were either exclusively or mostly detected among the Hadza (Figure 3 and Figure S3), suggesting that lifestyle could influence the diversity of human-associated egoviruses. Egoviruses were also well represented in the gut of pigs (n=225; positivity rate of 11%), buffalos (n=193; positivity rate of 27.2%), goats (n=95; positivity rate of 15.4%), cows (n=44; positivity rate of 8.4%), chickens (n=38; positivity rate of 11.2%), and to a lesser extent, also in other livestock for which fewer samples were available (deer, yak, sheep, zebu, reindeer) (Tables S3 and S5). Finally, even fewer metagenomes were available for microbiomes of wild animals. Nevertheless, bamboo rats (rodents), a golden bamboo lemur, an olive baboon, and a white rhinoceros were found to be additional carriers of egoviruses, indicating that primates and other wild mammals also represent a considerable reservoir of egovirus diversity.
Buffalos had by far the highest Egovirales positivity rate among the surveyed vertebrate species, with viruses covering a wide range of distantly related clades (Figure 3). The considered cohorts cover seven Chinese provinces and include three buffalo breeds and different segments of their digestive system 37, allowing for a more refined survey of the occurrence of egoviruses within the gut microbiome of an herbivorous animal species (Table S6). First, we found up to four distinct egoviruses co-occurring in the same sample, contrasting with the human gut microbiome, where no more than one egovirus was detected in the same individual. Second, the four ruminant stomach compartments displayed the highest prevalence of egoviruses: 79% of positive samples for the reticulum, 45% for the abomasum, 38% for the omasum, and 20% for the rumen. Egoviruses were also detected in the cecum (19%) and distal colon (10%). Finally, the positivity rate reached 25.8% among fecal samples. Based on the fecal samples, egoviruses were detected in all seven Chinese provinces with a positivity rate ranging from 6.7% to 52%, demonstrating the widespread occurrence of Egovirales in the buffalo gut microbiome.
The dispersal of Egovirales clearly indicates that clades of closely related egoviruses tend to occur in the same animal species. This was most apparent for the two main clades associated with humans and pigs (Figures 3 and S3). We characterized the main human clade in 138 metagenomic assemblies from 18 studies that cover various countries from multiple continents, including Tanzania (the Hadza), Nepal, Fiji, USA, England, Sweden, Denmark, Germany, Italy, and Spain (Table S3). This clade alone appears to be abundant in more than 1% of the human population. We found that this clade could (1) be associated with various stages of the human lifespan (e.g., in a 3-year-old child and a pregnant woman) and (2) remain abundant over time in longitudinal studies of the same individuals (up to nearly one year of sampling). Average amino acid identity was 99.4% among the RNApolA genes and 99.9% among the MCP genes, demonstrating a striking genomic stability of this virus clade among humans at a global scale. The main pig clade was found in 109 metagenomic assemblies from 4 studies that also cover countries from multiple continents. Average amino acid identity was 96.7% among the RNApolA genes and 95.4% among the MCP genes, indicating a more diverse clade compared to the one associated with humans. These results demonstrate that closely related egoviruses forming distinct clades can cover very large geographic regions while being strictly associated with a single vertebrate species. In addition, we found instances of closely related egoviruses occurring in multiple vertebrate species. Examples include goat-deer, pig-cow, and deer-buffalo pairs (Figures 3 and S3). Overall, the results demonstrate the worldwide spread and strong preference of egoviruses towards one or multiple vertebrate species. This congruence between the phylogenetic signal and distribution patterns of egoviruses appears to reflect coevolution with the vertebrate hosts (direct or indirect) and echoes trends observed with well-studied animal-infecting viruses, such as the poxviruses 59, papillomaviruses 60, and foamy viruses 61.

Unicellular eukaryotes in Egovirales-positive metagenomic assemblies
Finally, we assessed whether the detection of egoviruses in the gut samples coincides with the presence of any particular group of unicellular eukaryotes, which could serve as egovirus hosts. We identified eukaryotic RNApolA genes in 225 out of the 864 Egovirales-positive samples. Two clades accounted for nearly 95% of this signal: the genus Blastocystis (a Stramenopile) and a larger group of ciliates (Figure 3; Table S7). Blastocystis is highly prevalent in the digestive system of humans 62–64 and various animals 65,66, whereas ciliates are prevalent in the rumen microbiome of herbivorous mammals (e.g., cow, sheep, buffalo) 67–70 and can also occur in other vertebrates such as pigs 71,72. The limited eukaryotic diversity co-detected with egoviruses across metagenomic assemblies suggests that Blastocystis and ciliates are the only two unicellular eukaryotic host candidates. Blastocystis was detected in 137 of the 864 Egovirales-positive samples, mainly in humans (n=65), pigs (n=52) and chickens (n=14). In addition, we only detected the ciliates in 87 out of 864 Egovirales-positive samples, mainly in pigs (n=40), cows (n=20) and buffalos (n=11). The correlation between the mean coverage of egoviruses and Blastocystis, ciliates, or the remaining eukaryotic signal found across all 864 Egovirales-positive samples was weak (Figure 4 and Table S8). In one striking example, the metagenome with the highest mean coverage for an egovirus (nearly 500X coverage in a chicken fecal sample) had no eukaryotic signal co-detected by means of our RNApolA survey. Overall, at the scale of all vertebrate species (Figures 3 and 4; Tables S5, S7 and S8), no clear patterns could be found among the metagenomic assemblies that would support one unicellular eukaryotic clade as being the prime host for egoviruses.

Discussion
Our global metagenomic survey of viruses of the class Pokkesviricetes (phylum Nucleocytoviricota) led to the discovery of a putative third order, "Egovirales", characterized by linear genomes that can reach 467 kbp in length and that occur mostly in the gut microbiomes of humans, livestock, and wild animals. Egoviruses have a highly divergent major capsid protein (DJR-MCP) and likely produce viral particles with an icosahedral multilayered structure resembling the virions of African swine fever virus. Egoviruses not only fill major gaps in our knowledge of the ecology and evolution of Pokkesviricetes, but also represent the only group of viruses with large double-stranded DNA genomes discovered so far that is consistently detected in human and animal gut microbiomes. Given its evolutionary trajectory and distribution patterns, this previously overlooked major group of viruses with complex genomes is of prime interest to human and veterinary virology.
Until now, the gut microbiome has been considered to encompass a limited diversity and abundance of eukaryotic DNA viruses 5–7, with the most prevalent ones, such as anelloviruses 73, having small, single-stranded DNA genomes of only a few kilobases. Egoviruses with their large genomes are changing this view, being projected from the overall metagenomic signal to be abundant in the digestive system of ~180 million humans and billions of livestock and wild animals worldwide. These viruses were discovered by metagenome analysis, an approach that only targets the most abundant viral genomes, and were detected by this method alone in more than 2% of the analyzed human gut microbiomes and in even higher fractions of many animal microbiomes. In particular, a substantial proportion (possibly, a majority) of the ~200 million buffalos contain one or multiple egovirus species thriving in their stomach compartments. The diversity of egoviruses was found to be particularly high among buffalos and is likely far from being fully characterized in most vertebrate species, including humans. In addition, longitudinal studies indicate that egoviruses can remain abundant over long periods of time in the gut of the same human individual. Thus, egoviruses are diverse, widespread, and resilient viral members of the gut microbiome.
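As a rough check on the order of magnitude of the ~180 million figure, the projection can be sketched as the observed human positivity rate multiplied by a world population estimate. The population value below (about 8.1 billion) is an assumption introduced for illustration and is not stated in the text; the calculation also assumes that the surveyed metagenomes are representative of the global population.

```python
# Back-of-the-envelope projection of the number of humans carrying
# abundant egoviruses. Only the positivity rate comes from the survey.
HUMAN_POSITIVITY_RATE = 0.022  # 2.2% of surveyed human gut metagenomes
WORLD_POPULATION = 8.1e9       # assumption: approximate world population


def projected_carriers(rate: float, population: float) -> float:
    """Return the projected number of carriers for a given positivity rate."""
    return rate * population


carriers = projected_carriers(HUMAN_POSITIVITY_RATE, WORLD_POPULATION)
print(f"Projected carriers: ~{carriers / 1e6:.0f} million")
```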
Apart from their likely ecological importance in vertebrate digestive systems, egovirus genome analysis helps to clarify evolutionary events that shaped the diversity and functionality of viruses in the class Pokkesviricetes. First, the evolutionary association between poxviruses and asfuviruses has been debated in part due to long-branch attraction artifacts often associated with poxviruses in phylogenies 27,74, so the discovery of a third order (Egovirales) evolutionarily related to both Asfuvirales and Chitovirales cements the monophyly of the class Pokkesviricetes. Second, egoviruses provide insights into the nature of the differences in the virion structure between the two previously identified orders of Pokkesviricetes, namely, multilayered icosahedral capsids (Asfuvirales) versus brick-shaped capsids (Chitovirales). On the one hand, we found the DJR-MCP of egoviruses to be most closely related to those of poxviruses despite substantial protein length differences and the fact that, in poxviruses, the DJR-MCP is incorporated into virion assembly intermediates but not into the mature virions 75–77. Overall, phylogenetic analysis strongly favors an evolutionary scenario in which Egovirales and Chitovirales evolved from a common ancestor, with Asfuvirales diverging earlier in evolution. On the other hand, asfuviruses and egoviruses are the only Nucleocytoviricota clades known or predicted to harbor an inner MCP. These observations indicate that the last common ancestor of Pokkesviricetes had a multilayered icosahedral capsid that was replaced by brick-shaped particles in the ancestors of poxviruses following their split from egoviruses. Finally, the global distribution of egoviruses in gut samples further emphasizes the strong evolutionary association between viruses of the class Pokkesviricetes and vertebrates. More broadly, an evolutionary and ecological divide is now clearly emerging within the phylum Nucleocytoviricota between the class Megaviricetes that is mainly associated with protist hosts inhabiting diverse environments and the now substantially expanded class Pokkesviricetes enriched in animal-associated viruses.
One critical question is the nature of the host cells infected by the egoviruses. Although metagenomics alone does not offer conclusive answers, there are insights that merit consideration. Our data show that egovirus clades are spread across continents and display a strong preference towards one or multiple vertebrate species, resembling poxviruses in that respect, but not asfuviruses 59. One clade of closely related egoviruses is strictly associated with humans from different parts of the world, suggesting a long-lasting association between Egovirales and humans. Given the remarkable differences in lifestyles among global human populations that are well reflected in the gut microbiome 78,79, the presence of egoviruses in humans from both industrialized and non-industrialized countries makes an exogenous or transient member of the gut microbiome an unlikely host for egoviruses. Thus, it appears likely that egoviruses infect either animal cells or unicellular eukaryotes that broadly reside in the animal digestive systems. Eukaryotic signal in Egovirales-positive metagenomic assemblies suggests two potential unicellular eukaryotic hosts, namely, the evolutionarily constrained Blastocystis genus, and a more diverse group of ciliates. We found that in the guts of humans and pigs, Blastocystis was much more likely to be detected in Egovirales-positive samples compared to Egovirales-negative samples. However, buffalos not only displayed the opposite trend, but also provided a critical perspective, having the highest Egovirales signal among the surveyed animals but a very limited detectable co-occurrence of these viruses with unicellular eukaryotes. Furthermore, the relative abundance of egoviruses was at times substantial but did not significantly correlate with that of Blastocystis, ciliates, or the other detected unicellular eukaryotes. Thus, there are no indications that egoviruses infect unicellular eukaryotes. An alternative and perhaps more likely possibility is that egoviruses infect animal epithelial cells of the gastrointestinal tract that are shed constantly and abundantly from the gut barrier 80,81.

Conclusion
Egoviruses are DNA viruses with large genomes that form a third order-level clade in the class Pokkesviricetes (phylum Nucleocytoviricota), along with asfuviruses (order Asfuvirales) and poxviruses (order Chitovirales) to which egoviruses are most closely related. We propose to classify the egovirus clade as the order "Egovirales". In sharp contrast to both asfuviruses and poxviruses, egoviruses are highly prevalent in the guts of vertebrates and are abundant in more than 2% of the human population. We anticipate that the known diversity of egoviruses will soon far exceed that of all poxviruses and animal-associated asfuviruses combined, transforming our assessment of animal-associated eukaryotic DNA viruses in general and the diversity of the prominent class Pokkesviricetes in particular.

Methods:
Metagenomic database: Metagenomic sequencing datasets were processed as previously described 82,83 to extract single-copy marker genes from metagenome-assembled genomes to build the database for the taxonomic profiling tool mOTUs 84. Briefly, BBMap 85 (v.38.71) was used to quality control sequencing reads from all samples by removing adapters from the reads, removing reads that mapped to quality control sequences (PhiX genome) and discarding low-quality reads (trimq=14, maq=20, maxns=1, and minlength=45). Quality-controlled reads were merged using bbmerge.sh with a minimum overlap of 16 bases, resulting in merged, unmerged paired, and single reads. The reads from metagenomic samples were assembled into contigs using the SPAdes assembler 86 (v3.15.2) in metagenomic mode. Contigs were length-filtered (≥ 500 bp) and gene sequences were predicted using Prodigal 87 (v2.6.3) with the parameters -c -q -m -p meta.
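The contig length filter applied before gene prediction can be illustrated with a minimal sketch. This is not the pipeline used in the study; the function names (`read_fasta`, `filter_contigs`) are hypothetical, and the only value taken from the text is the 500 bp threshold.

```python
from typing import Iterable, Iterator, Tuple

MIN_CONTIG_LENGTH = 500  # bp, threshold stated in the Methods


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(
    records: Iterable[Tuple[str, str]], min_length: int = MIN_CONTIG_LENGTH
) -> Iterator[Tuple[str, str]]:
    """Keep only contigs whose length is at least min_length."""
    for header, sequence in records:
        if len(sequence) >= min_length:
            yield header, sequence


# Toy input: one contig of 600 bp (kept) and one of 120 bp (discarded).
example = [">contig_1", "A" * 300, "C" * 300, ">contig_2", "G" * 120]
kept = list(filter_contigs(read_fasta(example)))
print([header for header, _ in kept])
```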
A first metagenomic survey using the RNApolA: A broad-spectrum HMM profile for the RNApolA hallmark gene 27 was searched against the predicted gene sequences from the metagenomic database using hmmsearch (HMMER 88 v3.3.1) with an e-value cutoff of 1e-50. This search resulted in the identification of nearly two million RNApolA genes. Positive hits were then linked back to their source contig.
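The filtering and contig-linking step can be sketched as follows, assuming the search results are available in HMMER's tabular per-target format (`--tblout`), where the fifth column holds the full-sequence E-value, and assuming gene identifiers follow the Prodigal convention of the contig name followed by `_` and a gene index. The function name `parse_tblout` and the toy rows are hypothetical.

```python
from typing import Iterable, List, Tuple

EVALUE_CUTOFF = 1e-50  # cutoff stated in the Methods


def parse_tblout(
    lines: Iterable[str], evalue_cutoff: float = EVALUE_CUTOFF
) -> List[Tuple[str, str, float]]:
    """Return (gene_id, contig_id, evalue) for hits passing the cutoff."""
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and empty lines
        fields = line.split()
        gene_id = fields[0]        # target name (predicted gene)
        evalue = float(fields[4])  # full-sequence E-value
        if evalue <= evalue_cutoff:
            # Assumption: Prodigal-style IDs, "<contig>_<gene index>"
            contig_id = gene_id.rsplit("_", 1)[0]
            hits.append((gene_id, contig_id, evalue))
    return hits


# Toy rows mimicking the first columns of a --tblout file.
example = [
    "# target name  accession  query name  accession  E-value  score  bias",
    "sampleA_contig_7_3  -  RNApolA  -  1.2e-120  400.1  0.5",
    "sampleA_contig_9_12  -  RNApolA  -  3.0e-20  70.2  0.1",
]
hits = parse_tblout(example)
print(hits)
```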
Taxonomic affiliation of RNApolA sequences. We collected amino acid sequences corresponding to a broad range of reference RNApolA genes for bacterial, archaeal, eukaryotic, and giant virus lineages characterized by means of cultivation and environmental genomics 28,89,90. We generated a custom diamond database 91 (v.2.1.8) from this resource and summarized the taxonomic affiliation of each sequence in a complementary metadata file. We then performed a diamond blast search of RNApolA genes identified from the metagenomic survey using this custom database and exploited the associated metadata file to assess both their taxonomic affiliation and novelty score (based on the percent identity).
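A minimal sketch of this annotation step is given below. It assumes the search results are in the standard 12-column BLAST tabular format (query, subject, percent identity, ..., bit score) and that the metadata file has been loaded as a dictionary mapping reference sequence IDs to taxa. The exact definition of the novelty score is not given in the text; defining it as 100 minus the percent identity of the best hit is an assumption made here for illustration, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def best_hits(rows: Iterable[str]) -> Dict[str, Tuple[str, float, float]]:
    """Keep, for each query, the hit with the highest bit score."""
    best: Dict[str, Tuple[str, float, float]] = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, pident, bitscore)
    return best


def annotate(
    best: Dict[str, Tuple[str, float, float]], taxonomy: Dict[str, str]
) -> Dict[str, Dict[str, object]]:
    """Attach a taxon and a novelty score to each query."""
    annotations = {}
    for query, (subject, pident, _) in best.items():
        annotations[query] = {
            "taxon": taxonomy.get(subject, "unknown"),
            # Assumption: novelty = 100 - percent identity of the best hit
            "novelty": 100.0 - pident,
        }
    return annotations


# Toy data: two hits for one query; the higher bit score wins.
taxonomy = {"ref_pox_1": "Chitovirales", "ref_asf_1": "Asfuvirales"}
rows = [
    "gene_1\tref_asf_1\t40.0\t900\t500\t10\t1\t900\t1\t900\t1e-90\t300.0",
    "gene_1\tref_pox_1\t62.5\t900\t300\t5\t1\t900\t1\t900\t1e-150\t520.0",
]
annotations = annotate(best_hits(rows), taxonomy)
print(annotations)
```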
Phylogenetic signal pointing towards egoviruses. We used cdhit 92 and a 70% identity cut-off at the protein level to remove redundancy among the nearly two million RNApolA genes identified from the metagenomic survey. This step provided a database of 17,293 unique RNApolA proteins longer than 250 amino acids that encapsulate a wide range of cellular and viral diversity recovered from the metagenomic survey. Then, we used the phylogenetic workflow and visualization capabilities of anvi'o 93,94 (v.8) using muscle 95 and fast-tree 96 to identify and remove RNApolA proteins corresponding to bacteria and plastids. Focusing on the 10,853 remaining unique RNApolA proteins, we iteratively used the same anvi'o workflow but this time to identify and remove long-branch artifacts. A total of 13 iterations were performed, after which 8,672 unique RNApolA proteins remained. Using this curated database, we explored the phylogenetic signal of RNApolA proteins using taxonomic annotations and the novelty score for guidance. This led to the identification of a monophyletic clade corresponding to the egoviruses, which is most closely related to asfuviruses and poxviruses. We incorporated these sequences in our custom diamond database and performed a second diamond blast on the nearly two million RNApolA proteins, in order to identify Egovirales RNApolA proteins (percent identity >50%) and study their source contigs. This methodology recovered 13 complete Egovirales genomes.
Functional annotation of Egovirales core genes. We collected reference culture genomes and metagenome-assembled genomes corresponding to Asfuvirales and Chitovirales, as well as a few Megaviricetes representatives from the GVDB collection 8. We added the 13 complete Egovirales genomes into this "Nucleocytoviricota genomic resource" and used OrthoFinder 97 (v.2.5.5 in "diamond_ultra_sens" mode) to characterize gene clusters. Gene clusters occurring in at least 10 out of the 13 Egovirales complete genomes were labelled as "Egovirales core genes", and those occurring as a single copy in at least 10 out of the same 13 genomes were labelled as "Egovirales single copy core genes". Finally, HHpred 51 profile-profile comparisons and ColabFold 48 (3D structure predictions and comparisons) were used to functionally annotate the Egovirales core genes. In the case of the DJR-MCP of Egovirales, ColabFold 48 successfully predicted its 3D structure and confirmed the presence of the double jelly-roll (DJR) protein components characteristic of the phylum Nucleocytoviricota.
Recovery of eight Nucleocytoviricota hallmark genes. We used HMMER 88 v3.1b2 to search and collect the eight hallmark genes of Nucleocytoviricota from the Nucleocytoviricota genomic resource (see "Functional annotation of Egovirales core genes" section). Given that genomes of faustoviruses from the GVDB collection 8 had partial DJR-MCP genes, we replaced those with 7 strains of faustoviruses from the NCBI GenBank collection (strains E9, LCD7, M6, S17, liban, vv10, and vv63; https://www.ncbi.nlm.nih.gov/genbank/; accessed February 2024). We also added the Variola virus isolate VARV V563 genome from the GOEV database 28, and replaced the Megaviricetes representatives initially picked from the GVDB collection with others from the GOEV database 28 based on their distribution as well as the occurrence of hallmark genes. In addition to the RNApolA and DJR-MCP protein sequences already recovered, we obtained sequences for the family-B DNA polymerase (DNApolB), the second largest DNA-dependent RNA polymerase subunit (RNApolB), the transcription factor IIS (TFIIS), the D5-like primase (Primase), the poxvirus late transcription factor (VLTF3), and the packaging ATPase (pATPase). The DNApolB gene, often present on two separate open reading frames in egoviruses, required manual curation through BLASTp alignments (BLAST 98 v2.10.1) and phylogenetic reconstructions, as previously described 28. Briefly, when a single egovirus genome had two sequences for this gene, their positions were inspected in a maximum-likelihood single-protein phylogenetic tree, and their sequences were aligned with BLASTp to a close reference sequence and to each other. The two DNApolB sequences from the same egovirus systematically corresponded to a clear split; they were fused and labelled accordingly for future reference. Finally, we created a database of genomes containing at least seven out of the eight hallmark genes for downstream phylogenetic analyses. For the global phylogenies of the RNApolA and DJR-MCP of egoviruses (Figures 3 and S3), the database was restricted to egoviruses and representatives from the Chitovirales and Asfuvirales for rooting.
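The final selection rule (genomes with at least seven of the eight hallmark genes) can be sketched as follows. This is only an illustration: the input structure, the function name, and the genome identifiers are hypothetical, and the sketch assumes that hallmark detection (e.g., from the HMMER searches) has already been summarized as one set of gene names per genome.

```python
# The eight hallmark genes named in the text
HALLMARKS = (
    "RNApolA", "RNApolB", "DNApolB", "DJR-MCP",
    "TFIIS", "Primase", "VLTF3", "pATPase",
)

def select_genomes(hallmark_hits, min_genes=7):
    """Return genomes carrying at least `min_genes` distinct hallmark genes.

    hallmark_hits: {genome_id: iterable of hallmark gene names detected}
    """
    hallmark_set = set(HALLMARKS)
    return sorted(
        genome for genome, genes in hallmark_hits.items()
        if len(set(genes) & hallmark_set) >= min_genes
    )

# Toy example with hypothetical genome identifiers
hits = {
    "ego_1": set(HALLMARKS),                # 8 of 8
    "ego_2": set(HALLMARKS) - {"TFIIS"},    # 7 of 8
    "ego_3": {"RNApolA", "DJR-MCP"},        # 2 of 8
}
selected = select_genomes(hits)
```

Counting distinct gene names rather than hits means that a fused or duplicated DNApolB contributes only once to the total.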

Alignments, trimming and individual gene phylogenetic analyses.
From the genomic database created in the previous section (see "Functional annotation of Egovirales core genes"), protein sequences corresponding to the same hallmark gene were aligned with MAFFT 99 v.7.464 and default parameters. The L-INS-i algorithm was used. Sites with more than 70% gaps were systematically trimmed using Goalign v0.3.5 (https://www.github.com/evolbioinfo/goalign). The maximum-likelihood phylogenetic reconstructions were performed using IQ-TREE 100 v1.6.2, with the ModelFinder 101 option to determine the best-fitting model. Supports were computed from 1,000 replicates for the Shimodaira-Hasegawa (SH)-like aLRT 102 and UFBoot 103. Supports were deemed good when SH-like aLRT ≥80% and UFBoot ≥95%, as per the IQ-TREE manual, and moderate when only one of the two conditions was met. The few cases of hallmark genes occurring more than once in a genome were investigated and resolved after removing recent paralogs and clear outliers. All phylogenetic trees (Figure S1 for the eight individual gene phylogenies; Figure 3 and Figure S3 for the global phylogenies of the RNApolA and DJR-MCP of egoviruses) were visualized and rooted using anvi'o.
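The trimming rule and the support thresholds described above can be expressed compactly. The sketch below is a plain-Python stand-in written for illustration, not the Goalign or IQ-TREE implementation; the function names, the toy alignment, and the "weak" label for nodes meeting neither threshold are assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.7):
    """Drop alignment columns in which more than `max_gap_fraction` of sequences have a gap.

    alignment: {sequence_name: aligned sequence}, all of equal length
    """
    names = list(alignment)
    n = len(names)
    length = len(alignment[names[0]])
    keep = [
        i for i in range(length)
        if sum(alignment[s][i] == "-" for s in names) / n <= max_gap_fraction
    ]
    return {s: "".join(alignment[s][i] for i in keep) for s in names}

def support_level(sh_alrt, ufboot):
    """Classify a node from its SH-like aLRT and UFBoot values (both in percent)."""
    good_alrt, good_ufboot = sh_alrt >= 80, ufboot >= 95
    if good_alrt and good_ufboot:
        return "good"
    if good_alrt or good_ufboot:
        return "moderate"
    return "weak"  # label assumed here; the text names only "good" and "moderate"

# Toy alignment: column 2 has 75% gaps (dropped), column 3 has 50% gaps (kept)
toy = {"a": "MK-A", "b": "M--A", "c": "M-LC", "d": "M-LA"}
trimmed = trim_gappy_columns(toy)
```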
A second metagenomic survey using the DJR-MCP: We collected DJR-MCPs from the complete Egovirales genomes and used HMMER 88 (v3.1b2) to generate an HMM dedicated to the Egovirales DJR-MCP. This HMM was then searched against all predicted gene sequences from the metagenomic database using hmmsearch (HMMER 88 v.3.3.1) and an e-value cutoff of 1e-10. Positive hits were then linked back to their source contigs, expanding the scope of Egovirales-positive metagenomic samples.
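A minimal sketch of the hit-filtering step is given below. It assumes the hmmsearch results were saved in the per-target tabular format (`--tblout`), in which the fifth whitespace-separated column is the full-sequence E-value, and that gene identifiers follow a `<contig>_<gene number>` naming scheme. The naming scheme, function name, and example identifiers are assumptions, not details stated in the text.

```python
def egovirales_mcp_contigs(tblout_lines, max_evalue=1e-10):
    """Group genes passing the E-value cutoff by their (assumed) source contig."""
    contigs = {}
    for line in tblout_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and empty lines
        fields = line.split()
        gene_id, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue:
            contig = gene_id.rsplit("_", 1)[0]  # assumes "<contig>_<gene number>" IDs
            contigs.setdefault(contig, []).append(gene_id)
    return contigs

# Toy tblout content with made-up identifiers and scores
toy_tblout = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "sampleA_contig12_3 - egoMCP - 2.1e-85 290.1 0.3 2.5e-85 289.9 0.3 1.0 1 0 0 1 1 1 1 -",
    "sampleB_contig7_1  - egoMCP - 3.0e-04  18.2 0.1 4.1e-04  17.8 0.1 1.1 1 0 0 1 1 1 0 -",
]
positive = egovirales_mcp_contigs(toy_tblout)
```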
A database of 224 egovirus MAGs. We analyzed all contigs containing an RNApolA gene or DJR-MCP gene affiliated to Egovirales in the context of their associated bins, which were generated using Metabat2 104 with differential coverage information, as previously described 82,83. We removed contigs devoid of any Egovirales core gene found to be specific to this order (n=42), a stringent step to only work on high-quality viral MAGs. In addition, we removed MAGs with a quality score (completion minus redundancy) below 50% after creating a collection of 51 Egovirales single-copy core genes and processing the database with anvi'o 93,94 (v.8.1) to predict gene sequences and functionally annotate hallmark genes using specific HMMs.
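The quality filter can be illustrated with the sketch below. The text defines the quality score as completion minus redundancy but does not spell out how each term is computed; here, completion is taken as the percentage of the 51 single-copy core genes detected at least once and redundancy as the percentage detected more than once. These definitions, the placeholder gene names, and the function names are assumptions and may differ from the anvi'o calculation.

```python
def quality_score(gene_counts, core_genes):
    """Completion minus redundancy, both in percent of the core gene collection.

    gene_counts: {core_gene_name: number of copies found in the MAG}
    """
    n = len(core_genes)
    present = sum(1 for g in core_genes if gene_counts.get(g, 0) >= 1)
    duplicated = sum(1 for g in core_genes if gene_counts.get(g, 0) >= 2)
    completion = 100.0 * present / n
    redundancy = 100.0 * duplicated / n  # assumed definition
    return completion - redundancy

def keep_mags(mags, core_genes, min_quality=50.0):
    """Return MAGs whose quality score is not below `min_quality`."""
    return sorted(
        name for name, counts in mags.items()
        if quality_score(counts, core_genes) >= min_quality
    )

# 51 placeholder names standing in for the Egovirales single-copy core genes
CORE = [f"scg_{i:02d}" for i in range(1, 52)]

mags = {
    "mag_a": {g: 1 for g in CORE[:45]},  # 45 genes, single copy
    "mag_b": {g: 1 for g in CORE[:20]},  # 20 genes, single copy
    "mag_c": {g: 2 for g in CORE[:40]},  # 40 genes, all duplicated
}
kept = keep_mags(mags, CORE)
```

Under these assumed definitions, a bin in which most core genes are duplicated is penalized as heavily as an incomplete one, which is the intent of subtracting redundancy from completion.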
A phylogeny of Nucleocytoviricota using concatenated gene alignments. In addition to the single-protein phylogenies, we also performed a multi-protein phylogenetic analysis of the concatenated DNApolB, RNApolA, RNApolB, and TFIIS protein sequences. We replaced the few Megaviricetes representatives used as outgroup with a much larger genomic database for the non-Pokkesviricetes nucleocytoviruses: we used gene markers for the DNApolB, RNApolA, RNApolB, and TFIIS corresponding to the classes Megaviricetes (exclusion of Medusavirus to minimize long-branching artifacts) from the GOEV database 28, all of which had already been manually curated. We added to this resource protein sequences for the DNApolB, RNApolA, RNApolB, and TFIIS corresponding to all 224 Egovirales MAGs. For the DNApolB, RNApolA, RNApolB, and TFIIS, only sequences longer than 100, 200, 200, and 25 amino acids, respectively, were kept. This phylogenetic analysis followed the protocol described in the previous section, except that the FFT-NS-i algorithm of MAFFT v7.464 was used for the alignment of each individual protein dataset before the alignments were trimmed and concatenated. The phylogenetic tree (Figure 1) was visualized and rooted using anvi'o.
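The length filter and the concatenation step can be sketched as follows. The per-gene thresholds come from the text; the function names, the gene order in the concatenated alignment, and the choice to fill a missing gene with gap characters are assumptions made for illustration.

```python
# Minimum lengths from the text: sequences must be longer than these values
MIN_LENGTH = {"DNApolB": 100, "RNApolA": 200, "RNApolB": 200, "TFIIS": 25}
GENE_ORDER = ("DNApolB", "RNApolA", "RNApolB", "TFIIS")  # assumed order

def filter_by_length(seqs_by_gene):
    """Keep unaligned sequences strictly longer than the per-gene minimum."""
    return {
        gene: {gid: s for gid, s in seqs.items() if len(s) > MIN_LENGTH[gene]}
        for gene, seqs in seqs_by_gene.items()
    }

def concatenate(alignments):
    """Concatenate trimmed per-gene alignments; a missing gene becomes a run of gaps."""
    genomes = sorted({gid for aln in alignments.values() for gid in aln})
    widths = {
        gene: len(next(iter(aln.values())))
        for gene, aln in alignments.items() if aln
    }
    return {
        gid: "".join(
            alignments[gene].get(gid, "-" * widths[gene])
            for gene in GENE_ORDER if gene in widths
        )
        for gid in genomes
    }

# Toy alignments (far shorter than real ones) with hypothetical genome IDs
toy_alignments = {
    "DNApolB": {"g1": "MKV", "g2": "MRV"},
    "TFIIS": {"g1": "AC"},
}
supermatrix = concatenate(toy_alignments)
```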

Metagenomic coverage values for egoviruses and co-occurring eukaryotes:
We generated a non-redundant database (average nucleotide identity <95%) of egovirus MCP genes, egovirus RNApolA genes, and eukaryotic RNApolA genes characterized from the Egovirales-positive samples. We performed read recruitment of all the Egovirales-positive metagenomic samples against the database. For that, quality-controlled reads were mapped (Bowtie2 105 , v2.4.116) against the database, and the 'anvi-profile-blitz' 93,94 routine used the alignments to generate mean coverage values. Taxonomic information was included using the diamond blast for RNApolA genes (see "Taxonomic affiliation of RNApolA sequences" section) and the phylogenetic tree for MCP genes (see "Alignments, trimming and individual gene phylogenetic analyses" section).
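For readers unfamiliar with the metric, mean coverage of a reference gene is the total number of aligned bases divided by the gene length. The sketch below computes it from simple alignment intervals; it is a simplified stand-in for illustration, not the 'anvi-profile-blitz' implementation, and the input format, function name, and identifiers are assumptions.

```python
from collections import defaultdict

def mean_coverage(ref_lengths, alignments):
    """Mean coverage per reference: aligned bases divided by reference length.

    ref_lengths: {reference_name: length in nucleotides}
    alignments: iterable of (reference_name, start, end), 0-based, end-exclusive
    """
    aligned_bases = defaultdict(int)
    for ref, start, end in alignments:
        aligned_bases[ref] += end - start
    return {ref: aligned_bases[ref] / length for ref, length in ref_lengths.items()}

# Toy example with hypothetical gene names
lengths = {"ego_MCP_1": 100, "euk_RNApolA_1": 200}
reads = [
    ("ego_MCP_1", 0, 50),
    ("ego_MCP_1", 50, 100),
    ("ego_MCP_1", 25, 75),
    ("euk_RNApolA_1", 0, 100),
]
coverage = mean_coverage(lengths, reads)
```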
Data availability: Databases our study used or generated include (1) the RNApolA proteins identified from the metagenomic survey, (2) Egovirales RNApolA and DJR-MCP proteins, (3) reference Nucleocytoviricota HMMs as well as two new HMMs targeting the DJR-MCP and VLTF3 of Egovirales, (4) HMMs for the core genes and single copy core genes of Egovirales, (5) the 224 Egovirales MAGs (including complete genomes), (6) the 3D structure of the Egovirales DJR-MCP, and (7) the non-redundant database of MCP and RNApolA genes covering the diversity of egoviruses and eukaryotes found in the Egovirales-positive samples. These items, along with the supplementary tables (Tables S01-S08), have been made publicly available at https://doi.org/10.6084/m9.figshare.25398895.v7.

Figure 1: Evolution and predicted gene functions of egoviruses. Left panel displays a maximum-likelihood phylogenomic analysis of members of Nucleocytoviricota, along with 202 Egovirales genomes with a quality score above 50%. The tree covers 1,807 genomes and was built using concatenated multiple alignments of four hallmark proteins of Nucleocytoviricota (DNApolB, RNApolA, RNApolB, TFIIS; 4,110 sites) using IQ-TREE with the LG+F+R10 model. The tree was rooted with the putative class Proculviricetes and visualized using anvi'o v8. Green dots at nodes in the tree highlight high supports (approximate likelihood ratio (aLRT) ≥ 80 and ultrafast bootstrap approximation (UFBoot) ≥ 95; see Methods). Right panel summarizes the functional annotation of 58 Egovirales core genes.

Figure 2: The Egovirales major capsid protein. Top panel displays structural modeling of an Egovirales DJR-MCP (human main clade) and comparison to the structures of three reference Pokkesviricetes DJR-MCPs (from left to right: 3sam, 6ku9, 5j7o). Bottom panel displays boxplots summarizing variation in protein length for the DJR-MCPs of egoviruses (complete genomes), faustoviruses, other clades of asfuviruses, and Chitovirales.

Figure 3: Overlay of the evolutionary tree of egoviruses and their occurrence among vertebrate species. The figure displays a maximum-likelihood phylogeny of the RNApolA of egoviruses (n=324), using representatives of Chitovirales and Asfuvirales for rooting (1,357 sites), and the LG+F+R7 model. Layers of information summarize the occurrence of the egovirus genomes >50 kbp in length (including the complete genomes), ecosystem type (environment vs. human vs. other animals), and co-occurrence of Blastocystis and ciliates. Finally, selections were made for egovirus clades occurring in the same animal species.

Figure 4: Prevalence of egoviruses and co-occurring unicellular eukaryotes. Top panel summarizes the positivity rate for egoviruses, Blastocystis and ciliates in gut metagenomes from five vertebrate species. Bottom panel displays the mean coverage correlation between egoviruses and clades of unicellular eukaryotes across metagenomes corresponding to the 864 Egovirales-positive samples.