An endosymbiont harvest: Phylogenomic analysis of Wolbachia genomes from the Darwin Tree of Life biodiversity genomics project

The Darwin Tree of Life project aims to sequence all described terrestrial and aquatic eukaryotic species found in Britain and Ireland. Reference genome sequences are generated from single individuals for each target species. In addition to the target genome, sequenced samples often contain genetic material from microbiomes, endosymbionts, parasites and other cobionts. Wolbachia endosymbiotic bacteria are found in a diversity of terrestrial arthropods and nematodes, with supergroups A and B the most common in insects. We identified and assembled 110 complete Wolbachia genomes from 93 host species spanning 92 families by filtering data from 368 insect species generated by the Darwin Tree of Life project. From 15 infected species we assembled more than one Wolbachia genome, including cases where individuals carried simultaneous supergroup A and B infections. Different insect orders had distinct patterns of infection, with Lepidopteran hosts mostly infected with supergroup B, while infections in Diptera and Hymenoptera were dominated by A-type Wolbachia. Other than these large-scale order-level associations, host and Wolbachia phylogenies revealed no (or very limited) cophylogeny. This points to the occurrence of frequent host switching events, including between insect orders, in the evolutionary history of the Wolbachia pandemic. While supergroup A and B genomes had distinct GC% and GC skew, and B genomes had a larger core gene set and tended to be longer, it was the abundance of active and pseudogenised copies of bacteriophage WO who was a strong determinant of Wolbachia genome size. Mining raw genome data generated for reference genome assemblies is a robust way of identifying and analysing cobiont genomes and giving greater ecological context for their hosts.


Introduction
The natural world is a complex web of interactions between living species.These interactions can be mutualistic, commensal, pathogenic, parasitic, predatory or inconsequential, but each individual lives alongside a rich diversity of cobionts.Most eukaryotes associate intimately with a specific microbiota, and are commonly infected by a range of microbial and other pathogens.For some microbial associates the distinction between mutualism and pathogenicity or parasitism is fuzzy.For example, Wolbachia (Proteobacteria; Alphaproteobacteria; Rickettsiales; Anaplasmataceae; Wolbachieae) are found living intracellularly in a range of terrestrial arthropods and nematodes.No free-living Wolbachia are known: the association is essential for their survival.In contrast, infection with Wolbachia can be beneficial to hosts, but is not usually essential.
Wolbachia were first identified as mosquito endobacteria that were maternally transmitted, through the oocyte, and that induced a range of reproductive manipulations on their hosts 1,2 .The most common manipulation by Wolbachia is to induce cytoplasmic incompatibility (CI).Under CI, infected females are able to mate productively with all males, but uninfected females are only able to mate with uninfected males (as mating with CI-inducing Wolbacha-infected males results in zygotic death).This asymmetry in fitness can drive spread of the CI-inducing Wolbachia.Other reproductive manipulations include feminisation of genetic males 3 , male killing 4 and induction of parthenogenesis in females 5 .All these manipulations promote the transmission of infected oocytes to the next host generation, and thus boost the spread of Wolbachia.In most species that can be infected, populations are a mix of infected and infection-free individuals, and hosts can evolve to resist infection 6,7 .While Wolbachia are often described as reproductive parasites, association with Wolbachia can sometimes have beneficial effects, providing nutritional supplementation to phloem-feeding hemiptera 8 , and enhancing host immunity to viruses and Plasmodium parasites 9 .Indeed the host immunityboosting phenotype may explain the initial spread of Wolbachia in previously uninfected populations.In nematodes, elimination of Wolbachia induces host sterility, and antibiotic treatment is an effective addition to pharmacological treatment of human-infecting, Wolbachiapositive filarial nematodes 10 .
Wolbachia infection of terrestrial arthropods is very common, with nearly half of all insect species predicted to be infected 11 .Wolbachia can be classified using molecular phylogenetic analyses into a series of supergroups 12,13 .Supergroups C, D and J are found only in filarial nematodes, supergroups E and F are found in both nematodes and insects, and supergroups A, B and S (and others for which full genome data are not available) are found only in arthropods.Supergroups A and B are the most common Wolbachia found in terrestrial insects.
Analysis of Wolbachia biology has been expanded by the determination of genome sequences for many isolates.The genome sequences for Wolbachia from over 90 host species are publicly available, and mining of host genomic raw sequence data identified a large number of additional partial genomes 14 .This understanding, that cobiont genomes can be assembled from the "contamination" present in the data generated for a target host, has been especially useful for the unculturable Wolbachia.We now have the opportunity to survey for the presence of Wolbachia genomes at an unprecedented scale, as the Darwin Tree of Life (DToL) project aims to sequence all described terrestrial and aquatic eukaryotic species found in Britain and Ireland 15 .This project is using high-accuracy long read and chromatin conformation long range sequencing to generate and release publicly available chromosomal genome assemblies, meeting exact standards of contiguity and completeness, for thousands of protists, fungi, plants and animals.Several hundred terrestrial arthropod assemblies are already available (https://portal.darwintreeoflife.org).
The DToL project sequences genomes from individual, wild-caught specimens of target species, and thus will also generate data for the cobiome present in each specimen at the time of sampling.Where possible, DToL processing usually avoids body parts or tissues that are expected to have a high relative mass of cobionts.In smaller-bodied species, where the whole organism is extracted, and in cases where Wolbachia disseminates widely within an organism it is inevitable that cobiont genomes will be sequenced alongside the host genome.
Using k-mer classification tools, it is possible to efficiently and correctly separate out cobiont data from that of the host, and to deliver clean host assemblies [16][17][18] .The cobiont data are then available for independent assembly and analysis.Here we present a survey of the first 368 terrestrial arthropod genome datasets produced in DToL for the presence of Wolbachia, and assemble over 100 new Wolbachia genomes.We use these to explore patterns and processes in bacterial genome evolution and coevolution of Wolbachia with its hosts and with its own bacteriophage parasites.Lepidopteran hosts were mostly infected with supergroup B, while infections in Diptera and Hymenoptera were mainly caused by A-type Wolbachia.However, host and Wolbachia phylogenies revealed no (or very limited) cophylogeny.We show that while B genomes tended to be longer compared to supergroup A, genome size in Wolbachia is correlated with the level of integration of its double-stranded bacteriophage WO.

Screening a diverse set of insect genome data for Wolbachia infections
We screened raw genomic sequence data and primary assemblies for 368 insect species (204 Lepidoptera, 61 Diptera, 52 Hymenoptera, 24 Coleoptera, 9 Hemiptera, 5 Trichoptera, 4 Orthoptera, 3 Ephemeroptera, 3 Plecoptera, 2 Odonata and 1 Neuroptera) generated by DToL for the presence of Wolbachia (Table S1) using the small subunit ribosomal RNA (SSU rRNA) as a marker gene.Wolbachia SSU sequences were detected in 111 (30%) of the species.This degree of infection is similar to previous estimates (ranging from 22% 19,20 to 40% 11 of all arthropods).While the DToL project aims to sequence eukaryotes from across Britain and Ireland, 82% of the samples screened were sampled from the Wytham Woods Ecological Observatory, Oxfordshire (https://www.wythamwoods.ox.ac.uk/) 21.No correlation between sampling location and incidence level was detected, with 29% of all samples collected in Wytham Woods being Wolbachia positive, reflective of the overall incidence level (Figure S1).
Wolbachia prevalence and infection intensity varies between species and between populations within a species 22,23 .As only one individual was analysed for each taxon screened, the true level of infection within the insect biota surveyed by DToL is likely much higher.Incidence was lower in Coleoptera (4/24, 17%) compared to Lepidoptera (55/204, 27%), Diptera (21/61, 34%) and Hymenoptera (23/52, 44%) (Figure 1A).We observed an equal prevalence of infection in samples identified as female (39/138, 28%) and male (45/153, 29%) (Figure 1B).The DToL species were sequenced using PacBio Sequel II HiFi highly accurate long read platform, generating consensus raw reads of 10-20 kb with base level accuracy of >99% (~Q30-40).These long, accurate reads are ideal for assembly, particularly for bacterial genomes where the information content per base is higher than in repeat-rich eukaryotes.The average sequence length of HiFi reads identified as being derived from Wolbachia was 12 kb, indistinguishable from host HiFi reads.We separated and assembled all Wolbachia reads in each positive sample and screened these assemblies to identify complete genomes.We generated 110 complete genomes, from 93 species, of which 77 were circular (Table S2).The average completeness of these genomes, assessed using BUSCO, was 99.3%, with a mean duplication level of 0.37%.The mean genome size of the new genomes was 1.47 Mb, which is significantly larger than the average genome size of public Wolbachia genomes (1.32 Mb; Wilcoxon rank sum test, p-value = 4.576 x 10 -9 ) (Figure S2).This is likely because it is possible to assemble across repeated loci (such as integrated Wolbachia phage) with the long, accurate HiFi reads.The mean number of contigs generated for the 33 genomes that could not be circularised was 2.12 (ranging from 1 to 6).
The dataset includes the first Wolbachia genomes assembled from two insect orders, Odonata (dragonflies and damselflies) and Orthoptera (grasshoppers and crickets).Both species of dragonfly surveyed (Odonata) harboured Wolbachia (Figure 1A).The largest circular Wolbachia genome generated, 2.19 Mb, was isolated from the blue-tailed damselfly.This is the longest complete Wolbachia genome yet reported (Figure 5A).Although in most samples infection by only a single Wolbachia strain was detected, 15 of 93 specimens (16%) were infected with at least two Wolbachia genomes.Within Phalera bucephala (Lepidoptera) and Lasioglossum morio (Hymenoptera) three genomes were assembled, while all other coinfections involved two strains.
Having chromosomally-complete insect host genomes, as well as complete Wolbachia allows for the estimation of the relative numbers of Wolbachia genomes per host genome.Most infected hosts tightly control Wolbachia proliferation and have a relative abundance below ten Wolbachia genomes per host nuclear genome.Particularly high abundances were observed in Thymelicus sylvestris (48 Wolbachia per host) and Athalia cordata (47) (Table S2) (Figure 1C).The mean relative abundance in different taxonomic orders lay between 3 and 12, except for the two crickets (Orthoptera), Chorthippus brunneus and Chorthippus parallelus, which have a 33 and 20 Wolbachia genome copies per host genome, respectively (Figure 1C).No significant difference was observed between relative Wolbachia abundance and sex of the host (Figure 1D), with both male and female having a mean between nine and ten copies.A, B Prevalence of Wolbachia in insect hosts, split by taxonomic order (A) and by sex (B).The cladogram of insect ordinal relationships is based on Misof et al 24 .Orders with more than 10 analysed species are shown in bold.Silhouettes are from PhyloPic (http://phylopic.org/).Sex of insects was classified as F (female), M (male) or U (unknown, where not recorded on collection).

C, D
The estimated number of Wolbachia genomes per copy of the host nuclear genome split by taxonomic order (C) and by sex (D).

Wolbachia phylogeny suggests frequent host switching events
We selected 93 high-contiguity and high-completeness Wolbachia genomes from the public INSDC databases, including genomes from Wolbachia infecting Nematoda (13 genomes), Arachnida (4), Isopoda (1) and several orders of Hexapoda (75) (Table S3).Adding the 110 newly assembled genomes yielded a dataset of over 200 high-quality assemblies.We annotated all protein-coding genes in those genomes using Prodigal 25 , and clustered the predicted protein sets into orthologous groups using OrthoFinder2 26 .The resulting 634 nearsingle copy genes were used to infer a phylogeny of Wolbachia (Figure 2A, Figure S3).From this phylogeny we assigned each genome to the previously defined Wolbachia supergroups 12,13 .All newly assembled Wolbachia genomes belonged to either supergroup A or B. While Lepidoptera were predominantly infected with supergroup B Wolbachia (42/53, 80%), Wolbachia supergroup A was most frequent in all other insect classes (46/57, 81%).It has been previously observed that supergroup B is the most common Wolbachia type in Lepidoptera 19,[27][28][29] .Of the 15 species where co-infections occurred, Endotricha flammealis, Phalera bucephala, Philonthus cognatus, Protocalliphora azurea and Sphaerophoria taeniata were co-infected with strains from both A and B supergroups, and the other ten co-infections were of distinct strains within the same supergroup (Table S2).
Wolbachia generally do not show strict cophylogeny with their hosts 7,23 .This pattern was also observed when comparing host and Wolbachia phylogenies for the supergroup A and B genomes (Figure 2B).Closely related insect species may be infected by dissimilar Wolbachia strains and conversely, closely related Wolbachia can infect a diverse set of insects.For example, the Wolbachia strains infecting the hoverfly Eupeodes latifasciatus and four Lepidoptera (Pararge aegeria, Celastrina argiolus, Hylaea fasciara, Watsonella binaria) (Figure 2C) share over 99% nucleotide identity.Because most of our new samples came from a single site (Wytham Woods Genomic Observatory) we were also able to explore the horizontal transfer of Wolbachia between hosts in a local context.Wytham Woods-derived Wolbachia were no more likely to be related than any other Wolbachia subset (Figure S4).

Figure 2: Wolbachia DToL genomes expand known phylogeny.
A Circular phylogeny of supergroup A and B Wolbachia, visualised with the root placed between the A and B supergroups and the remaining supergroups (C,D,E,F, J, S; nodes collapsed as grey wedge), highlighting newly sequenced genomes (black tip labels) and genomes from public databases (white).
B Incongruence between host topology (left) and supergroup A and B Wolbachia topology (right) is shown as a tanglegram.Overview of the supergroups infecting diverse insect orders is given in a table (inset, bottom right).
C Example of a host switching event, where the Wolbachia of the hoverfly Eupeodes latifasciatus has high nuclear sequence identity and genome colinearity to four Wolbachia genomes assembled from Lepidoptera.

Intrinsic properties of Wolbachia distinguish supergroups
The completeness of the new genomes, and in particular the circular assemblies achieved for 77 of them, permits analyses of genome properties that are not possible with fragmented and partial genomes.All circularised genomes, including those from public databases, were rotated to start at the presumed origin of replication.The average pairwise whole genome nucleotide identity between all Wolbachia genomes ranged between 77.3% and 100.0%, with at least 92.8% and 93.5% identity within supergroup A and B, respectively (Figure 3A).The number of breakpoints interrupting pairwise whole-genome alignments was counted, normalised for the total alignable length, and compared to average nucleotide identity (ANI) of the compared genomes (Figure 3A).A significant correlation was observed between nucleotide divergence and the number of breakpoints in supergroups A (0.90, p < 2.2e -16 , Spearman correlation) and B (0.69, p < 2.2e -16 , Spearman correlation) (Figure 3A).GC skew accumulates in stable bacterial genomes through differential mutation pressures on leading versus lagging strands.Genomes that have undergone frequent rearrangement are expected to have lower overall GC skew, which can be summarised across the genome as a single metric, SkewI 30 .Genomes from supergroups A and B had distinct GC contents (Figure 3B), with supergroup A having a higher mean GC (35.2%, standard deviation 0.15%), compared to B (34.0%, standard deviation 0.16%) (two-sample t-test p-value < 2.2e -16 ).Genomes from other supergroups had distinct GC content, often very different from A and B genomes, but as so few examples have been sequenced, general patterns are not discernible.In both A and B supergroups SkewI values were relatively low, but genomes from Wolbachia from nematode hosts (C, D, J) had higher SkewI values (Figure 3B).A high degree of GC skew had already been observed in the Wolbachia strain infecting Dirofilaria immitis (supergroup C) 31 .This suggests that nematode-associated Wolbachia have retained chromosome stability across a long timeframe, not observed in supergroups A and B, also evident by their large number of re-arrangements (Figure 3A).A Average nucleotide identity (ANI) plotted against the number of breakpoints in comparisons within A supergroup genomes, within B, between A and B and between other supergroup Wolbachia.
B Index of skewness compared to GC content for all circularised Wolbachia genomes.

Conservation and diversity in gene content of Wolbachia
Wolbachia, because they are sheltered within the cells of their hosts, may be relatively isolated from other bacteria, and thus have somewhat closed pan-genomes.One route to acquisition and sharing of new genes is through the Wolbachia phage (WO phage), which alongside the essential phage particle structural genes carry a cargo of genes that have been implicated in host manipulation.We re-annotated all 203 Wolbachia with the same, standard gene finding toolkit, Prodigal, to normalise annotations.While this may have lost careful manual revision in previously determined gene sets, it avoids issues of data incompatibility.Gene number correlated with genome size, and the average gene number in the newly assembled set of supergroup A and B Wolbachia was larger than in A and B genomes from the public databases (Figure S5).Comparing all genomes, the mean number of predicted genes was larger in supergroup B (1467) compared to A (1385).
We used OrthoFinder with default settings to define clusters of orthologous proteins across all Wolbachia genomes.Each genome contained between 0 and 184 novel, strain-specific genes (average 19).These novel genes were shorter than all genes (average gene length overall was 875 nucleotides or ~290 amino acids, while novel genes averaged 434 nucleotides or ~145 amino acids).As expected, supergroups which were not well represented often contained more strain-specific genes.For example, wCfeT from supergroup E (which infects cat fleas, Ctenocephalides felis) uniquely encoded genes for pantothenate (panC-panG-panD-panB) 32 and thiamine (thiG-thiC) biosynthesis.Nonetheless, out of the ten genomes with most strain-specific genes, seven belonged to either supergroup A or B. These novel genes were not preferentially associated with WO phage regions (Figure S6) but the majority (78%) had annotations that associated them with transposon and mobile element function.This suggests that much of the novelty arose through mobile elements other than WO phage.Other than clusters with one or two members, the most frequently observed cluster sizes were 203±2.These clusters contained the single-copy (and near-single-copy) orthologs deployed in phylogenetic analyses (Figure 4A).Overall, the majority of the proteins encoded in the Wolbachia genomes were members of orthology clusters that were present in at least 95% of all strains.
The abundant sampling of supergroup A and B genomes allowed us to address and compare the sizes of the core-and pan-proteomes of these groups.The larger genome and proteome size found in supergroup B was reflected in a larger core proteome (Figure 4B), but supergroup A had a larger pan-proteome (Figure 4B).While the core proteomes differed, very few of the protein families that were part of each supergroup's core proteome were unique to that supergroup.One supergroup-restricted set of protein families was found to comprise the operon for arginine transport (ArtM, ArtQ and ArtP and the repressor of arginine degradation ArgR) 33 , which was uniquely detected and conserved in supergroup A (present in 83/103 or 80% of all Wolbachia A genomes).Although the periplasmic arginine-specific binding protein (ArtI or ArtJ) was not detected, the presence of this ATP-binding cassette-type (ABC) transporter suggests that these Wolbachia are acquiring arginine from their hosts.
The operon producing biotin (vitamin B7) 34 was detected in seven of the 110 new genomes, all belonging to supergroup A (Figure 4C).One derived from Icerya purchasi (Hemiptera) and six were from Hymenoptera (two from Lasioglossus malacharum, which carried two strains, and single strains from three Andrena and a Nomada species).The biotin synthesis cluster has been described previously from a restricted but diverse set of supergroups, including two A genomes from additional Nomada bee hosts.This distribution suggests possible ecological linkage 35 , as Andrena bees are kleptoparasitized by Nomada cuckoo bees and phylogenetic analyses of both the biotin gene clusters and the Wolbachia core proteomes show close relationships between this cluster of genomes (Figure S7).The gene cluster is strongly conserved in physical organisation of all six necessary genes (bioA-D,F,H).In the genomic region immediately surrounding the operon we identified recombinase and transposase genes, as well as ankyrin repeat containing genes and toxin-antitoxin cytoplasmic incompatibility Cin gene pairs.In three genomes (from Andrena dorsata, Nomada fabricium and one of the L. malacharum strains) the operon was independently disrupted by transposases.The region containing the biotin operon thus has the hallmarks of a "virulence island" that may be mobile between genomes, and may have accrued additional genes (ankyrin, Cin) that hitchhike with the biotin operon.B Rarefaction analysis of pan-and core proteomes of supergroups A and B, based on 500,000 random addition-order permutations of co-occurring orthogroups excluding novel genes.
C Synteny of the biotin cluster shows conserved gene order and punctuated pattern of species presence (inset, species with biotin cluster present are highlighted with red circles).

WO prophage insertions expand genome size
Wolbachia can itself be infected by double-stranded DNA temperate bacteriophages, WO phage, which can integrate in the genome of its host as a prophage.Four modules are necessary for construction and function of phage particles during the lytic stage: head, baseplate, tail and fibre, and inserted and pseudogenised WO phage can be identified and discriminated based on the presence and completeness of these components.Regions of a Wolbachia genome flanked by WO phage modules are likely to form components that are transduced by the phage during infection of new cells, "cargo" loci that form the Eukaryotic Association Module (EAM) 36,37 .All the Wolbachia genomes were screened for prophage regions using essential module genes from previously annotated WO insertions (Table S4).Prophage regions were deemed complete when all four modules were observed with at least 80% of genes of each module present.An abundance of putative intact and pseudogenised WO phage were identified.For example the supergroup B Wolbachia from Ischneura elegans (the bluetail damselfly; the largest Wolbachia genome assembled) contained three putative intact prophage and nine WO phage fragments (Figure 5A) summing to 0.8 Mb of the genome.
The fraction of total prophage region in each genome ranged from 0-38%.Nematodeassociated Wolbachia typically are not infected by WO phage 38 and no prophage regions were detected in genomes of supergroup C, D, J and nematode-infecting F (Figure 5B).A significant correlation was found between genome size and WO prophage span in supergroups A and B (Figure 5B).This association was robust to correction for phylogenetic relatedness of the genomes (model fit increased to 0.84 and 0.87 respectively with p-values <10 -16 ).B Wolbachia genome size is strongly correlated with integrated prophage span in supergroups with WO phage association.Phylogenetic Generalised Least Squares (PGLS) analyses were performed to assess the correlation between prophage length and genome size in a phylogenetically-aware manner.

Toxins are often associated with mobile elements
We identified several potential cargo genes within intact and fragmented prophage.These included transposases and integrases associated with mobile elements, and other loci previously associated with eukaryotic manipulation, such as cytoplasmic incompatibility loci and ankyrin repeat containing genes, as expected from the EAM model 36,37 .
Wolbachia produces a suite of toxins 39 which can have dramatic effects on their hosts, such as cytoplasmic incompatibility (CI).The CI phenotype is caused by two adjacent genes, CifA and CifB, which function as a toxin-antitoxin pair 40,41 .Phylogenetic analysis classified most Wolbachia Cif gene pairs into four types (I-IV) 42 .A fifth type (V) is much more variable in structure.The toxin component can have nuclease activity (in which case the gene pair is frequently referred to as CinA-CinB), deubiquitinase (CidA-CidB) or both (CndA,CndB) 43 .All type II, III and IV pairs have nuclease domains, while all type I have deubiquitinase and most have nuclease 42 .Two hundred and sixty one full-length and likely functional Cif pairs were detected in 133 of the 181 (73%) supergroup A and B genomes.One Cif pair was detected in most genomes, but many had several, with seven copies in the Wolbachia strain infecting the holly tortrix moth (Rhopobota naevana).Most of the gene pairs (93) contained a deubiquitinase domain (type I, Cid), while the other four types occurred in roughly equal proportions (II: 40, III: 43, IV:35 and V:50).Many pairs (177/261; 69%) were located in the predicted EAM of the prophage.
Loci encoding additional toxins such as RelE/RelB, Spaid-like and latrotoxin were identified in multiple Wolbachia genomes, frequently in prophage regions (71/586 [12%], 136/597 [23%] and 227/382 [59%] genes, respectively).The Tc pore-forming toxin complex, which consists of two genes TcA and TcB-C, was detected in a limited number of A and B supergroup genomes and also showed a predisposition to occur within prophage (13/69 [19%] and 9/35 [26%], respectively).Additional toxin-encoding loci had limited presence in different subgroups and were not associated with prophage regions.ParD/ParE only occurred in supergroups A, B and E, and FIC only occured in supergroups A, E, F and S. The type IV toxin-antitoxin gene pair AbiEii/AbiGii-AbiEi, which protects against the spread of phage infection 44 , was only detected in two genomes in supergroup E. It is noteworthy that these two genomes had very low levels of prophage-derived DNA (4.3% of their genome span).

Discussion
Isolation of cobiont genomes, and specifically Wolbachia genomes, from shotgun highthroughput sequencing data has been established for many years 45 .In the field of prokaryotic and eukaryotic microbial metagenomics, metagenome-assembled genomes (MAGs) are likely to be the only way to access many unculturable microbial genomes, even if the species they derive from are hyperabundant 46,47 .The abundance of raw sequencing data in the International Nucleotide Sequence Database Collaboration (INSDC) databases has been an attractive prospecting ground for microbial associates of eukaryotic target species.To date, most raw data available for such searches have been short reads from Illumina and other platforms.These reads are too short to partition efficiently into bins corresponding to putative distinct genomes.Preliminary assembly of such datasets is more likely to be able to separate cobionts from target genomes.These approaches have been applied to hunt for Wolbachia with a recent tour de force generating nearly 1,200 Wolbachia MAGs from publicly available data 14 .However, these MAGs suffer from the expected issues of low completeness (due to low effective coverage), fragmentation (due to coverage and sequence repeat issues), undetected contamination and inability to distinguish co-infecting strains.Moreover, the biased nature of public data meant that these derived from only 37 different host species.
We generated 110 Wolbachia assemblies from 368 terrestrial arthropod HiFi datasets, and 77 of these were fully circular genome assemblies.The genomes were uniformly of high completeness (Figure S2).Due to the high intrinsic base quality of HiFi reads (Q30 to Q40; from one error in 1000 to one error in 10,000) we were able to distinguish insertions of Wolbachia DNA into the host genome from true components of the Wolbachia genome, and to independently assemble even closely related strains with confidence.As we were screening raw data from a biodiversity genomics programme that aims to sample a wide phylogenetic diversity of hosts, the new Wolbachia genomes presented here more than double the number of different host species from which Wolbachia genomes have been assembled.The assembled genomes include the first representatives isolated from Odonata (damselflies) and Orthoptera (crickets).In 16 additional datasets we identified likely Wolbachia content but were not able to produce credible genome assemblies (see Supplemental Data, Table S2).This was usually because the Wolbachia sequence was present in very low effective coverage (~ threefold) but in some samples no credible assembly was generated despite high coverage.These datasets may contain multiple recombining strains, or contain large insertions in the host genome and deserve further exploration.
The distribution of Wolbachia in insect hosts is a function of the balance between co-speciation (vertical transmission of Wolbachia among daughters of the host species) and horizontal transmission where strains move between species.Transmission among insect hosts was the dominant pattern underpinning Wolbachia distribution, but we identified two features of the distribution, one local and one general, that are of note.Lepidoptera were more likely to be infected with supergroup B Wolbachia than A, and Hymenoptera, Diptera and Coleoptera were more likely to be infected with supergroup A strains.Multi locus sequence typing (MLST) has previously shown that supergroup B is the most common Wolbachia type in Lepidoptera 19,27- 29 .This suggests some non-exclusive specialisation of Wolbachia on their hosts, which may be driven by Wolbachia genetics, host genetics or (less likely) a distinct set of ecological transmission routes in each insect group.Many of our genomes derived from insects were collected at one site, the Wytham Woods Genomic Observatory (Figure S1) but this subset was no more closely related than other genomes from widely separated sites (Figure S5).It is likely that the mobility of hosts, including through seasonal migration, means that sampling from one geographical site is a valid approximation of more global sampling.
Close ecological association between host species may promote sharing of Wolbachia isolates and localised genetic exchange, for example within predator-prey systems.The close similarity of Wolbachia genomes from Andrena solitary bees and their Nomada cuckoo bee kleptoparasites, and the shared occurrence of the biotin synthesis operon (Figure 4C) may be a case of transmission within an ecological network.The presence of the biotin operon in Wolbachia of insects that largely or solely feed on low-protein plant fluids (nectar or phloem) suggests that the Wolbachia may be offering nutritional support to their hosts 48 , and thus that this cluster of genomes may have been positively selected for their mutualist tendencies.Other genes whose distribution among isolates is driven by horizontal gene transfer, including by mobile elements such as phage, might be expected to have a distribution that is not explained by overall genome relatedness, and might reflect ecological association.We note that previous work has suggested that horizontal transmission rather than cospeciation may also explain closely related Wolbachia in closely related taxa.For example, genomic divergence between closely related Wolbachia in sister Drosophila species was too low to be the product of independent evolution since the last common ancestor of the flies 49,50 .
Wolbachia can promote reproductive success of their hosts 1,2 , and thus their own Darwinian fitness, through reproductive manipulations such as CI.The loci underpinning CI are a diverse set of toxin-antitoxin gene pairs.Our survey of Wolbachia identified many additional CI gene pairs, mainly of the I Cid type and mostly associated with WO phage.Many genomes had more than one toxin-antitoxin pair, and some individual hosts were infected with multiple Wolbachia strains carrying different CI gene pairs.These CI genes likely mediate conflict between Wolbachia strains and the ecosystem of toxin deposition and rescue in individual zygotes must be complex.Interestingly we identified CI gene pairs next to 5 of the 14 biotin synthesis operons, suggesting that the mobile elements that transduce this presumably mutualist physiology are also engaged in CI conflict.
One striking feature of the genomes assembled from the HiFi reads was that their average span was ~10% greater than the average size of previously-assembled Wolbachia genomes.As we also observed a correlation between content of WO phage in the genome and genome size (Figure 5B), we speculate that the lower average size of previous assemblies may be because the presence of near-identical segments of phage and other mobile elements led to collapse of repeats and artificial underestimation of true genome size.This underestimation of genome size may also have biased understanding of WO phage diversity and of the diversity of genes that can be transduced by the phage.WO phage carry genes necessary for production of phage particles and cargo genes that have been hypothesised to form an Eukaryotic Association Module (EAM) 36,37 .The increased genome size and increased resolution of WO phage copies might also mean increased gene content and diversity, and an increased set of common EAM loci.We estimated the pan-proteome of A and B supergroup strains, and found that the A supergroup had a higher pan-proteome but a smaller core proteome than supergroup B. Coupled with the observation of host-association bias between these supergroups, and other major genomic features such as GC proportion, this suggests that these divergent groups have followed very distinct evolutionary trajectories, despite evidence for transduction of loci between supergroups, and perhaps have evolved distinct physiologies and host-manipulation or -cooperation strategies.We note that the average nucleotide identity (ANI) between A and B supergroup strains, and between strains from all supergroups, is relatively low (within-supergroup identity >93%, between supergroup identity <88%).This pattern of significant phylogenetic separation between supergroups suggests, as others have noted, that these supergroups have the features expected of bacterial species 33 .
The Darwin Tree of Life project 15 is one of a growing constellation of biodiversity genomics initiatives worldwide that, under the banner of the Earth BioGenome Project 51 , intend to "sequence life for the future of life" (https://www.earthbiogenome.org).These projects, based around ecological, regional or taxonomic lists of target species, will lay the foundations for biological research, bioindustry and conservation for the next decades.While their focus is to generate reference genomes for eukaryotic species, these projects will also yield critical resources for the study of the microbial cobionts -mutualists, pathogens, parasites and commensals -that live on and in eukaryotic organisms.Our understanding of Wolbachia and other common endosymbionts will thrive on a rich harvest of cobiont genomes from the tens to hundreds of thousands of host genomes that will be generated in the next decade.The assembly of 110 high-quality Wolbachia genomes shows the power of the long read data now being generated and the analytic approach that allowed these low complexity metagenomes to be effectively separated into their constituent parts.Analysis of these genomes revealed a propensity to infect different insect orders among supergroups, while simultaneously pinpointing to several host switching events during the course of the Wolbachia pandemic.Moreover, we observed that genome size in Wolbachia is correlated with the abundance of active and pseudogenised copies of bacteriophage WO.

Methods
Detection and assembly of Wolbachia genomes from DToL species data DToL raw data are generated from whole or partial single specimens, and thus contain sequence from any cobionts in or on the specimen at the time of sampling.We screened data for 368 insect genomes generated by the Darwin Tree of Life project 15 for the presence of the intracellular endosymbiont Wolbachia (Table S1) using a marker gene scan approach by searching for the small subunit rRNA locus.The prokaryotic 16S rRNA alignment from RFAM (RF00177) 52 was transformed into a HMMER profile and the profile was used to screen contigs with nhmmscan 53 .We defined a positive match as having an e-value <10 -150 or an aligned length of >1000 nucleotides.Putative positive regions were extracted from the sequences, and compared to the SILVA SSU database (version 138.1) 54 using sina 55 .Matches were filtered to retain only those with >90% identity.Taxonomic classification of each positive was determined via a consensus rule of 80 percent of the top 20 best hits, using both the NCBI 56 and SILVA 57 taxonomies.
For Wolbachia-positive samples, all PacBio HiFi reads were analysed using kraken2 58 with a custom database consisting of a genome from a species closely related to the host, all RefSeq genomes of Anaplasmataceae and reference genomes of additionally detected cobionts downloaded using NCBI datasets and masked using dustmasker 59 .Horizontal transfer of fragments of endosymbiont and organellar DNA to the nuclear genome is a common phenomenon.To avoid inadvertently classifying nuclear Wolbachia insertions (NUWTs) as deriving from an independent bacterial replicon, Wolbachia reads identified by kraken2 were mapped to the insect genome assembly and only contigs fully covered by these reads were retained.The Wolbachia reads were also independently re-assembled using several assembly tools: flye (version 2.9) (flye --pacbio-hifi {reads} -o {dir} -t {threads} --asm-coverage 50 -genome-size 1.6m --scaffold) 60 , hifiasm (version 0.14) (hifiasm -o {prefix} -t {threads} {reads} -D 10 -l 1 -s 0.999) 61 and hifiasm-meta (version 0.1-r022) (hifiasm_meta -o {prefix} -t {threads} {reads} -l 1) 62 .The several assemblies generated for each sample were ranked based on their completeness using BUSCO version 5.2.2 63 and the Rickettsiales_odb10 dataset, alignment to reference genomes using nucmer (version 4.0.0) 64, evenness of coverage and circularity.The best (most complete, single-contig circular preferred) assembly per sample was chosen.For samples where 10X Genomics Chromium data were available, polishing was performed using FreeBayes-called variants 65 from 10X short reads aligned with LongRanger.The host origin, span and completeness of all Wolbachia detected is presented in Table S2.

Collation of Wolbachia genome dataset, gene prediction and orthology inference
All available Wolbachia genomes were downloaded from NCBI GenBank on 01/02/2022, and supplemented with assemblies generated from short-read insect datasets by Scholz et al. 14 .This dataset contained replicate genomes for very closely related Wolbachia from the same host, and many fragmented and partial assemblies.Only the most contiguous assembly per host species was retained.These genomes were renamed using the schema "R_Xyz_GenSpec_ §", where Xyz is the first three letters of the insect order of the host, GenSpec is an abbreviation derived from the generic and specific epithets of the host, and § indicates the supergroup.Retained assemblies were assessed for the presence of contamination by performing a contig analysis by kraken2 using a database of only circular Wolbachia genomes.A list of all removed contigs can be found in Table S3.Furthermore, we only included database-sourced Wolbachia genomes with at least 90% BUSCO completeness 63 and at most 3% duplication with the Rickettsiales_odb10 dataset (Table S3).The exception to this filtering was the inclusion of genomes belonging to the most divergent supergroup S.
All of the publicly available and newly assembled genomes were annotated using Prodigal (version 2.6.3) 25 .Protein families were inferred using OrthoFinder (version 2.4.0) 26 .We identified 624 protein families which were single-copy in more than 95% of all Wolbachia genomes.These were individually aligned using mafft in automatic mode (version 7.490) 66 .Individual maximum likelihood gene trees were calculated using iqtree (version 2.1.4)(iqtree -s {alignment} -nt {threads}) 67 , and coalescence of these gene trees was determined using ASTRAL (version 5.7.4) 68 .The individual alignments were trimmed using trimAl (version 1.4) 69 , and concatenated to form a supermatrix.This was used to infer a maximum likelihood phylogeny with iqtree using 1000 ultrafast bootstrap approximation iterations (version 2.1.4)(iqtree -s {supermatrix} -m LG+G4 -bb 1000 -nt {threads}) 67 .The insect topology was subsampled from Chesters and al 70 .Incongruence in topology between the insect host and Wolbachia, host phylogeny was determined with ggtree in R 71 .

Intrinsic genomic properties
All circular genomes were rotated to start with HemE (OG0000716) on the positive strand, as this gene is located next to the origin of replication 72 .All pairwise alignments were calculated using nucmer (version 4.0.0) 64, and breakpoints were inferred and adjusted for the aligned coverage.Average nucleotide diversity was calculated using FastANI (version 1.33) 73 .GC and GC skew index values were calculated for all genomes using SkewIT 30 .

WO prophage analysis
A list of known prophage sequences was generated based on annotated regions described in the literature 37,40,78 (Table S4) for a set of genomes (R_Dip_DroSim_A, R_Hym_NasVit_A, R_Dip_DroAna_A, R_Dip_HaeIrr_A, R_Hym_CerSol_A and R_Hym_WiePum_A) and linked to their respective gene families.Each Wolbachia genome was screened for continuous stretches of linked prophage genes with at most five other genes in-between and these were annotated as prophage regions if they contained at least one gene from one of the four core phage modules (head, baseplate, tail, fibre).This permitted detection of novel prophageassociated genes.Regions which contained at least 5 of 6 head, 7 of 8 baseplate, 5 of 6 fibre and 5 of 6 tail module genes were deemed complete.Genomic maps of prophage integration were created with circos 79 .Phylogenetic Generalised Least Squares analyses were performed to assess the correlation between prophage length and genome size using the ape R package 80 , using a Brownian model of evolution and the phylogenetic tree in Figure 2A.R squared values were calculated using the package rr2 81 .

Figure 1 :
Figure 1: Prevalence and relative abundance of Wolbachia in DToL insect genomes.

Figure 4 :A
Figure 4: Exploration of Wolbachia protein-coding gene diversity.A Histogram of protein family size per supergroup.

Figure 5 :
Figure 5: WO prophage in Wolbachia A Annotation of the WO prophage integrated in the genome of the Wolbachia strain infecting Ischnura elegans.