Biodiversity genomics of small metazoans: high quality de novo genomes from single specimens of field-collected and ethanol-preserved springtails

Genome sequencing of all known eukaryotes on Earth promises unprecedented advances in evolutionary sciences, ecology, systematics and in biodiversity-related applied fields such as environmental management and natural product research. Advances in DNA sequencing technologies make genome sequencing feasible for many non-genetic model species. However, genome sequencing today relies on large quantities of high quality, high molecular weight (HMW) DNA which is mostly obtained from fresh tissues. This is problematic for biodiversity genomics of Metazoa as most species are small and yield minute amounts of DNA. Furthermore, briging living specimens to the lab bench not realistic for the majority of species. Here we overcome those difficulties by sequencing two species of springtails (Collembola) from single specimens preserved in ethanol. We used a newly developed, genome-wide amplification-based protocol to generate PacBio libraries for HiFi long-read sequencing. The assembled genomes were highly continuous. They can be considered complete as we recovered over 95% of BUSCOs. Genome-wide amplification does not seem to bias genome recovery. Presence of almost complete copies of the mitochondrial genome in the nuclear genome were pitfalls for automatic assemblers. The genomes fit well into an existing phylogeny of springtails. A neotype is designated for one of the species, blending genome sequencing and creation of taxonomic references. Our study shows that it is possible to obtain high quality genomes from small, field-preserved sub-millimeter metazoans, thus making their vast diversity accessible to the fields of genomics.


INTRODUCTION
Biodiversity genomics uses genome-scale data to study the molecular basis of biodiversity.
New results already revolutionize life-and environmental sciences by addressing scientific questions about evolution, phylogeny, ecology, etc., assisting the management of biodiversityrelated resources, typically through monitoring some aspects of biodiversity (community composition, intraspecific diversity, gene flow, etc.), and through bioprospecting for new resources, e.g. bioactive compounds. Large-scale biodiversity genomics is increasingly possible today, thanks to rapid advances in DNA sequencing.
Two years ago, the Earth BioGenome Project announced plans to sequence all known ~1.5 M eukaryotic species (Lewin et al., 2018). Regarding the taxonomic distribution of these eukaryotes Robert M. May wrote in 1986: "to a good approximation, all species are insects" (May, 1986).
This points to an important issue in biodiversity genomics: most known eukaryotic biodiversity belongs to small species. Considering only metazoans, several million species remain to be described (Stork, 2018), and we have good reasons to expect that the absolute majority of these are also in the below one mm size range (Stork, McBroom & Hamilton 2015). The ability to genome-sequence small metazoans is thus necessary for biodiversity genomics.
However, minute animals pose some real challenges to genome sequencing, starting with specimens collection. For example, soil microarthropods are generally field-trapped (e.g. with pitfall trap), extracted from their habitat-e.g. with a Berlese funnel (Berlese, 1904) or subsequently improved systems such as Macfadyen (1961). Capture on sight (e.g. with a mouth aspirator) is tedious and used to a lesser extent. Regardless of the method, members of a sampled community are collectively gathered in a collection tube. Keeping small animals extracted from their environment alive is difficult as they are fragile, sensitive to heat, desiccation, and they might prey on or defend from each other. In most instances, there are incompressible delays between specimens collection and DNA extraction: trapping time, field trip schedule and travel time, collected material sorting, identification requiring specimens preparation for microscopy and morphological analysis itself. Unexpected delays might also occur (need to repeat a failed lab work for example). Therefore the specimens must generally be preserved on the spot to avoid their loss during processing time.
It is often the case that only a few specimens of a given species are brought back from a field trip. Also, the possibility of high level of genetic diversity within a population or of sympatric occurrence of divergent genetic lineages within an asexual morphospecies (e.g. Montero-Pau, Ramos-Rodríguez, Serra, & Gómez 2011;Schneider & D'Haese 2013) increases the complexity of obtaining an accurate haploid genome assembly from pooled specimens. Clonal amplification or in-breeding could overcome this issue, but setting up new cultures for each different species is not realistic. Therefore, except for a handful of very common or laboratory-cultured species (such as the springtail Folsomia candida, a soil-dwelling model species sequenced by Faddeeva-Vakhrusheva et al., 2017), working from a large amount of fresh tissue is simply not an option for biodiversity genomics of small metazoans. Genome sequencing from a single, field-preserved specimen alleviates those obstacles.
Several library preparation approaches were developed for short-read sequencing on Illumina platforms from minute DNA quantities (Carøe et al., 2017;Meyer & Kircher, 2010). Short reads alone are not sufficient to produce low-fragmented genome assemblies, so Illumina-sequenced genomes are not sufficiently high quality for many applications (Jarvis, 2016). Today the technical workhorses for highly continuous genomes are long-read DNA sequencers, produced by two companies: Pacific Biosciences (PacBio), and Oxford Nanopore Technologies (ONT). Longread sequencing requires high molecular weight DNA, available in high quantities. The first precondition is obviously not negotiable, but much effort is directed to decrease the amount of necessary DNA. For example, PacBio recently released a "Low input library preparation" protocol, with a recommended requirement of 150 ng DNA (https://www.pacb.com/wp-content/uploads/Procedure-Checklist-Preparing-HiFi-Libraries-from-Low-DNA-Input-Using-SMRTbell-Express-Template-Prep-Kit-2.0.pdf), down to 50 ng DNA (Kingan et al. 2019a). This allows amplification free genome-sequencing of small metazoans in the size range of single mosquitoes (Kingan et al., 2019b(Kingan et al., , 2019c. But as most metazoans are smaller than a single mosquito, this 50--150 ng DNA requirement will generally prove to be already too high. Whole genome amplification (WGA) is frequently used to amplify the genomes of unicellular unculturable organisms (Binga, Lasken, & Neufeld 2008;Woyke, Doud & Schulz 2017), including unicellular eukaryotes (Fischer et al., 2009 (Pinard et al., 2006;de Bourcy et al., 2014). We evaluated these issues by sequencing a pool of several specimens of S. aquaticus on PacBio Sequel II, after a non-genome-wide amplification-based low input library preparation (Kingan et al., 2019b).
D. tigrina (Entomobyomorpha, Isotomidae) is an epedaphous species: upper layer of soil and litter dweller mostly found in anthropized environments (Potapov, 2001). It can be very abundant in vegetal compost, is also found in crop fields (Gruss & Twardowski, 2016), and can occur in caves as a troglophile (Dányi 2011). In western europe it remains active in winter. It has been reported from Macquarie Island (sub-antarctic island), where it is believed to be alien but not found more tolerant to low or high temperatures than the indigenous species (Phillips et al., 2020). S. aquaticus (Symphypleona, Sminthurididae) is an epigeous, hygrophilous species widely spread in the Holarctic region (Bretfeld 1999). The species is common in fresh waters. It gathers on solid ground created by emerging plants, wood and rocks and can efficiently walk and jump on water surfaces thanks to its elongated claws and strong furca (jump appendage) with tip (mucro) shaped as a paddle. The species is also remarkable by its strong sexual dimorphism: the male is significantly smaller than the female and its modified antennae into a prehensile organ allows it to clasp the female antenna in a courtship dance preceding external fecondation ( Fig. 1 B-E). It is a relatively well studied species according to Collembola standards. Its body mass is estimated around 0.05 and 0.14 mg (Eisenberg, 1989). A peculiar postzygotic of sex determination through aberrant spermatogenesis was described by Dallai, Fanciulli & Frati (2004).

PacBio ultra-low input (UL) library preparation
Extraction was performed from a single specimen each time. Specimen was washed in 1 x PBS (Sigma) to remove as much as possible residual EtOH. EtOH was first replaced with PBS, then the solution was replaced four times with clean PBS. Specimen was crushed with one-way pistils (Sigma) then DNA was extracted according to the Qiagen MagAttract kit (Hilden, Germany).
Altogether, we performed eight extractions from S. aquaticus and four extractions from D.
tigrina specimens. Extracted DNA was quantified with the Quantus dsDNA system (Promega) and quality was assessed by FEMTOpulse (Agilent). UL input libraries were prepared using an early access kit kindly provided by Pacific Biosciences. Briefly, gDNA was fragmented with g-Tubes (Covaris) and the resulting fragment size was again inspected by FEMTOpulse (Agilent).
Next, single-stranded overhangs were removed enzymatically, followed by a DNA damage repair, end repair and an A-tailing step. A double-stranded DNA adapter with a T-overhang was ligated for 1 h at 20°C and the resulting products were bead purified (ProNex, Promega), eluted and then split into two identical aliquots. DNA fragments with adapters were amplified by two different PCR Circularity was validated manually, and nucleotide bases were called with a 75% threshold consensus. The mitochondrial genomes were annotated with MITOS2 web server (Bernt et al., 2013). CDS were checked and corrected using Geneious, to ensure that the presence of uncommon start codon and incomplete stop codons did not mislead the automatic annotation algorithms. Boundaries of the rDNAs were slightly adjusted to make them contiguous with the tRNA(val) gene.
To identify insertions of the mitochondrial genome in the nuclear genome (NUMTs) we used ncbi-blastn with a 2x duplicated sequence of the mitochondrial genome as a query (to handle circularity). We recognized the presence of almost complete copies of the mitochondrial genome in the two species, with length equal or superior to the mean length of the CCS (Figure 3). We investigated the quality of the UL assemblies in those locations using IGV  and we applied the following procedure to NUMTs with length > 10 kb : at both extremities of the NUMT, a ~1000 bp sequence with a 500 bp overlap on the contiguous nuclear sequence was selected. All CCS aligning with those sequences were gathered (blastn) and reassembled using Geneious. Matching extremities were bridged by visual recognition of the specific DNA sequence of each NUMT, made possible by the high accuracy of the CCS. Whole genome assemblies were manually corrected by splitting, swapping and resoldering contigs where missaemblies were observed. When a given extremity could not be bridged to another, the contigs were left splitted.

Contamination control
We checked the UL assemblies for contamination using a set of tools. We use Sourmash (Pierce, Irber, Reiter, Brooks & Brown, 2019) to create minHash signatures of each contig and compare them to pre-made search databases for taxonomic assignment of microbial genomes (bacterial, viral and fungal, Genbank LCA Database 2017.11.07, https://sourmash.readthedocs.io/ en/latest/databases.html), using k-mer sizes of 21, 31, 51. We also aligned contigs against the NCBI nr protein database (downloaded on February 7, 2020) with DIAMOND, using long read settings (Buchfink, Xie & Huson, 2015). We evaluated DIAMOND hits with MEGAN 6.18.10 (Huson et al., 2016), using its long read algorithm (Huson et al., 2018). We used the following settings for the MEGAN assignment: minScore 500, maxExpected: 0.01, minPercentIdentify 0, topPercent: 5, minSupportPercent: off, percentToCover 51. Finally, we used ncbi-blastn to query the contigs against the NCBI nr nucleotide database (-max_target_seqs 10 -max_hsps 1 -evalue 1e-25). We used the BlobToolKit pipeline (Challis, Richards, Rajan, Cochrane & Blaxter, 2020) to merge and plot the DIAMOND and BLAST taxonomic assignment with the assembly statistics (GC content and base coverage). Contigs explicitly assigned to anything else than metazoan were excluded from downstream analysis.

Assemblies assessment
All assemblies completeness was assessed using BUSCO v4.0  (Okonechnikov et al., 2016). minimap2 was run with "-H -x map-pb" to map CLR subreads on the L assembly, and "-H -ax asm20" to map CCS on the UL assembly.
backmap estimates genome size from mapped nucleotides divided by mode of the coverage distribution.

Phylogenetic analysis
We downloaded 14 springtails and 1 dipluran genomes assemblies from NCBI (dataset expended from Sun et al. 2020, table 1). We used BUSCO v4.0.6 in short mode and restricted the ortholog search to BUSCO's in the arthropoda_odb10 dataset. We performed gene prediction with AUGUSTUS v2.5.5, trained with the common fly dataset. We screened the obtained BUSCO sets with a custom script to identify genes shared among the species, allowing for 25% missing data.

Species biology and taxonomy
Identification was further validated following Potapov (2001). 17 females and 3 males on 12 slides labelled CSCH-1326-1337 will be deposited to the Apterygota collection of the Senckenberg, Görlitz. 6 females, 2 males and 2 juveniles on 4 slides numbered EA013940-43 will be deposited to the Apterygota collection of the National Museum of Natural History, Paris.
S. aquaticus was the only springtail species forming a population on the pond (i.e. no accidental fall on water surface). Abundantly found in October 2019, it was observed again in June 2020 in large numbers and courtship behavior undergoing (Fig 1D, E). The specimens identification was unambiguous following Bretfeld (1999), Fjellberg (2007) and Stach (1956) descriptions. One male on a slide numbered CSCH-1338 is designated as the neotype for S.
aquaticus (see discussion) and will be deposited to the Apterygota collection at the National Museum of Natural History, Paris, along with 3 females and 2 males on five slides (CSCH-1339-1344) and 20 individuals in 96% ethanol (CS.371, leg. C. Schneider). Three males, three females and one juvenile on 5 slides numbered CSCH-1345-1349 will be deposited to the Apterygota collection at Senckenberg, Görlitz.

Genome statistics estimations from CCS
For D. tigrina a total of 21,3 Gb HiFi data (Q>=20) was produced with a mean read length of 12218 bp with a single 8M SMRT cell. The genome haploid length was estimated to be 167.

L assembly
Flye yielded the best result. This assembly is characterized by a much lower contiguity than the ultra-low assembly, and a high BUSCO (95.8%).
Statistics for all assemblies are reported in table 2, and visuals of UL curated assemblies for both species were created using assembly-stats (version 17.02, https://github.com/rjchallis/assembly-stats) and reported in Figure 2.

Genome annotation
We predicted 17807 proteins in the genome of S. aquaticus, 11593 of which had homologs in other organisms, 6204 were labeled "hypothetical protein". We predicted 21503 proteins in the genome of D. tigrina, 13576 of which had homologs in other organisms and 7927 were labeled "hypothetical protein". We found four predicted genes in the S. aquaticus assembly which were associated with the metallo-beta-lactamase superfamily, and four predicted genes related to aminopenicillanic acid acyl-transferase. Nucleotide and protein searches did not reveal genes homologous of the isopenicillin N synthase-like genes neither in the S. aquaticus, nor in the D.
tigrina genome sequenced with the UL protocol.

Contamination
The SourMash assignment did not match contigs to a viral, bacterial or fungal genome for any of the two species. For S. aquaticus, the DIAMOND + MEGAN-LR assignment and the BlobToolKit pipeline (Megablast + Diamond) were not in complete agreement. The BlobToolKit generally provided more precise taxonomy: 10 contigs either assigned to Plantae species or unassigned by DIAMOND + MEGAN-LR were assigned to Collembola by BlobToolKit pipeline.
2 contigs unassigned by the BolbToolKit pipeline were assigned to Insecta species by DIAMOND + MEGAN-LR. Four small contigs (> 77084 bp) were either without assignment or assigned to Fungi or Streptophyta. A larger contig (212931 bp) was identified as a Cyanobacteria genome fragment. Those 14 contigs were excluded from downstream analysis. For D. tigrina, no evidence for contamination was returned by any approaches.

Mitochondrial genomes and NUMTs
The complete set of 37 mitochondrial genes (13 proteins, 22 tRNA and 2 rRNA coding genes)  Figure 3.

Phylogeny.
We identified 932 single copy orthologs in the genome assemblies. After screening these for  Figure 4.

DISCUSSION
We sequenced the reference genomes of two Collembola species, after we whole-genomeamplified high molecular weight DNA extracts from single specimens. Both species were fieldcollected and preserved in ethanol. The final assemblies were of high contiguity and completeness, on par with recent genomes sequenced on PacBio from single insects (Kingan et al., 2019b(Kingan et al., , 2019c, and also on par with the best reference genome for a Collembola so far, Sinella curviseta, which was DNA sequenced from 500 specimens maintained in culture (Zhang et al., 2019).

Taxonomy
Field collected specimens in poorly known groups often present taxonomical issues, especially linked to cryptic diversity and complexity of species identification. The genus Desoria illustrates this complexity: within the D. tigrina group sensu Fjellberg 2007, D. tigrina andD. grisea (Lubbock, 1869) are two sibling species, described by pioneers of modern Collembola systematics. Desoria grisea was redescribed by Fjellberg (2007) from its type locality. It is distinguished from D. tigrina by consistent chaetotaxy characters on the labial palp. We examined several specimens from our collection spot and recognized all the characters of D. tigrina.
Sequencing from a single specimen prevented the risk of sequencing a chimeric species.
S. aquaticus was originally described from France and has been recognized to be widely spread throughout the Holarctic region. We confirmed that our specimens are identical to the accepted descriptions of S. aquaticus. S. aquaticus was originally described by Bourlet in 1841 probably from the north of France. Bourlet did not make any reference to a type series, and to our knowledge did not preserve any specimens. The population we sampled in Paris is ideal to provide a new reference for this species: the population is abundant, settled, and easily accessible for further studies. The designation of a neotype accompanied with abundant materialspecimens in slides; in ethanol and high quality genomic information -makes for a fine example of integrative taxonomy.

Genomes
The higher level of heterozygosity in D. tigrina compared to S. aquaticus seems consistent with the expected level of isolation of the populations. D. tigrina invaded the compost that was set up one year before the collection. The species is very mobile, being rather large and equipped with a long furca, and gene flow must be active across the nearby surrounding fields and gardens.
On the other hand, the sampled population of S. aquaticus seems rather isolated in a small area (artificial pond in a public garden). The L assembly of S. aquaticus (12 pooled females) is closest to estimated size from the CCS kmer analysis, and yields less duplicated BUSCOs, without any haplotypes purging step. This is possibly explained by more pronounced collapsing of the haplotypes during this assembly based on high-error rates CLR. The haplotig-purged assemblies size for both species are superior to the kmer estimated genome size. Standing alone, the ultra-low protocol can already provide high quality genomes on par with libraries produced without genome-wide amplification. Comparison and combination of the ultralow and low input data sets for S. aquaticus led to two observations: both performed equally in recovering BUSCOs and the ultra-low approach allowed to produce an assembly of higher contiguity. The reads of the low input protocol were relatively short and produced in CLR mode, which can account for the lowest performances. Our results show that the ultra-low protocol does not systemically miss large fractions of the genome and it can be used as a reasonable stand alone approach for single specimen sequencing of small metazoans. The presence of large NUMTs in the nuclear genome led to misassemblies because they can hardly be fully crossed by ~10-12 kb reads and because those regions are extremely similar between each other and to the mitochondrial DNA. This was observed in both species and such large NUMTs could be common in springtails genomes. Combination with long-range positional information could improve assembly by scaffolding and detection of misassemblies.
We could not identify beta-lactam antibiotic genes in the two annotated genomes. This is not surprising in the case of S. aquaticus which does not need to cope with such high microbial load.
Indeed, three of four analysed species from the springtail order Symphypleona do not possess these genes . It is surprising however, that we did not find evidence of these genes in the genome if D. tigrina, which inhabits soils and likely needs to cope with high microbial load. Suring et al. (2017) identified beta-lactam gene clusters in four of six Entomobryomorpha species they investigated, including F. candida, which belongs to the same family as D. tigrina (Isotomidae). Beta-lactam gene clusters are sometimes not recovered from the genomes of hemiedaphic species , and D. tigrina also belongs to this group (Potapov, 2001 (2020) is not strongly supported.
The disagreement in basal relationships is not unexpected, as it occurs in nodes with poor support scores. We do not claim that our extended dataset is relevent in addressing this issue.
Both histories of Collembola evolution is conflicting with the traditional hypotheses based on morphological evidences, and peculiarly the body segmentation: viewed as ancestral in Poduromorpha (all trunk segments distinct), modified in Entomobryomorpha (reduction of the first thoracic segment and sometimes fusion of abdominal segment), and further modified in Neelipleona and Symphypleona (more or less complete coalescence of thoracic and abdominal segmentation and acquisition of a globular appearance). The Symphypleona group was initially created including the genera of Neelipleona (Börner 1901). Massoud (1971Massoud ( , 1976

UL and L comparison
The comparison of genome completeness between the ultra-low input and low input protocol further suggests that the assemblies obtained from the ultra-low input pipeline are of the highest quality that could be expected. Unfortunately, the rather short reads yielded by our low input sequencing limits an in-depth comparison. Nonetheless, the almost identical BUSCO scores indicate that the genome-wide amplified ultra-low sequencing does not result in a biased representation of the genome, which is one of the main concerns of sequencing including a PCR step. Although PacBio sequencing remains more expensive than its short-read sequencing e.g.
with Illumina, the more expensive long, but high fidelity (HiFi) reads provide a viable standalone solution to generate reference genomes. Our results demonstrate that long PacBio reads can now be used to sequence genomes of a much wider part of the Tree of Life, extending biodiversity genomics research to many small metazoans around the one millimeter size range.
Some highly interesting terrestrial taxonomic and functional groups in this regard are beetles, Diptera Hymenoptera, mites, and the soil mesofauna. Beetles were long considered the most biodiverse animal group (Hutchinson, 1959;Zhang et al., 2018), with amazing adaptations and tremendous diversity in the tropics. However, recent estimates suggest that both Diptera (Brown et al., 2018) and Hymenoptera (Forbes et al., 2018) might be considerably more diverse than beetles. This is due to a huge number of undescribed small species. Arachnids are also highly diverse, with potentially over 1 million species, most of which are mites (Stork, 2018). One of the largest reservoirs of terrestrial biodiversity is the soil mesofauna. Those animals, typically defined by sizes between 0.1-2 mm, represent many ancient terrestrial lineages: nematodes, potworms, mites, springtails, proturans, pauropods, insects, isopods, myriapods among others. Small metazoans are important contributors of ecosystem services (pollination, pests and parasites, biocontrol agents, etc. (Scholes, 2018). Genome sequencing is increasingly relevant for monitoring and managing these communities, from detecting underground invasions (Andújar et al., 2017) to chemical-free pest control. For example, genome information is essential to identify targets for RNA interference, an emerging species-specific alternative to chemicals (Vogel et al., 2019). Small metazoans are also important sources of novel natural products (Vilcinskas, 2014).
Information on these products might be encoded in the genomes of the metazoans themselves (Faddeeva-Vakhrusheva et al., 2017;Drukewitz & von Reumont, 2019), or in the microbial symbionts which are co-sequenced with their hosts (Agamennone, 2019;Tobias, 2016). Genomesequencing small metazoans will provide insights into the formation and maintenance of eukaryotic biodiversity and this will open up opportunities for natural resource management and bioprospecting.