Optimized DNA extraction and library preparation for minute arthropods: Application to target enrichment in chalcid wasps used for biocontrol

Target enrichment is increasingly used for genotyping of plant and animal species or to better understand the evolutionary history of important lineages through the inference of statistically robust phylogenies. Limitations to routine target enrichment are both the complexity of current protocols and low input DNA quantity. Thus, working with tiny organisms such as microarthropods can be challenging. Here, we propose easy to set up optimizations for DNA extraction and library preparation prior to target enrichment. Prepared libraries were used to capture 1,432 ultraconserved elements (UCEs) from microhymenoptera (Chalcidoidea), which are among the tiniest insects on Earth and the most commercialized worldwide for biological control purposes. Results show no correlation between input DNA quantities (1.8–250 ng, 0.4 ng with an extra whole genome amplification step) and the number of sequenced UCEs on an Illumina MiSeq. Phylogenetic inferences highlight the potential of UCEs to solve relationships within the families of chalcid wasps, which has not been achieved so far. The protocol (library preparation + target enrichment) allows processing 96 specimens in five working days, by a single person, without requiring the use of expensive robotic molecular biology platforms, which could help to generalize the use of target enrichment for minute specimens.

Indeed, current protocols (DNA extraction, library preparation, target enrichment) are time-consuming and require handling expertise.
They have been initially developed to work with large amounts of input DNA (e.g., vertebrates or large/medium-size insects ;Faircloth, Branstetter, White, & Brady, 2015;McCormack et al., 2013) and include many purification steps that increase DNA loss. Working on hyperdiverse groups of microarthropods is challenging, as it requires one to perform the extraction on (a) a large number of specimens/ species to be representative of the overall diversity of the group, without the possibility of using pipetting robots that increase DNA loss, (b) single individuals because species complexes are frequent (Al Khatib et al., 2014;Kenyon, Buerki, Hansson, Alvarez, & Benrey, 2015;Mottern & Heraty, 2014), (c) the whole insect without destruction for vouchering and often prior to species identification, (d) rare species that have been collected once and may be represented in collections by a few specimens or only one specimen and, sometimes, (e) old and dry museum specimens used for species description (types).
In this study, we propose optimized protocols for DNA extraction and library preparation for target enrichment purposes, as well as a custom pipeline to analyse the sequence data obtained. We used these protocols and customized pipeline to capture and analyse ultraconserved elements (UCEs) in minute wasps, the chalcids (Insecta: Hymenoptera: Chalcidoidea, Heraty et al., 2013;Noyes, 2018), that are key components of terrestrial ecosystems. Chalcids are key models for basic and applied research. With an estimated diversity of more than 500,000 species, these microhymenoptera have colonized almost all extant terrestrial habitats. Many of them develop as parasitoids of arthropod eggs, larvae or pupae. As such, they are both key regulators of the populations of many other arthropod species in natural ecosystems and are increasingly used worldwide as biocontrol agents (e.g., Consoli et al., 2010;Heraty, 2009).
Their small size, huge diversity and widespread morphological convergence make chalcid wasps difficult to identify to species by nonexpert taxonomists, which limits their use in biological control.
Attempts have been made to resolve the phylogeny of the whole superfamily (Heraty et al., 2013;Munro et al., 2011) or a few families (Burks, Heraty, Gebiola, & Hansson, 2011;Chen, Hui, Fu, & Huang, 2004;Cruaud et al., 2012;Desjardins, Regier, & Mitter, 2007;Janšta et al., 2018;Owen, George, Pinto, & Heraty, 2007), but none has succeeded. Indeed, the few markers that could be targeted with Sanger sequencing were not informative enough to solve deeper relationships. A study based on transcriptome data (3,239 single-copy genes) obtained from 37 species of chalcids and 11 outgroups also failed to solve relationships within the superfamily (Peters et al., 2018). As only a representative sampling in both markers and taxa will allow one to draw accurate conclusions on the history of this hyperdiverse group, target enrichment approaches appear relevant.
Here, we provide a detailed description of the optimized protocol for DNA extraction and library preparation, followed by a description of the phylogenetic trees obtained through target enrichment of UCEs from 96 species belonging to seven families and one subfamily of chalcids used for biological control (Aphelinidae, Azotidae, Encyrtidae, Eulophidae, Mymaridae, Pteromalidae: Eunotinae, Signiphoridae, Trichogrammatidae) as well as three outgroups in Mymarommatidae, the putative sister group to Chalcidoidea (Gibson, Read, & Huber, 2007;Heraty et al., 2013).

| Sampling
Samples were taken from the personal collections of the co-authors of this study or borrowed from the Queensland Museum (Australia) or the Australian National Insect Collection, Canberra. Details of the 99 samples included in the analysis are presented in Supporting Information Table S1. Most specimens sampled in the field were placed directly into ethanol for storage. On average, specimens spent 3.5 years in ethanol before being processed (maximum storage time in alcohol = 34 years). Two specimens were critical point dried 25 or 34 years ago. UCE data for three specimens were retrieved from a previous study (Branstetter, Danforth et al., 2017a): Euplectrus sp.

DNA was extracted using the Qiagen DNeasy Blood and Tissue
Kit. All extractions were conducted without destruction of the specimens' external (and certain internal) structures, with digestion and lysis of just the soft tissues. In this way, actual or potential type specimens are preserved. An often-essential feedback to the morphology is also preserved which is critical in this difficult group. Extractions were done from single specimens (sample codes in Supporting Information Table S1 ending with 01) or from a pool of several specimens (sample codes ending with 89).
The following modifications were made to manufacturer's protocol. Samples were incubated overnight in an Eppendorf thermomixer (temperature = 56°C, mixing frequency = 300 rpm). To increase DNA yield, two successive elutions (50 μl each) were performed with heated buffer AE (56°C) and an incubation step of 15 min followed by centrifugation (6000 g for 1 min at room temperature).
Eppendorf microtubes LoBind 1.5 ml were used for elution and to store DNA at −20°C until library preparation. DNA was quantified with a Qubit ® 2.0 Fluorometer (Invitrogen). The final version of the

DNA extraction protocol is available as an Supporting Information
Additional File S1. Vouchers were deposited at CBGP, Montferriersur-Lez, France, or returned to their owner.

| Whole genome amplification
DNA extracted from two specimens was subjected to ethanol precipitation and whole genome amplification (WGA) using the GenomiPhi™ V2 DNA Amplification Kit (GE Healthcare) as described in Cruaud et al. (2018). One microlitre of concentrated DNA was used as input (i.e., 4 ng or 0.4 ng, Supporting Information Table S1).

| Library preparation
Our starting point was the protocol described in http://ultraconserved.org and Faircloth et al. (2015). The final goal was to obtain a standardized protocol that could be implemented by one person and which, in 5 working days, would allow the manual preparation of libraries and the capture of UCEs from 96 samples in parallel.
Step-bystep optimizations were made specially to remove time-consuming purification steps, and different reagents were tested. The final version of the protocol is available as Supporting Information Additional File S2. Briefly, input DNA was sheared to a size of ca 400 bp using the Bioruptor ® Pico (Diagenode). End repair, 3′-end adenylylation, adapters ligation and PCR enrichment were then performed with the NEBNext Ultra II DNA Library Prep Kit for Illumina (NEB). We used barcoded adapters that contained amplification and Illumina sequencing primer sites, as well as a nucleotide barcode of 5 or 6 bp long for sample identification (Supporting Information Additional File S3). Pools of 16 samples were made at equimolar ratio. We enriched each pool using the 2,749 probes designed by Faircloth et al. (2015) using MyBaits kits (MYcroarray, Inc.). We followed manufacturer's protocol (MyBaits, user manual version 3, http://www.mycroarray. com/pdf/MYbaits-manual-v3.pdf). The hybridization reaction was run for 24 hr at 65°C. Postenrichment amplification was performed on beads with the KAPA HiFi HotStart ReadyMix. The enriched libraries were quantified with Qubit, an Agilent Bioanalyzer and qPCR with the Library Quantification Kit-Illumina/Universal from KAPA (KK4824). They were then pooled at equimolar ratio. Paired-end sequencing (2*300 bp) was performed on an Illumina MiSeq platform at UMR AGAP (Montpellier, France) to get longer flanking regions and, as a consequence, more information to differentiate closely related species.

| Raw data cleaning
The analytical workflow is summarized in Supporting Information Assembly of cleaned reads was performed using cap3 (-i 25 -o 25 -s 400) (Huang & Madan, 1999). The 2,749 probes designed by Faircloth et al. (2015) were assembled into nonoverlapping UCEs (hereafter called reference UCEs, n = 1,432, Supporting Information Additional File S5 and S6) using geneious 8.1.8 (Kearse et al., 2012), and contigs were aligned to this set of reference UCEs using lastz Release 1.02.00 (Harris, 2007). Contigs that aligned with more than one reference UCE and different contigs that aligned with the same reference UCE were filtered out using geneious 8.1.8.

| Data analysis
UCEs for which sequences were available for more than 25% of the taxa were kept in the next steps of the analysis. Alignments were performed with mafft version 7.245 (Katoh & Standley, 2013) (-linsi option). Ambiguously aligned blocks were removed using gblock _0.91b with relaxed constraints (−t = d −b2 = b1 −b3 = 10 −b4 = 2 −b5 = h) (Talavera & Castresana, 2007). The final data set was analysed using supermatrix approaches and coalescent-based summary methods. Two gene tree reconciliation approaches were used: astral-iii version 5.6.1 (Zhang, Rabiee, Sayyari, & Mirarab, 2018), which computes the phylogeny that agrees with the largest number of quartet trees induced by the set of input gene trees, and astrid (Vachaspati & Warnow, 2015), which takes a set of gene trees, computes a distance matrix (ca sum of number of edges in the path between two samples divided by the number of gene trees in which the two samples are represented) and infers a phylogeny from this distance matrix. Following recommendations for incomplete distance matrices, BioNJ was used to compute the phylogeny. Individual trees were inferred from each UCE using raxmlhpc-pthreads-avx (Stamatakis, 2014) (version 8.2.4; -f a -x 12345 -p 12345 -# 100 -m GTRGAMMA). ASTRAL and ASTRID analyses were performed with 100 multilocus bootstrapping (MLBS, site-only resampling [Seo, 2008]). Phylogenetic trees were estimated from the concatenate, unpartitioned data set using maximum-likelihood (ML) approaches as implemented in RAxML and iq-tree version 1.6.4 (Nguyen, Schmidt, Haeseler, & Minh, 2015). For the RAxML analysis, a rapid bootstrap search (100 replicates) followed by a thorough ML search (-m GTRGAMMA) was performed. For the iq-tree analysis, a ML search with the best-fit substitution model automatically selected was performed with branch supports assessed with ultrafast bootstrap (Minh, Nguyen, & Haeseler, 2013) and SH-aLRT test (Guindon et al., 2010) (1,000 replicates).
Summary statistics for all data sets (alignment length, number of samples, number of variable sites, number of parsimony informative sites, etc.) were calculated using amas (Borowiec, 2016). Tree annotation was performed with treegraph 2.13 (Stöver & Müller, 2010).
Linear correlation between the number of UCEs and the quan-

Optimizations made for DNA extraction are detailed in the Supporting
Information Additional File S1. The final version of the library preparation protocol is available as Supporting Information Additional  Specimens retrieved from a previous study (Branstetter, Danforth et al., 2017a)

| D ISCUSS I ON
To our knowledge, this study is the second after Sproul and Maddison (2017) to demonstrate success in library preparation from such low input using commercial kits, and the first to report successful sequencing of >1,000 low copy genes in 96 specimens in parallel, from such low input and processing time. Our optimizations differ from what was proposed by Sproul and Maddison (2017).
First, we tried to optimize DNA extraction itself by using overnight  lysis with gentle mixing to preserve fragile specimens, heated elution buffer and increased incubation time before elution. Second, instead of increasing the number of time-consuming purification steps, we decreased them. It is noteworthy that in this library, only two historical specimens were included. This may have masked challenges posed by adapter dimers (Burrell, Disotell, & Bergey, 2015;Sproul & Maddison, 2017;Tin, Economo, & Mikheyev, 2014) that led Sproul and Maddison (2017) to add a second bead clean-up prior to library amplification. However, we have already used this protocol on hundreds of chalcid and moth species, including historical specimens that were processed the same way as fresh ones, and we never had such an issue. Finally, instead of increasing the number of amplification cycles prior to target enrichment, we used a new generation mastermix including a hot start, processive and high-fidelity polymerase (NEBNext Ultra II Q5 Master Mix). Our protocol also allows one to back-up DNA at several steps that allows for multiple attempts without delay in case the first attempt fails. Finally, sequencing was performed on a MiSeq to get longer flanking regions and, as a consequence, more information to differentiate closely related species.
The protocol was successfully used on minute chalcid wasps widely used for biological control purposes. Up to 1,165 valid UCEs were captured from 25 ng DNA (median amount of DNA used for this study). No correlation was observed between the quantity of DNA used for library preparation and the number of captured UCEs.
The average number of captured UCEs validated by our quality control workflow was 687, and 685 valid UCEs were captured from a tiny aphelinid (1.8 ng of input DNA). The number of UCEs obtained per individual appears to drop within the basal clades (i.e., Mymaridae and Trichogrammatidae, Supporting Information Figure S2), a result probably linked to the relatively long branches observed in these groups, and that could reduce the efficiency of the probes that were designed from the genome of Nasonia (Faircloth et al., 2015). Trees were well resolved at the family level, with high statistical support, showing the potential of UCEs to solve long-standing taxonomic issues. However, the tree backbone remained unresolved, a pattern that confirms the rapid diversification of the group (Heraty et al., 2013;Peters et al., 2018). Understanding the evolutionary history of the group was not the purpose of this paper. Indeed, only a representative sampling in both markers and taxa as well as cutting-edge data analysis will allow drawing accurate conclusions. By providing suitable tools for a fast, easy and affordable acquisition of data, this paper is a first step.
Interestingly, as it has been shown previously on RADseq data , WGA does not seem to bias the results even when input DNA is below the recommendation on the manufacturer kit (here 0.4 and 4 ng when the GenomiPhi Kit V2 requires 10 ng).
It is noteworthy that, after completion of this paper and using the same protocol, we were able to capture ca 300 UCEs from native DNA with a concentration that falls below the detection limit of the Qubit, without WGA. Reducing the amount of DNA required for library preparation allows one to use extracted DNA for different approaches in parallel (RADseq, amplicon, Shotgun, etc.). This also allows one to send DNA back to museums from which specimens were borrowed, for archival purposes. When DNA quantities are too low, an extra WGA step can be performed. We definitely agree with Sproul and Maddison (2017) who emphasize how important it is not to waste DNA obtained from irreplaceable specimens (whether fresh or historical). It is even more important to capitalize on existing collections as collecting samples for large-scale studies may be more and more difficult, given that many countries have imposed restrictive access regulations, even to academic researchers, to reduce the risk of supposed biopiracy (Divakaran Prathapan, Pethiyagoda, Bawa, Raven, & Rajan, 2018).
All the elements discussed above indicate that this protocol may be of great help to reconstruct phylogenetic hypotheses in multiple groups of tiny arthropods, for example, springtails, sandflies, lice, whiteflies and mites. Coupled or not with WGA, some steps of our protocol may also contribute to simplify and hasten construction of other reduced-representation libraries (RADseq, ddRADSeq, etc).

ACK N OWLED G EM ENTS
We are grateful to Nicole Fisher (CSIRO, National Museum of Australia, Canberra), Chris Burwell, Christine Lambkin and Susan Wright (Queensland Museum, Brisbane, Australia) for facilitating access to samples and to the Genotoul bioinformatics platform Toulouse Midi-Pyrenees, France, for providing computing resources.
AC thanks Lisa Schlée, Philippe Cruaud and Maya for their patience in the field and Jean-Pierre Rossi for his help with R. AG thanks V.
Fursov (Schmalhausen Institute of Zoology, Ukraine) for his help with specimen identification. We thank two anonymous reviewers for their careful review of our manuscript. This work is part of a large NSF project (Award #1555808) led by John Heraty (UC Riverside USA) that attempts to solve the phylogeny of Chalcidoidea with NGS approaches, and was partly funded by the ANR project F I G U R E 2 RAxML tree obtained from the analysis of the 25%-complete data set. Black squares indicate node supported with RAxML BP = 100, IQ-TREE aLRT = 100/BP = 100 and ASTRAL/ASTRID BP >75. Grey square indicates node with RAxML BP = 100, IQ-TREE aLRT = 100/BP = 100 and ASTRAL BP >75. White squares indicate nodes with RAxML BP >95 and IQ-TREE aLRT >80/BP >95. The DNA quantity used to build the library as well as the number of UCEs analysed for each sample is given in brackets. Photographs ©J.-Y. Rasplus. Scale bars = 500 µm. IQ-TREE, ASTRAL and ASTRID trees are available in Supporting Information Figures  drafted the manuscript. All authors commented on the manuscript.

DATA ACCE SS I B I LIT Y
Demultiplexed cleaned reads merged with FLASH are available as a NCBI Sequence Read Archive (ID# PRJNA495844).