A high-quality genome assembly and annotation of the European earwig Forficula auricularia

The European earwig Forficula auricularia is an important model for studies of maternal care, sexual selection, sociality and host-parasite interactions. However, detailed genetic investigations of this species are hindered by a lack of genomic resources. Here we present a high-quality hybrid genome assembly for F. auricularia. The genome was assembled using nanopore long-reads and 10x chromium link-reads. The final assembly is 1.06Gb in length with 31.03% GC content. It consists of 919 scaffolds with an N50 of 12.55Mb. Half of the genome is present in only 20 scaffolds. Benchmarking Universal Single-Copy Orthologs scores are ∼90% from three sets of single-copy orthologs (eukaryotic, insect, and arthropod). The total repeat elements in the genome are 64.62%. The MAKER2 pipeline annotated 12,876 protein-coding genes and 21,031 mRNAs. The genome assembly, annotation, and associated resources will be of high value to a large and diverse group of researchers working on Dermapterans.


Introduction
Insects have been at the forefront of genetic research for various biological questions (Wilson-Sanders 2011; Mukherjee et al. 2015;Simons and Tibbetts 2019). However, most of the genetic studies are carried out on a small number of holometabolous insects that undergo true metamorphosis. In contrast to Holometabola, hemimetabolous insects undergo incomplete metamorphosis with a series of nymphal moults that increasingly resemble the adult form (Truman 2019). It is widely accepted that Holometabola branched out from hemimetabolous ancestors during the Permian 300 Mya (Labandeira and Phillips 1996;Yang 2001). Yet the conserved mode of development, embryonic organisation, and the adult body plan of hemimetabolous insects offer a unique model for the study of developmental and evolutionary mechanisms. However, even with the increasing number of sequenced genomes, the majority belong to the Holometabola (Ylla et al. 2021). This has been a bottleneck for the exploration of the diverse biology and life history of hemimetabolous insects. To address this paucity, we report a high-quality annotated genome of the European earwig, Forficula auricularia (Dermaptera: Forficulidae).
The European earwig Forficula auricularia is widely distributed. They are native to the western Eurasian region and were introduced to North America, Australia, and New Zealand, where they have quickly adapted and became abundant throughout the regions (Quarrell et al. 2018;Tourneur and Meunier 2019). Their propensity to dwell on flower and kitchen gardens can cause significant damage to crops, flowers, and commercial vegetables and make them important agricultural pests (Campos et al. 2011;Hill et al. 2019).
They have been of particular interest for many researchers not just because of their importance in the agricultural ecosystem (Binns et al. 2021) but also their importance as a research model for various biological and evolutionary phenomena like sexual selection, maternal care, family interactions, reproductive strategy, and social behaviour (Forslund 2000;Falk et al. 2014;Kramer et al. 2015;Van Meyel and Meunier 2020). They have been extensively studied by behavioural ecologists for the early evolution of group-living and family life (Falk et al. 2014).
The male earwigs also show an unusual bias in their use of lateral left and right sexual organs without any conspicuous anatomical differentiation (Kamimura 2006). Like the righthandedness in humans, 90% of males of giant earwig Labidura riparia show a preference for the right penis for copulation, providing insights into the evolutionary origin of lateralization (Kamimura et al. 2021). Similarly, they are an excellent lab model to study extended phenotypes as they exhibit strange suicidal water-seeking behaviour during the late stages of infection by mermithid nematodes (Herbison et al. 2019). However, their use as a genetic model has been severely limited by the lack of reference genome.
Here, we have sequenced, assembled, annotated, and analysed the genome of the European Earwig, Forficula auricularia. This genome will help researchers study multiple facets of this insect's exciting biology and evolutionary characters and broaden our understanding of insect and genome evolution.

Sample collection and preparation
Earwigs (Forficula auricularia) were collected from the Dunedin Botanic Garden (-45° 51' 27.59" S, 170° 31' 15.56" E) and reared in a temperature-controlled room (Temperature: cycling from 15 to 12°C, day/night; Photoperiod of L:D 16:8) in the Department of Zoology, University of Otago, Dunedin. Earwigs were snap-frozen in liquid nitrogen and stored at -80°C before dissection and subsequent nucleotide extraction. Earwigs were dissected in 1x PBS buffer under a dissection microscope to check for nematode parasites, and only non-parasitized individuals were used in this study. The head, wings and muscles from the thorax region were used for DNA extraction. Juvenile instars required for RNA extraction were obtained directly from the field.

DNA extractions
DNA was extracted using either the Nanobind Tissue Big DNA kit (Circulomics, USA) or DNeasy Blood & Tissue Kit (Qiagen, Germany) by following the manufacturer's protocol.
Tissues from a single individual were used for each extraction. After the extraction, RNase treatment was performed using 4µl of RNase A (10mg/ml) per 200µl of DNA elute. DNA was quantified in a Qubit 2.0 Fluorometer (Life Technologies, USA) and quality analysed using Nanodrop. Low quality DNA samples were further cleaned with 1.8x by volume AMPure XP beads (Beckman Coulter, USA), wherever applicable, following the manufacturer's instructions and eluted in 55µl of molecular grade water. High-quality DNA samples were stored at -20°C and were used within a week of extraction.

Linked read library preparation and sequencing
Linked read library was prepared at the Genetic Analysis Services (GAS), University of Otago (Dunedin, New Zealand). DNA from an adult male was extracted using the Nanobind kit and size-selected for fragments over 40kbp using Blue Pippin (Sage Science, USA). A chromium 10X linked read (10x Genomics, USA) library was prepared following the manufacturer's instructions. The library was sequenced on the Illumina Nova-seq platform to generate 2x151bp paired-end reads (Garvan Institute, Australia).

Long read library preparation and sequencing
Five long-read sequencing libraries for Oxford minion were prepared using a ligation sequencing kit (SQK-LSK109) (Oxford Nanopore Technologies, Oxford, UK) following the manufacturer's instructions. To increase the raw nanopore read N50 the first and the second libraries were prepared using 1.75 and 0.75µg of DNA extracted via a Circulomics kit from two adult male earwigs. Both libraries were sequenced in a single minion flow cell, flushing the flow cell to remove remains of the first library before loading the second library with a Flow cell wash kit (Oxford Nanopore Technologies, Oxford, UK).
To increase the total raw output the third and the fourth libraries were prepared with DNA from two adult female earwigs, both extracted with a Qiagen kit followed by the AMPure XP beads clean-up step. Input DNA for these two libraries were 2.6 and 3.2 µg. These were each sequenced on an individual minion flow cell. The fifth library was prepared using 3.0 µg of DNA from an adult male earwig. As before DNA was extracted using a Qiagen kit followed by AMPure XP beads clean-up. However, before library preparation, the DNA was sheared five times using a 26G X 0.5" needle (Terumo, Japan). All prepared libraries were sequenced with R9 chemistry MinION flow cell (FLO-MIN106) (Oxford Nanopore Technologies, UK).

RNA extraction and sequencing
Total RNA from the different developmental stages, sex, and tissues was extracted using a Direct-zol RNA MicroPrep kit (Zymo Research) with an on filter DNAse treatment following the manufacturer's instructions. Samples included: whole body (gut removed) of juvenile instars 1-2 and juvenile instars 3-4, dissected tissues (antennae, head, thorax, abdomen, legs, and gonads) of adult males and females. RNA from each individual and tissue type was extracted separately. RNA was quantified on a Qubit 2.0 Fluorometer (Life Technologies, USA) and initially quality checked using a nanodrop. Only high-quality extracts were further processed and were stored at -80°C until use.

RNA integrity was evaluated on a Fragment Analyzer (Advanced Analytical Technologies
Inc., USA) at the Otago Genomics Facility (OGF), University of Otago, Dunedin, New Zealand. As with most of the insect RNA extracts (Winnebeck et al. 2010) RNA quality number (RQN) values ranged from 2.5 to 10 due to the collapsing of the 28S peak; quality was thus determined via the trace rather than RQN. Four pools of samples at equimolar concentration underwent library preparation. Pools consisted of: 8 whole body extractions for juvenile instar 1-2, 8 whole body extractions for juvenile instar 3-4, individual body tissues from five adult males, and individual body tissues from five adult females. TruSeq stranded mRNA libraries were prepared and sequenced as 2x100bp paired-end reads across two lanes of HiSeq 2500 Rapid V2 flowcell at the OGF.

Genome size estimation
Flow cytometry and k-mer based approach with short-read data were used to estimate the genome size. Flow cytometry analysis was performed on a single head of earwig with two biological replicates at Flowjoanna (Palmerston North, NZ). Briefly, the earwig's head was dissociated with a pestle in 500µl of the stock solution containing 0.1% w/v trisodium citrate dihydrate, 0.1% v/v IGEPAL, 0.052% w/v spermine tetrahydrochloride, and 0.006% sigma 7-9 (all Sigma-Aldrich, USA). Rooster red blood cells (RRBC) from the domestic rooster stored in citrate buffer was used as a reference sample. Test samples were filtered through 35µl filter cap and further dissociated by adding 100µl of 0.21mg/ml trypsin followed by 75µl of 2.5mg/ml trypsin inhibitor (both Sigma-Aldrich) for 10 minutes at 37°C. Nuclei were stained using 100µl of pre-stain (containing 416mg/ml propidium iodide (PI) with 500mg/ml RNAse in-stock solution). Two sample tubes, one together with prepared RRBC and one without, were then measured on a FACSCalibur (BD Biosciences, USA) equipped with a 488nm laser to produce fluorescence collected using the FL-2-Area signal (585/42 BP), along with forward scatter (FSC) and side scatter (SSC) signals which were included for resolving RRBC nuclei from the earwig's nuclei. Data were analysed using Flowjo (BD Biosciences, USA) and the pg/nuclei of the sample was calculated.

Bioinformatic pipeline
All the scripts used for genome assembly, denovo repeat library construction, and annotation are available on GitHub (https://github.com/upendrabhattarai/Earwig_Genome_Project). The bioinformatics software and packages were run in New Zealand eScience Infrastructure (NeSI).
Below is a description of the pipeline (Figure 1).

Genome assembly
Paired-end Illumina reads from the chromium library were assembled using Supernova (v.2.1.1) (Weisenfeld et al. 2017). Assembly quality was assessed using Quast (Gurevich et al. 2013) and BUSCO (Simão et al. 2015) metrics such as N50 values, contig/scaffold number, contiguity and ortholog completeness. Based on several trial assemblies, we down-sampled the total input to 660 million paired-end reads to produce an assembly with better completeness and contiguity. The assembled fasta sequence was obtained with "pseudohap" style of the supernova "mkoutput" function. Nanopore reads were basecalled using guppy (v.5.0.7) and assembled using Flye (v.2.7.1) (Kolmogorov et al. 2019) with default parameters. The assembly statistics (as mentioned above) for the Flye assembly were better than that for the Supernova assembly. Therefore, the supernova assembly was merged into the Flye assembly using Quickmerge (Chakraborty et al. 2016). The resulting assembly was processed with Purgehaplotigs (Roach et al. 2018 (Warren et al. 2015). mRNA-seq reads sequenced for genome annotation purposes, and total RNA-seq reads sequenced for another project (manuscript under preparation) were also used for scaffolding the assembly with the Rascaf (Song et al. 2016). Duplicated and redundant haplotigs were again removed using Purgehaplotigs, and discarded haplotigs were used for scaffolding the assembly using Ragtag again.
Blobtools2 (Laetsch and Blaxter 2017) was used to remove small (<100bp) and low coverage contigs (<5x coverage). Contigs that were filtered out were used for re-scaffolding the assembly with Ragtag. Finally, we used Pilon (v.1.24) to polish the assembly using mRNAseq data.

Repeat content analysis
To assist with annotation a custom repeat library was generated for the Earwig genome using different de novo repeat and homology-based identifiers, including LTRharvest (Ellinghaus et al. 2008), LTRdigest (Steinbiss et al. 2009), RepeatModeler (Flynn et al. 2020), TransposonPSI (Brian Haas 2010 and SINEBase (Vassetzky and Kramerov 2013). We concatenated the individual libraries, and sequences with more than 80% similarity were merged to remove redundancy using usearch (v.11.0.667) (Edgar 2010). It was then classified using RepeatClassifier. Sequences with unknown categories in the library were mapped against the UniProtKB/Swiss-Prot database (e-value <1e-01); if sequences were not annotated as repeat sequences they were removed from the library. The final repeat library was used in RepeatMasker (v.4.1.2) (Chen 2004) to generate a report for genome repeat content and provided to the Maker2 pipeline to mask the genome.

Genome annotation
Genome annotation was carried out with three iterations of the MAKER2 (v.2.31.9) (Holt and Yandell 2011) pipeline combining evidence-based and ab initio gene models. The first round of Maker used evidence-based models and the other two rounds were run using ab initio gene models. For the first round, we provided the MAKER2 pipeline with 180,119 mRNA transcripts denovo assembled via the Trinity pipeline (Grabherr et al. 2011) along with 26,414 mRNA and 1,529 protein sequences of dermapterans from NCBI and 779 dermapteran protein sequences from the Uniprot database.
Augustus was trained using Braker (v.2.16) (Hoff et al. 2019) and SNAP was trained after each round of MAKER to use for ab initio gene model prediction. For the functional annotation, we ran InterProScan (v.5.51-85.0) (Jones et al. 2014) for the predicted protein sequences obtained from MAKER and retrieved InterPro ID, PFAM domains, and Gene Ontology (GO) terms.
Furthermore, we ran BLAST with the Uniprot database to assign gene descriptors to each transcript based on the best BLAST hit.

Genome size estimates
The flow cytometer estimated the genome size of 968.22±20.747 Mb (Mean±SD) for the earwig genome. Similarly, the kmer based approach using adapter removed paired-end data from linked read sequencing estimated the male earwig to be 988 Mb. Whereas, an earlier estimation of a unknown dermapteran (earwig) species genome size was 1.4 Gb (Gregory 2005) showing a variable genome size within the order.

Genome assembly
A total of 799.6 million paired-end reads was generated for the linked-read library.
Downsampled to 660 million paired-end reads, Supernova estimated the genome size of 1.22 Gb, raw coverage of 82.02%, effective coverage of 39.50% and weighted mean molecule size of 22.45 kb. The Supernova assembly was 1.15 Gb in size and had 145,055 contigs, with an N50 of 0.03 Mb and L50 of 7,500. Quast reported a complete Busco of 64.69% and a partial Busco of 9.24% from the eukaryotic database.
The nanopore sequencing yielded approximately 10.7 Gb of data, consisting of over 3 million reads. The median read length was 897 bp with an N50 length of 11,986 bp (Supplementary Table 1). The median read Phred quality was 13.34. Primary assembly from Flye produced an assembly of 1.1 Gb. There were 187,66 contigs with N50 of 0.18 Mb and L50 of 1,832. Quast reported a complete Busco of 82.18% and a partial Busco of 9.24%. The long-read assembly showed better contiguity and completeness, so we merged the linked-read assembly with the long-read assembly ( Table 1).
The final hybrid assembly has a size of 1.06Gb. It has 919 scaffolds with an N50 of 12.55Mb, which shows that the assembly is highly contiguous. Half of the genome is present in just 20 scaffolds, as denoted by the L50 number (Table 1). Assembly has 846.85 "N's" per 100kbp.
The Busco score from the insect database (n=1,367) for the assembly was 87.1% complete, among which 4.1% were duplicated, and 3.1% fragmented Busco (Figure 2). Improvement in assembly statistics after each processing step is given in Supplementary Table 2.
The only other whole-genome sequence publicly available from the Dermaptera order is of the other earwig Anisolabis maritima. Using the insect database (n=1,367) it has 83.4% complete and 10.8% fragmented Busco scores. The A. maritima assembly has an N50 of 1.4Mb with a genome size of 649.7 Mb (Mei et al. 2022). So, in comparison, the F. auricularia genome assembly has a better gene model and contiguity from the Dermaptera order.

Genome repeat contents
Repeat analysis of the assembly showed that interspersed repeats comprised ~686 Mb  (Table 2). Unusually large and variable genome sizes characterise Hemimetabolans (Wu et al. 2017). Comparative analysis in six species of Gomphocerine grasshoppers showed a strong positive correlation between repeat content and genome size. Genome size ranged from 8.2 to 13.7 Gb in these six species with a repeat content ranging routinely from 79% to 87%, with the exception of Stauroderus scalaris whose genome is 96% repetitive DNA and the second-largest insect genome documented. Our estimation of genome size for F. auricularia doesn't show gigantism (~968.22Mb, flow cytometer estimate). However, its repeatome (64.62%), is almost twice that of other hemimetabolous insects like Gryllus bimaculatus (33.69%) and Laupala kohalensis (35.51%) (Ylla et al. 2021). This fold increase in the repeatome is surprising given both G. bimaculatus and L. kohalensis have bigger genomes (~1.6 Gb) than F. auricularia.

Genome annotation
Combining evidence-based and ab initio gene models in the MAKER2 pipeline, we identified 12,876 genes and 21,031 mRNAs in the genome assembly. The mean gene length is 12,096 bp and the total gene length is 155.75 Mb, which makes 14.7% of the whole assembly. The longest gene annotated is 412,198 bp and the longest CDS is 19,035 bp (Table 3). 61.35% of total predicted mRNAs and 59.53% of predicted proteins are also functionally annotated through either one or more of InterPro, Gene ontology and Pfam databases (Supplementary Table 3).
Comparing with the Arthropoda database we got 74.4% complete Busco score for annotated Transcriptome and 70.9% complete Busco score for annotated Proteins (Figure 3). 98.3% of the gene models have AED score of 0.5 or less, assuring highly confident gene prediction (Supplementary Figure 1).
The GC content of the F. auricularia genome is 31.03%, markedly greater than that the 19.3% GC observed in the genome of the earwig A. maritima (Mei et al. 2022) There was significant different for each pairwise comparison using Anova followed by Tukey HSD with P < 0.01 for Intergenic vs 10kb window and P < 0.0001 for all others.
In conclusion, the assembled and annotated F. auricularia genome presented here is a key resource to develop this important insect species as a genetic model. We anticipate this will enhance the genetic study on various aspects of its biology, including developmental biology, sociality, and evolutionary characteristics.

Data Availability Statement
The genome assembly and annotation of Forficula auricularia are available through FigShare https://doi.org/10.6084/m9.figshare.19092044. The raw sequencing reads are deposited in NCBI with accession number PRJNA800435.

Code availability
The scripts used for genome assembly, repeat library preparation and masking, and genome annotation, are available at GitHub (https://github.com/upendrabhattarai/Earwig_genome_project) under GNU GPLv3 license.  Complete and single-copy orthologs, dark blue represents complete and duplicated orthologs, yellow represents fragmented Busco genes and red represents missing Busco genes. percentage (x-axis) for the annotated Proteins and Transcriptomes using Arthropoda_odb10 and Insecta_odb10 database as indicated in the y-axis.