Chromosome-level genome assembly and annotation of two lineages of the ant Cataglyphis hispanica: steppingstones towards genomic studies of hybridogenesis and thermal adaptation in desert ants

Hugo Darras; Natalia de Souza Araujo; Lyam Baudry; Nadège Guiglielmoni; Pedro Lorite; Martial Marbouty; Fernando Rodriguez; Irina Arkhipova; Romain Koszul; Jean-François Flot; Serge Aron

doi:10.1101/2022.01.07.475286

ABSTRACT

Cataglyphis are thermophilic ants that forage during the day when temperatures are highest and sometimes close to their critical thermal limit. Several Cataglyphis species have evolved unusual reproductive systems such as facultative queen parthenogenesis or social hybridogenesis, which have not yet been investigated in detail at the molecular level. We generated high-quality genome assemblies for two hybridogenetic lineages of the Iberian ant Cataglyphis hispanica using long-read Nanopore sequencing and exploited chromosome conformation capture (3C) sequencing to assemble contigs into 26 and 27 chromosomes, respectively. Males of one lineage were karyotyped to confirm the number of chromosomes inferred from 3C data. We obtained transcriptomic data to assist gene annotation and built custom repeat libraries for each of the two assemblies. Comparative analyses with 19 other published ant genomes were also conducted. These new genomic resources pave the way for exploring the genetic mechanisms underlying the remarkable thermal adaptation and the molecular mechanisms associated with transitions between different genetic systems characteristic of the ant genus Cataglyphis.

BACKGROUND

Ants of the genus Cataglyphis inhabit arid regions throughout the Old World, including inhospitable deserts such as the Sahara (Lenoir et al. 1990; Boulay et al. 2017). Their foraging activities are strictly diurnal, with most species being active during the hottest hours of the day (Wehner et al., 1992; Cerda et al., 1998). Some Cataglyphis species even forage at temperatures close to their critical thermal limits (Cerda et al., 1998). For instance, workers of the silver ant Cataglyphis bombycina have been observed to forage when ground temperatures exceed 60°C (Wehner et al., 1992), which provides a competitive advantage against lizard predators who avoid such harsh conditions. The high thermal tolerance seen in Cataglyphis species relies on a range of behavioral, morphological, physiological and molecular adaptations, such as exploitation of thermal refuges, elongated legs, high speed of movement and intense recruitment of heat-shock chaperone proteins (Gehring & Wehner 1995; Perez & Aron 2020; Pfeffer et al. 2019; Sommer & Wehner 2012; Willot et al. 2017; Perez et al. 2021; Aron & Wehner 2021).

In addition to their impressive heat tolerance, Cataglyphis ants are prominent social insect models because of their amazing diversity of reproductive traits: the number of queens per colony, the mating frequencies, the dispersal strategies and the modes of production of different castes all vary greatly among species (Aron, Mardulyn, and Leniaud 2016; Peeters and Aron 2017; Aron et al. 2016). Unusual reproductive systems relying on conditional use of sex for the production of different female castes have evolved repeatedly in different Cataglyphis groups. Under these systems, non-reproductive workers are sexually produced, while reproductive queens are asexually produced by thelytokous parthenogenesis – a strategy that allows queens to increase the transmission rate of their genes to their reproductive female offspring while maintaining genetic diversity in the worker force (Pearcy et al. 2004; Kuhn et al. 2020). Males arise from arrhenotokous parthenogenesis, as is usually the case in Hymenoptera. In several species, the conditional use of sex evolved into a unique reproductive system, named clonal social hybridogenesis, whereby male and female sexuals are produced by parthenogenesis while workers are produced exclusively from interbreeding between two sympatric, yet non-recombining genetic lineages (Leniaud et al. 2012; Eyer et al. 2013; Kuhn et al. 2020).

The unique characteristics of Cataglyphis make this ant genus an interesting model to investigate the genetic mechanisms underlying thermal adaptation and the evolution of alternative reproductive strategies. To date, only one incomplete assembly of the genome of one species (Cataglyphis niger) is available for genomic analyses (Yahav & Privman, 2019). To fill this gap, we combined Oxford Nanopore long reads, Illumina short reads and chromosome conformation capture (3C) sequencing (Lieberman-Aiden et al. 2009; Marie-Nelly et al. 2014; Flot, Marie-Nelly, and Koszul 2015) to generate high-quality chromosome-scale genome assemblies of two lineages of the Iberian ant Cataglyphis hispanica (Figure 1). We also annotated and compared the repeats and gene sets of this species with those of other ant genera.

Figure 1.

The ant Cataglyphis hispanica. (A) A queen of C. hispanica (red arrow) surrounded by workers. (B) Sampled sites in southwest Spain (red: Caceres, yellow: Merida and blue: Bonares). The complete range of the species C. hispanica is shown in grey. WGS: whole genome sequencing.

RESULTS AND DISCUSSION

Genome assembly

Cataglyphis hispanica inhabits the most arid habitats of the Iberian Peninsula. Two sympatric hybridogenetic lineages (Chis1 and Chis2) co-occur as a complementary pair across the distribution range of the species (Leniaud et al., 2012; Darras et al., 2014). Queens of each lineage mate with males from the other lineage and produce non-reproductive workers by sexual reproduction. By contrast, male and female reproductive individuals are produced clonally through arrhenotokous and thelytokous parthenogenesis, respectively. As a result, all workers in the colonies are inter-lineage hybrids, but the two reproductive lineages remain genetically divergent.

For the Chis1 and Chis2 lineages, respectively, we generated 5.7 and 5.1 Gbp of Nanopore reads from a pool of sister clonal queens (for de novo long-read assemblies); 32.2 and 34.2 Gbp of PE 2 × 100 bp Illumina reads with insert sizes ranging from 170 bp to 800 bp from a single male (for short-read error correction/polishing); and 8.7 and 7.0 Gbp of 3C-seq PE 2 × 66 bp (after demultiplexing) Illumina reads from a single queen (for scaffolding). The long-read assembler Flye (Kolmogorov et al., 2019) generated assemblies consisting of several hundred contigs (439 and 929, respectively). The contigs were scaffolded using the 3C data (Marie-Nelly et al., 2014; Baudry et al., 2020): 99.7% of the Chis1 assembly was scaffolded into 26 chromosome-scale (> 2.4 Mb in length) scaffolds (Figure 2A), while 98.8% of the Chis2 assembly was scaffolded into 27 chromosome-scale scaffolds (Figure 2B). These chromosome-scale scaffolds were numbered by decreasing size. The remaining 0.3-1.2% of unscaffolded sequences were all relatively small (<40 kb for Chis1, <120 kb for Chis2). The overall sizes of the two scaffolded assemblies were 203 Mb and 209 Mb, respectively. Assembly completeness, as estimated by BUSCO scores (Manni et al., 2021), was very high: among the 5,991 highly conserved single-copy genes of the Hymenoptera odb10 database, 96.8% (Chis1) and 96.1% (Chis2) were complete. In addition, only 0.4-0.5% of the BUSCO genes appeared duplicated in the two assemblies, suggesting that our assemblies contain few uncollapsed haplotypes, if any. In line with these results, KAT analyses based on the Illumina reads of each lineage showed a single peak of k-mer multiplicity, and almost all of these k-mers were represented exactly once in the assemblies, as expected for high-quality genomes (Figure S1); k-mer completeness was estimated as 98.86% for Chis1 and 98.45% for Chis2 (Mapleson et al., 2016).

For each assembly, a region with no large-scale synteny pattern was assembled at the extremity of one scaffold (the first 5.4 Mb of scaffold #9 of Chis1 and the first 3.1 Mb of scaffold #7 of Chis2). Each of these regions consisted of a collection of small contigs (mostly in the 2-10 kb range) with 2 to 5 times higher average coverage than other genomic regions. These problematic sequences exhibited microsyntenies with the extremities of other large scaffolds (Figure 2; Figure S2), suggesting that they correspond to repeat sequences that were improperly assembled into fragmented contigs.

Figure 2.

Assembly of the Cataglyphis hispanica Chis1 (A) and Chis2 (B) genomes into chromosomes. Hi-C interaction map revealing the presence of 26 and 27 linkage groups. The color scale represents the interaction frequencies. The positions of the rearranged chromosome are indicated, and the arrows show the assembly artefact found in each genome (see main text). The longest chromosome of Chis1 is split in two chromosomes in Chis2 (scaffolds 5 and 9, shown with red and blue colors).

Comparison of the Chis1 and Chis2 assemblies revealed that 25 of the chromosome-scale scaffolds had a one-to-one homolog in each of the two lineages. By contrast, the largest scaffold of Chis1 (#1) was split into two chromosome-scale scaffolds (#5 and #10) in the Chis2 assembly (Figure S2). The 3C contact maps of both lineages showed that these scaffolds (Chis2 #5, #10 and Chis1 #1) correspond to well-individualized 3D features, thereby ruling out a scaffolding error (Figure 2). These observations indicate that a centric (Robertsonian) chromosome translocation took place in one of the two lineages studied. Centric fusions and fissions are the main mechanisms of karyotype evolution in many animal groups, including ants (Lorite & Palomeque, 2010). Robertsonian translocations have been reported to promote speciation through the suppression of genetic recombination in the vicinity of rearranged centromeric regions or the reduction of fertility in karyotypic hybrids (Davisson & Akeson, 1993; Faria & Navarro, 2010). The significance of this genomic rearrangement for the origin of social hybridogenesis in Cataglyphis deserves further investigation. A difference in chromosome number between lineages could potentially reduce the fertility of inter-lineage hybrids and may contribute to the long-term maintenance of divergent lineages (Schwander, Keller, and Cahan 2007). Notably, two species closely related to C. hispanica also reproduce by social hybridogenesis involving two interdependent lineages (Eyer et al., 2013). We have previously speculated that this recurrent genetic system may have evolved prior to the speciation of these three species. Under this scenario, social hybridogenesis would be controlled by an ancient biallelic non-recombining region maintained by chromosomal rearrangement(s), with two haplotypes accounting for the occurrence of two genetic lineages in each species (Darras et al., 2014).

Intrachromosomal rearrangements between the lineages, consisting of large translocations and inversions, were also observed for 6 of the 25 large orthologous scaffolds (Figure S2), but these could not be confirmed independently with the current data.

Karyotyping

To validate the number of chromosomes scaffolded, we inspected metaphase chromosome plates from haploid males. In ants, as in other Hymenoptera, females are diploid (2n) whereas males are haploid (n). Two males of the Chis2 lineage from two distant localities were analyzed (Figure S3A). Both male karyotypes carried 27 chromosomes, although the precise morphology of the chromosomes could not be determined due to their small size (Figure 3). This chromosome number is within the range reported for Cataglyphis bicolor, Cataglyphis iberica and Cataglyphis setipes and for other formicine species of the genera Formica, Iberoformica and Polyergus (n = 26-27) (Hauschteck-Jungen & Jungen, 1983; Imai et al., 1984; Lorite & Palomeque, 2010). These results confirm the 3C scaffolding of the Chis2 genome into 27 chromosomes. No male of the Chis1 lineage could be obtained for karyotyping.

Figure 3.

Chromosomes of Cataglyphis hispanica (Chis2). (A) Metaphase chromosome plate and (B) karyotype of a male from Merida (Spain), showing that the haploid chromosome number in this lineage is n=27.

Gene annotation

We annotated the genome of the Chis2 lineage. Ab initio gene prediction using AUGUSTUS and homology-based predictions using GenomeThreader (Gremme et al., 2005) identified 16,993 and 8,234 gene models, respectively. A total of 40,969 models (including isoforms) were also predicted by the PASA/Transdecoder (Haas et al., 2003) pipeline using direct evidence from 13 Gbp of Illumina RNA-seq data. The three annotation sets were validated and combined into a single annotation of 16,146 non-overlapping models using EvidenceModeler (Haas et al., 2008). Among these, 11,101 gene models showed significant similarity to at least one RefSeq ant protein and 10,543 had a functional eggNOG hit (Huerta-Cepas et al. 2017, 2019). After filtering out models with no protein similarity and no known functional information, we obtained a final set of 11,290 high-quality gene models. This gene set is comparable in size to those annotated by the NCBI Eukaryotic Genome Annotation Pipeline for other ant genomes (range: 10,491-15,668; N = 18 different RefSeq ant genera; Table S1). We compared the gene set of C. hispanica (Chis2) with 19 published ant annotations. Out of the 258,587 protein-coding genes analyzed using OrthoFinder (Emms & Kelly, 2019), 96.82% (250,353) were placed in 13,698 orthogroups. Of these, 1,407 were species-specific and 6,199 were found in all species, including 3,365 single-copy genes. The orthogroup profile of C. hispanica was overall comparable to that of other ants (Figure 4). However, our annotation had one of the smallest numbers of genes placed in orthogroups (10,918), and one of the largest proportions of unassigned genes (3.3%).
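The percentages reported for the ortholog analysis can be checked from the counts given in the text. The short sketch below is only an arithmetic check; it assumes (this is not stated explicitly in the text) that the 3.3% of unassigned C. hispanica genes is computed relative to the final set of 11,290 gene models. All variable names are illustrative.

```python
# Counts taken from the text above.
TOTAL_GENES = 258_587            # protein-coding genes analyzed with OrthoFinder
GENES_IN_ORTHOGROUPS = 250_353   # genes placed in orthogroups (all species)
CHIS2_GENES = 11_290             # final C. hispanica (Chis2) gene set
CHIS2_IN_ORTHOGROUPS = 10_918    # Chis2 genes placed in orthogroups


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of all analyzed genes that were placed in an orthogroup.
assigned_pct = percent(GENES_IN_ORTHOGROUPS, TOTAL_GENES)

# Assumption: unassigned Chis2 genes = final gene set minus genes in orthogroups.
chis2_unassigned = CHIS2_GENES - CHIS2_IN_ORTHOGROUPS
chis2_unassigned_pct = percent(chis2_unassigned, CHIS2_GENES)

print(f"Genes placed in orthogroups: {assigned_pct:.2f}%")
print(f"Unassigned Chis2 genes: {chis2_unassigned} ({chis2_unassigned_pct:.1f}%)")
```

Under this assumption, both values agree with the reported 96.82% and 3.3% after rounding.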

Figure 4.

Summary values from the ortholog analyses. The color intensity indicates the z-score of variation (deviation from the mean) among all species, from the smallest value (blue) to the highest value (orange). The Lasius niger assembly was removed from this comparison due to its low quality.

Repeat annotation

We built custom repeat libraries for each of the two assemblies of C. hispanica and for 19 published ant genomes (Table S1). The Chis1 and Chis2 assemblies contained 1,708 and 1,673 different repetitive elements, which accounted for 15.43% (31,851,170 bp) and 15.1% (31,512,815 bp) of their assembly sizes, respectively (Figure 5). A large proportion of these corresponded to unclassified interspersed repeats (6.7% / 6.78% of the genomes; Figure S4). The two genomes also contained 2.0% / 1.8% of Class I (retroelements), and 2.18% / 1.85% of Class II elements (DNA transposons). In total, 56 different families of repetitive elements were annotated in C. hispanica. LTR/Gypsy were the most frequent transposable elements of Class I in the genomes (0.53% / 0.82%), while large Polintons / Mavericks were the most abundant Class II transposable elements (0.98% / 0.67%).

Figure 5.

Summary of the categories of repetitive elements annotated in 20 different ant species using our custom pipeline. The ratios of the major categories of repetitive elements identified in each species are shown on the left. The total proportion of repetitive elements found in each genome is shown on the right.

Across published ant assemblies, the total proportion of transposable elements appeared quite variable (range: 17.27-48.47%; N = 19 ant species; Figure 5; Table S2). The C. hispanica assemblies had a smaller proportion of repetitive elements (15.1-15.43%) than any of these assemblies. This difference in repeat content was slightly smaller when compared to the assembly of Formica exsecta (18.53%), the closest species available for comparison. The relatively low proportion of transposable elements observed in the genomes of C. hispanica may be due to the fact that they were assembled primarily from noisy Nanopore long reads, possibly leading to a collapse of repeated regions. Alternatively, C. hispanica may resist the invasion and proliferation of transposable elements more efficiently than other species. Whether its unusual reproductive system, combining both diploid and haploid parthenogenesis for queen and male production, could help keep transposable elements at bay deserves further exploration.

Lineage divergence estimation

To estimate divergence between the two genomes sequenced, we analyzed polymorphism at four-fold-degenerate sites, which are expected to be neutrally evolving (every mutation at a four-fold site is synonymous). Our annotation of the Chis2 genome contained 2,620,448 four-fold-degenerate sites. Among these, 13,048 had a different allele in the Chis1 and Chis2 males used to obtain haploid genome consensus sequences. Assuming no recombination and a typical insect mutation rate of approximately 3 × 10⁻⁹ mutations per neutral site per haploid genome per generation (Keightley et al., 2014, 2015; Yang et al., 2015; Liu et al., 2017; Oppold & Pfenninger, 2017), this proportion of mutated four-fold-degenerate sites translated into an average divergence time of about 830,000 generations between the alleles of the two males sequenced (Obbard et al., 2012). Hence, the two genomes sequenced may have diverged almost 1 million years ago (assuming one generation per year), a divergence time similar to that observed between closely related species of fire ants (Cohen & Privman, 2019). The origin of the hybridogenetic lineages themselves could be much younger if these emerged in two divergent populations or share ancestral polymorphism (Darras et al. 2019).
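The divergence estimate can be reproduced with a few lines of arithmetic. The sketch below assumes the standard neutral relation T = d / (2μ), where d is the proportion of differing four-fold-degenerate sites and μ the per-generation mutation rate; the factor of two reflects mutations accumulating independently along both lineages. The text does not spell out this formula, but it yields the reported value of about 830,000 generations. Function and variable names are illustrative.

```python
# Values taken from the text above.
FOURFOLD_SITES = 2_620_448   # four-fold-degenerate sites in the Chis2 annotation
DIFFERING_SITES = 13_048     # sites with different alleles in the Chis1 and Chis2 males
MUTATION_RATE = 3e-9         # mutations per neutral site per haploid genome per generation


def divergence_generations(differing: int, total: int, mu: float) -> float:
    """Estimate divergence time in generations as d / (2 * mu).

    Assumes neutrality, no recombination, and no multiple hits per site.
    """
    d = differing / total    # proportion of neutral sites that differ
    return d / (2.0 * mu)    # both lineages accumulate mutations


generations = divergence_generations(DIFFERING_SITES, FOURFOLD_SITES, MUTATION_RATE)
print(f"Neutral divergence d = {DIFFERING_SITES / FOURFOLD_SITES:.5f}")
print(f"Estimated divergence time: {generations:,.0f} generations")
```

With one generation per year, this corresponds to a little under one million years, in line with the estimate given above.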

CONCLUSIONS

We generated high-quality chromosome-level genome assemblies of the two lineages of the hybridogenetic ant C. hispanica, a representative species of the thermophilic ant genus Cataglyphis. Using chromosome conformation capture, we identified a Robertsonian translocation between the two lineages, resulting in 26 and 27 chromosomes in the Chis1 and Chis2 lineages, respectively. Whether this rearrangement played a role in the origin and maintenance of social hybridogenesis in C. hispanica and across other Cataglyphis species deserves further investigation (Darras et al., 2014).

METHODS

Biological samples

Permits were obtained to collect colonies of Cataglyphis hispanica in three Spanish locations (Bonares, Caceres and Merida; Figure 1B). Male samples from Bonares were used for Illumina DNA sequencing. Shortly after sampling, the Bonares population was wiped out by human activities. Consequently, samples from another locality (Caceres) were used for subsequent Nanopore sequencing, 3C-seq and RNA-seq. Male pupae from two distant localities (Bonares and Merida) were used for karyotyping. Twelve diagnostic microsatellite loci were genotyped to assess the lineage membership of each queen and male prior to sequencing and karyotyping (Darras et al., 2014).

DNA and RNA-Sequencing

Genomic resources were generated for both the Chis1 and the Chis2 lineages. High-molecular-weight DNA was extracted from pure-lineage queen and male individuals using QIAGEN Genomic-tips. For each lineage, two queen clones originating from the same nest were used for Nanopore sequencing. Queens of C. hispanica are produced through automictic parthenogenesis with central fusion, which results in diploid individuals that are highly homozygous (Pearcy et al., 2011; Darras et al., 2014) and thus suitable for genome assembly. Nanopore libraries were prepared using rapid sequencing kits (SQK-RAD001 and SQK-RAD004). The resulting long-read libraries were sequenced on MIN106 flow cells and basecalled using Albacore v2.1.10. For each lineage, three Illumina libraries were generated from whole-genome amplified DNA extracted from a single male with mean insert sizes of 170 bp, 500 bp and 800 bp, and sequenced with a HiSeq2000 (paired-end 2 × 100 bp mode).

3C-seq libraries were prepared according to the protocol described by Marie-Nelly et al. (2014). Briefly, queens from both lineages had their gut removed and were immediately suspended in 30 mL of formaldehyde solution (Sigma Aldrich; 3% final concentration in 1X tris-EDTA buffer). After one hour of incubation, the remaining formaldehyde was quenched by adding 10 mL of glycine (0.25 M final concentration) to the mix for 20 min. The cross-linked tissues were pelleted and stored at −80°C until further use. The 3C-seq libraries were prepared using the DpnII enzyme and sequenced using an Illumina NextSeq 500 apparatus (paired-end 2 × 75 bp; first ten bases corresponding to custom-made tags). 3C-seq libraries are similar to Hi-C libraries except that they contain a higher percentage of paired-end reads due to the lack of an enrichment step (Flot, Marie-Nelly, and Koszul 2015).

To help annotate the genomes, three normalized RNA-seq cDNA Illumina libraries were obtained: one from an adult Chis1 queen, one from a Chis2 queen and one from a pool of brood of different stages and adult workers originating from colonies of the two lineages (HiSeq2000, paired-end 2 × 100 bp mode).

Genome assembly

The genome of each hybridogenetic lineage was assembled independently. Nanopore data were assembled using Flye v2.7 with four iterations of polishing using long reads (Kolmogorov et al., 2019). Raw Illumina reads were trimmed for quality and adapters were removed using Trimmomatic v0.32 with options ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 (Bolger et al., 2014). The trimmed reads were then aligned to the long-read assemblies using BWA-MEM v0.7.15 (Li & Durbin, 2009). SNPs and indels with at least three supporting observations were called using freebayes v1.2 (Garrison & Marth, 2012), and error-corrected consensus sequences were obtained using BCFtools v1.4 (Li et al., 2009).

To obtain chromosome-scale assemblies, we scaffolded the polished contigs with the 3C reads using instaGRAAL, an MCMC, proximity-ligation-based scaffolder (Marie-Nelly et al., 2014; Baudry et al., 2020). The 3C reads were trimmed using cutadapt (Martin, 2011) and subsequently processed using hicstuff (https://zenodo.org/record/4722873) with the following parameters: --aligner bowtie2 --iterative --enzyme DpnII. The instaGRAAL scaffolder was run on the pre-processed data for 100 cycles (options: --coverage-std 1 --level 4 --cycles 100) (Baudry et al., 2020) and final scaffolds were obtained using the instaGRAAL-polish script, with all corrective procedures applied at once (only one parameter: -m polish). Briefly, instaGRAAL explores the chromosome structures by testing the relative positions and/or orientations of DNA segments (or bins) according to the contacts expected given a simple three-parameter power-law model. These modifications take the form of a fixed set of operations (swapping, flipping, inserting, merging, etc.) applied to bins, each corresponding to 3⁴ = 81 DpnII restriction fragments. The likelihood of the model is then maximized by sampling the parameters using an MCMC approach (Marie-Nelly et al., 2014). After 100 iterations (i.e., a likely position for each bin is tested 100 times), the genome structure converges towards a relatively stable structure that no longer changes when more iterations are added, resulting in chromosome-level scaffolds. The algorithm is probabilistic and initially ignores part of the intrinsic structure of the original contigs in order to sample a larger range of genome space (Baudry et al., 2020). Therefore, some trustworthy information contained in the initial polished assembly can be lost, or modified, along the way. The final correction step of instaGRAAL consists of reintegrating this lost information into the final assembly, for instance to correct small, untrustworthy local inversions of individual bins within a contig.

The contact maps of the scaffolded assemblies were built using hicstuff. Gaps created during the scaffolding process were closed using Nanopore data with four iterations of TGS-GapCloser (Xu et al., 2019) and new polished consensus sequences were obtained using BCFtools (see method above). Completeness of the assemblies was assessed at each step using BUSCO v5.2.2 with the Hymenoptera odb10 lineage (Simão et al., 2015; Waterhouse et al., 2017). We also ran KAT v2.4.1 to compare the k-mer frequencies of Illumina reads to final assemblies (Mapleson et al., 2016). To investigate differences in chromosomal arrangement among lineages, the two genome assemblies were aligned with minimap2 v2.17 (exact preset: -x asm5) and alignments were visualized using dot plots obtained with D-GENIES (Cabanettes & Klopp, 2018).
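The principle behind contact-based scaffolding, namely that a candidate arrangement of bins is scored by how well it explains the observed contacts under a distance-decay model, can be illustrated with a toy example. The sketch below is not instaGRAAL's model or its MCMC sampler: it assumes a simplified two-parameter power law (hypothetical `scale` and `exponent` values), a Poisson likelihood, and five made-up bins, and it only compares the score of the true order with that of a rearranged order.

```python
import math
from itertools import combinations


def expected_contacts(distance: int, scale: float = 100.0, exponent: float = 1.0) -> float:
    """Toy power-law decay of contact frequency with distance (in bins)."""
    return scale * distance ** (-exponent)


def log_likelihood(order, contacts, scale: float = 100.0, exponent: float = 1.0) -> float:
    """Poisson log-likelihood (up to a constant) of the contacts given a bin order."""
    position = {bin_id: i for i, bin_id in enumerate(order)}
    total = 0.0
    for a, b in combinations(order, 2):
        lam = expected_contacts(abs(position[a] - position[b]), scale, exponent)
        total += contacts[frozenset((a, b))] * math.log(lam) - lam
    return total


# Hypothetical bins in their true genomic order.
true_order = ["b0", "b1", "b2", "b3", "b4"]
true_pos = {b: i for i, b in enumerate(true_order)}

# Toy "observed" contacts, set to their expected values under the true order.
contacts = {
    frozenset((a, b)): expected_contacts(abs(true_pos[a] - true_pos[b]))
    for a, b in combinations(true_order, 2)
}

# A candidate order in which the segment b1..b3 has been flipped.
flipped_order = ["b0", "b3", "b2", "b1", "b4"]

ll_true = log_likelihood(true_order, contacts)
ll_flipped = log_likelihood(flipped_order, contacts)
print(f"log-likelihood, true order:    {ll_true:.2f}")
print(f"log-likelihood, flipped order: {ll_flipped:.2f}")
```

In this toy setting the true order always scores higher than the flipped one; a sampler would propose such operations (swaps, flips, insertions) repeatedly and favor those that increase the likelihood.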