Abstract
Scorpions, a seemingly primitive group of stinging arthropods, are known to exhibit marked diversity in their venom components. These venoms are known for their human pathology, but they are also important as models for therapeutic and drug development applications. We report a high-quality genome assembly and annotation of the striped bark scorpion, Centruroides vittatus, created with several shotgun libraries. The final assembly is 760 Mb in size, with a BUSCO score of 97.8%, a GC content of 30.85%, and an N50 of 2.35 Mb. We estimated 36,189 proteins, with 37.32% assigned to GO terms in our GOanna analysis. We mapped putative venom toxin genes to 2011 contigs and 60 scaffolds. We also identified expression differences between venom gland (telson) and body tissue (carapace), with 19 sodium toxin and 14 potassium toxin genes mapped to 18 contigs and two scaffolds. This assembly, along with our transcriptomic data, provides a further resource for investigating scorpion venom genomics.
Introduction
Scorpions are an ancient and diverse arthropod taxon primarily known for their medical importance and seemingly little morphological change over millions of years (Sharma et al. 2015, Lourenco 2018). Although all scorpions have a similar bauplan, they show immense variation in their venom components (Sunagar et al. 2013, Sharma et al. 2015, Housley et al. 2017, Lourenco 2018). The scorpion genus Centruroides constitutes the most medically important and one of the most diverse and wide-ranging scorpion taxa in North America (Gantenbein et al. 2001, Santibáñez-López et al. 2016, Esposito & Prendini 2019). The complex evolutionary history and geographic variability of this genus have generated controversy and taxonomic confusion (Sissom 1990, Borges et al. 2012, Santibáñez-López et al. 2016). The diversity of the venom also implies remarkable evolutionary adaptations, with new and varied constituents discovered annually (Santibáñez-López et al. 2015, Housley et al. 2017). In spite of the medical importance of scorpions, scorpion genomics has lagged behind venom transcriptomics and proteomics, with only three of the estimated 2500 scorpion species worldwide having genome assembly entries in the NCBI database.
The scorpion Centruroides vittatus encompasses a large geographic range across the western USA and northern Mexico (Figure 1). Although a member of the toxic Centruroides genus, this species is not known as medically important (Kang & Brooks 2017). However, due to coevolution with mammalian predators, evidence suggests that western C. vittatus populations may possess a more medically significant venom than eastern populations (Rowe & Rowe 2008, Bowman et al. 2021). Throughout its geographic range, C. vittatus is commonly found in diverse ecological habitats, but populations across the northern and eastern geographic distributions appear to prefer dry, rocky, south-facing slopes or glade areas. Human introduction of this scorpion also appears to have created additional populations outside its known geographic range (Shelley & Sissom 1995). Here we present the assembled, annotated genome of C. vittatus. Integration of genome and transcriptome data shows novel splicing and transcriptional activity around venom gene regions. Furthermore, this genome will complement the deposited genome of the more noxious western C. sculpturatus and expand the analysis of this ancient taxon.
Approximate geographic range of C. vittatus with a red asterisk identifying the location for the individuals collected for the genomic analysis (from Yamashita et al. 2013).
Materials and methods
Genome sequencing and assembly
Total genomic DNA was extracted from four scorpions collected in Pope County, AR with the Qiagen genomic-tip and genomic DNA buffer set (Qiagen, Inc.). The genomic DNA quality and quantity were analyzed through 0.9% agarose gel electrophoresis, Qubit, and UV spectroscopy. One genomic DNA sample was sent to the University of Arkansas for Medical Sciences DNA Sequencing Core Facility for 300 base paired-end sequencing on an Illumina MiSeq. Two genomic DNA samples were sent to the National Center for Genome Resources (NCGR, NM) for PacBio 20K library generation and sequencing on 10 SMRT cells for each individual genome (Supp Table 1 & 2). A final sample was sent to the CTPR Genomics Lab at Arkansas Children’s Hospital Research Institute for 300-cycle mid-output Illumina NextSeq genome sequencing.
The de novo assembly was conducted at the Arkansas High Performance Computing Center at the University of Arkansas. Quality control of the Illumina short reads was conducted with FastQC (0.11.5; Andrews 2010), and the reads were trimmed with Trimmomatic (0.39; Bolger et al. 2014). For PacBio CLR long reads, NanoPlot (1.0.0; De Coster 2018) and FiltLong (0.2.0; Wick 2018) were utilized. The MiSeq reads were incorporated into the Flye assembly to correct the long PacBio reads. The two PacBio CLR long-read datasets and the quality-trimmed Illumina NextSeq reads were assembled with several approaches: MaSuRCA (V3.4.0; Zimin et al. 2013); Flye (V2.8.1; Kolmogorov et al. 2019); and Flye applied to reads that were first error corrected with Ratatosk (V0.1; Holley et al. 2020) using the Illumina data, with consensus polishing via the tool incorporated with Flye. Draft assemblies were evaluated by two criteria: (i) the N50 statistic of contig sizes, using QUAST v.5.0.2 (Gurevich et al. 2013), and (ii) the completeness score based on the presence of universal single-copy ortholog genes, using BUSCO v.4.1.0 (Manni et al. 2021) against Arachnida ortholog dataset 10 (arachnida_odb10). Lastly, we identified and removed a unique Mycoplasma genome from our reads (Yamashita et al. 2019). The Mycoplasma genome was identified from the PacBio genomic sequence assembly as a unique 683,827 bp contig with a distinct GC content (43.7%) compared to the 30.85% GC content calculated for the scorpion contigs.
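The two contig-level quantities used in this section, N50 and GC content, can be illustrated with a minimal Python sketch. It is not the QUAST implementation or the procedure used for this assembly. The function names, the toy sequences, and the 0.10 tolerance are hypothetical choices made only for illustration.

```python
from typing import Dict, Iterable, List


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def flag_gc_outliers(contigs: Dict[str, str],
                     expected_gc: float,
                     tolerance: float) -> List[str]:
    """Names of contigs whose GC fraction differs from expected_gc
    by more than tolerance (a hypothetical screening rule)."""
    return [name for name, seq in contigs.items()
            if abs(gc_fraction(seq) - expected_gc) > tolerance]


if __name__ == "__main__":
    # Toy sequences, not real data from this assembly.
    toy = {"host_like": "AATTAGTTGC", "contaminant_like": "GCATGCATGA"}
    print(n50(len(s) for s in toy.values()))
    print(flag_gc_outliers(toy, expected_gc=0.3085, tolerance=0.10))
```

With the values reported above, a contig at 43.7% GC differs from the 30.85% scorpion background by about 0.13, so it would be flagged under the illustrative 0.10 tolerance.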
Transcriptome assembly and annotation
Two male scorpions and one female scorpion were collected in northwest Arkansas and fed crickets, with visual confirmation of prey envenomation; three days later, they were harvested for telson (venom gland) and carapace (body tissue) transcriptome analysis. The scorpions were flash frozen at −80 °C and total RNA was extracted with a Trizol preparation (Sigma-Aldrich, St. Louis, MO, USA). RNA sample quality was analyzed through electrophoresis with an Agilent TapeStation system. RNA-seq with 50 bp reads was conducted at the University of Delaware on an Illumina genome sequencer (Illumina, Inc., San Diego, CA, USA). The data were checked for initial quality with FastQC (v0.11.7) and trimmed with Trimmomatic (v0.36; Bolger et al. 2014), and normalization was performed using Trinity (v2.5.1; Haas et al. 2013). The normalized reads were then assembled with the following de novo assembly programs: Trinity (v2.5.1), SOAPdenovo2 (v2.4.1; Li et al. 2009), Velvet (v1.2.10; Zerbino et al. 2008), and TransAbyss (v1.5.4; Robertson et al. 2010), resulting in four individual assemblies. The transcriptome assemblies were then aggregated using EviGene (Gilbert 2019) to remove redundancies, pick the best representatives, and filter out misassemblies. The final assembly was mapped to the genome assembly using NGen and quantified as RPKM using ArrayStar. RPKMs for contigs identified by BLASTn of the assembly for key genes were extracted for each sample over all contigs matching that BLAST query. In addition, the transcriptome assembly was searched with BLAST using a query database created from NCBI scorpion toxin and our current sodium toxin databases (2133 total toxin sequences). From these BLAST searches, RPKM values for the two males and the female were summarized for sodium toxin RNAs, with further searches for additional scorpion toxin RNAs.
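For readers unfamiliar with the RPKM unit, the standard definition (reads per kilobase of transcript per million mapped reads) can be written as a short Python sketch. This is only the textbook formula, not the ArrayStar implementation. The function name and the example numbers are hypothetical.

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase per million mapped reads.

    RPKM = read_count * 1e9 / (feature_length_bp * total_mapped_reads)
    The factor 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling.
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical numbers: 500 reads on a 1,000 bp contig in a
    # library of 10 million mapped reads.
    print(rpkm(500, 1_000, 10_000_000))
```

Because the value is scaled by both feature length and library size, RPKMs for the same contig can be compared between the telson and carapace libraries even when sequencing depth differs.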
Transcriptome datasets were deposited in NCBI with the following IDs: TSA: GIPT01000000, SRA: SRR11917465, BioProject: PRJNA636371, BioSample: SAMN15075759
Genome annotation
Repetitive elements were catalogued using RepeatModeler (V2.0.2a; Flynn et al. 2020) and masked with RepeatMasker (V4.1.2; Smit et al. 2013). The repeat-masked genome was indexed, and RNASeq reads from the carapace (body tissue) and the telson (venom gland) were aligned with STAR (V2.7.9a; Dobin et al. 2013) to create BAM files. Additionally, the RNASeq data were utilized to annotate the repeat-masked scorpion genome with BRAKER (V2.1.6; Hoff et al. 2019). The predicted proteins from BRAKER were then utilized for a BLAST analysis of the TrEMBL database. Lastly, the polished C. vittatus genome, the carapace and telson RNASeq BAM files, and the predicted, annotated polypeptides from BRAKER and the TrEMBL BLAST were loaded into IGV (V2.13.2a; Robinson et al. 2011) to view RNASeq Sashimi plots of the expression data relative to annotated exons. We also built a BLASTn database from the polished C. vittatus genome and searched it with a sodium toxin query file housing 2133 sequences to further map toxin genes. The IGV visualizations used to examine differential expression between carapace and telson were focused on contigs and scaffolds containing putative toxin genes based on the BLASTn toxin queries.
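The step of narrowing IGV inspection to contigs with toxin hits can be sketched in Python as follows. The sketch assumes the BLASTn results were saved in the standard 12-column tabular format (BLAST+ output format 6), with toxin sequences as queries and genome contigs as subjects. The source does not state the output format or any E-value cutoff, so the function name, the cutoff, and the toy rows are all hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, Tuple

# Default column order of BLAST+ tabular output (format 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

Hit = Tuple[str, int, int, str]  # (toxin query, start, end, strand)


def toxin_hits_by_contig(tabular_text: str,
                         max_evalue: float = 1e-5) -> Dict[str, List[Hit]]:
    """Group toxin hits by genome contig, keeping hits at or below max_evalue.

    In nucleotide searches a minus-strand hit is reported with
    sstart > send, so strand is inferred from the coordinate order.
    """
    hits: Dict[str, List[Hit]] = defaultdict(list)
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(FIELDS, row))
        if float(rec["evalue"]) > max_evalue:
            continue
        sstart, send = int(rec["sstart"]), int(rec["send"])
        strand = "+" if sstart <= send else "-"
        hits[rec["sseqid"]].append(
            (rec["qseqid"], min(sstart, send), max(sstart, send), strand))
    return dict(hits)


if __name__ == "__main__":
    # Toy rows, not real results from this study.
    demo = (
        "toxA\tcontig_1\t95.0\t200\t10\t0\t1\t200\t1000\t1199\t1e-50\t300\n"
        "toxB\tcontig_1\t90.0\t180\t18\t0\t1\t180\t5180\t5001\t1e-30\t200\n"
        "toxC\tcontig_2\t70.0\t50\t15\t0\t1\t50\t10\t59\t0.5\t30\n"
    )
    for contig, contig_hits in toxin_hits_by_contig(demo).items():
        print(contig, contig_hits)
```

Recording the inferred strand alongside each hit makes it straightforward to spot contigs where toxin matches fall on both strands.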
A functional annotation analysis used the Cyverse pipeline developed for arthropods (Saha et al. 2021). This pipeline combines outputs from GOanna (GO annotations) and InterProScan (functional motifs) as well as mapping proteins to pathways via KOBAS.
Results and discussion
Genome assembly and annotation
The four genome sequencing outputs resulted in three final de novo C. vittatus genome assemblies (Table 1). Of the three final genome assemblies, the Ratatosk-Flye assembly was judged as the most complete, as it exhibited the largest reduction in contig number, with the largest contig size and N50 (Table 1, Table S1). This assembly also showed the best BUSCO statistics, with 97.8% complete (92.6% unique, 5.2% duplicates) (Table 2). The transcriptome basic statistics for the venom gland are presented in Table 3, with the repetitive element data in Table S2. The transcriptomic data and the annotated polypeptide data were incorporated into IGV to visualize gene expression variation between the venom gland and body tissue (Figures 2 – 4). The functional annotation workflow predicted 36,189 proteins, which is higher than the 17,364 proteins predicted in L. hesperus (Western Black Widow Spider) (Saha et al. 2021), but comparable to the C. sculpturatus protein number of 35,529. The GOanna and InterProScan results for C. vittatus and L. hesperus are comparable (Table 4). The KOBAS output shows a marked difference, with C. vittatus exhibiting higher numbers of proteins assigned to pathways and a higher percentage assigned to KEGG pathways when compared to L. hesperus (Table 5).
IGV visualization of contig 1491 (49Kb) showing putative sodium toxin gene expression mapped with respect to carapace (Panel A) and telson (Panel B). Predicted proteins from BRAKER and the TrEMBL BLASTp results are mapped in Panel C. Gene expression variation is denoted adjacent to the pale blue blocks and suggests clustering of sodium toxin genes in this contig. The red and blue arcs in panels A & B represent splice junctions connecting exons. The red represents mRNA reads mapped to the + strand; the blue represents mRNA reads mapped to the – strand. The section of the contig that shows the five putative sodium toxin gene regions spans 35Kb with intervening genome sequence of 3 – 8Kb. The scales for mRNA read coverage are the same in panels A & B and range from 0 - 250 reads.
IGV visualization of the Sashimi plot for contig 1491 (49Kb) of the region shown in Fig 2A, illustrating putative sodium toxin gene expression mapped with respect to carapace (A) and telson (B). The contig location is delineated in C. Predicted proteins from BRAKER are mapped in D. Gene expression variation between the two tissues is denoted with thin red or blue bar graphs along A or B. Splice junctions are noted with red (carapace) or blue (telson) arcs connecting putative exons, with the read number shown in the arcs; the pattern suggests clustering of sodium toxin genes in this contig. Arcs above the alignment track represent reads mapped to the + strand, and arcs below the alignment track represent reads mapped to the – strand. The read numbers for the carapace (red) from left to right are 11, 49, 32, 36, and 131. The reads mapped to the telson (blue) from left to right are 4672, 25159, 18368, 13351, and 68955. The section of the contig that shows the five putative sodium toxin gene regions spans 35Kb with intervening genome sequence of 3 – 8Kb.
Toxin gene specifics
The BRAKER annotation of the Ratatosk-Flye assembly mapped putative toxin genes to 2011 contigs and 60 scaffolds (Table 6). The mapping of toxin genes to the assembly with the scorpion toxin BLASTn file refined the genes to a subset of 848 contigs and 57 scaffolds. Further analysis of the BLASTn output identified putative toxin genes in 18 contigs and 2 scaffolds in which the toxin genes showed much higher expression in the telson (venom gland) than in the carapace, including 19 putative sodium and 14 putative potassium toxin genes.
Scorpion toxin gene expression mapped to the C. vittatus genomic assembly. The Ratatosk-Flye assembly totals represent the subset of contigs and scaffolds with identified putative toxin genes from BRAKER. The scorpion toxin Blastn file represents contigs and scaffolds with toxin gene hits from a query file of 2133 scorpion toxin genes. The total sodium and potassium toxin genes with higher telson expression mapped to the 18 contigs and 2 scaffolds from the filtered Blastn data are 19 & 14 genes, respectively.
The toxin gene mapping suggests that many toxin genes are differentially expressed between body tissue and venom glands rather than expressed uniquely in venom glands. One contig (contig 1491, 49Kb) contains a 35Kb region with five paralogs of a putative sodium toxin gene in tandem, separated by intervening genomic sequences of 3 – 8Kb, suggesting ancestral gene duplication in this region (Fig. 2a). The IGV view of contig 1491 also suggests the sodium toxin genes in this region are arranged on both + and – DNA strands, which may indicate gene inversions. The Sashimi plot of this region (Fig. 2b) clarifies the RNAseq mapping, showing markedly higher expression in telson mRNAs in this putative sodium toxin gene region.
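The size of the telson versus carapace difference on contig 1491 can be illustrated from the splice-junction read counts listed in the Sashimi plot caption. The Python sketch below assumes that the two left-to-right lists pair up by position, and it uses raw counts that are not normalized for library size. The ratios are therefore only a rough illustration, not a differential expression estimate. The function and variable names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Splice-junction read counts from the Sashimi plot caption,
# listed left to right (pairing by position is an assumption).
CARAPACE_JUNCTION_READS = [11, 49, 32, 36, 131]
TELSON_JUNCTION_READS = [4672, 25159, 18368, 13351, 68955]


def telson_to_carapace(telson: Sequence[int],
                       carapace: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (ratio, log2 ratio) of telson to carapace counts per junction."""
    if len(telson) != len(carapace):
        raise ValueError("count lists must have the same length")
    results = []
    for t, c in zip(telson, carapace):
        if t <= 0 or c <= 0:
            raise ValueError("counts must be positive to form a log ratio")
        ratio = t / c
        results.append((ratio, math.log2(ratio)))
    return results


if __name__ == "__main__":
    pairs = telson_to_carapace(TELSON_JUNCTION_READS, CARAPACE_JUNCTION_READS)
    for i, (ratio, log2_ratio) in enumerate(pairs, start=1):
        print(f"junction {i}: ratio={ratio:.1f}, log2={log2_ratio:.2f}")
```

On these raw counts, every junction has more than 300 times as many telson reads as carapace reads, yet the carapace counts are not zero, consistent with differential rather than unique expression.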
Other contigs showed a pattern of larger genomic regions with mapped sodium toxin genes. For example, contig 2703 shows a 1900 bp region with putative sodium toxin genes mapped, suggesting multiple sodium toxin genes in this region (Fig. 3). Putative potassium toxin genes were located on other contigs and scaffolds, six versus 15 for putative sodium toxin genes, with no evidence of duplicated regions (e.g., Fig. 4). These genes also appear to exhibit differential expression in the telson vs the carapace rather than unique expression in the telson. These initial findings support a model of recent toxin gene duplication events that may underlie the incredible sodium toxin diversity in the New World Centruroides species (Rendon-Anaya et al. 2012, Drukewitz & von Reumont 2019).
IGV visualization of a 5Kb region in contig 2703 (157Kb) showing putative sodium toxin gene expression mapped with respect to carapace (Panel A) and telson (Panel B). Predicted Proteins from BRAKER and the TrEMBL BLASTp results are mapped in Panel C. Gene expression variation is denoted with the 50bp reads shown in the grey regions and suggests clustering of sodium toxin genes in this contig. The section of the contig that shows the putative sodium toxin gene region spans 2150bp. The scales for mRNA read coverage are the same in panels A & B and range from 0 - 250 reads.
IGV visualization of a 2,300bp region in contig 82 (1,461Kb) showing putative potassium toxin (ERG1) gene expression mapped with respect to carapace (Panel A) and telson (Panel B). Gene expression variation between the two tissues is denoted with the grey regions with the 50 bp reads and the pale blue region is a putative intron gap of 1700bp between mapped reads. The scales for mRNA read coverage are the same in panels A & B and range from 0 - 250 reads.
Conclusion
We describe a genomic assembly and annotation for the scorpion Centruroides vittatus coupled with transcriptomic data mapped to contigs and scaffolds. Our assembly shows a genome of 760Mb in length, with 98% of sequences mapped to 2071 contigs. Our results also highlight the substantial toxin gene diversity in this scorpion and show toxin gene expression patterns between body tissue and the venom gland. This genome will complement the growing number of venomous species with genomes in published databases.
Data availability
The genome assembly was deposited at NCBI under accession number JASCZU000000000; BioProject PRJNA937744; BioSample SAMN33417986.
Supplementary material is available at G3 online.
Funding
Funding was provided by the Arkansas INBRE program, with a grant from the National Institute of General Medical Sciences (NIGMS), P20 GM103429, from the National Institutes of Health, USA.
Conflicts of interest
None declared
Acknowledgements
The authors thank Shane Sanders for conducting the RepeatModeler, RepeatMasker, STAR, and BRAKER analyses.