Abstract
Freshwater mussels (Bivalvia: Unionida) serve an important role as aquatic ecosystem engineers but are one of the most critically imperilled groups of animals. An assembled and annotated genome for freshwater mussels has the potential to be utilized as a valuable resource for many researchers given their ecological value and threatened status. In addition, a sequenced genome will help to answer more fundamental questions of sex-determination and genome evolution in bivalves exhibiting a unique “doubly uniparental inheritance” mode of mitochondrial DNA transmission through comparative genomics approaches. Here, we used a combination of sequencing strategies to assemble and annotate a draft genome of the freshwater mussel Venustaconcha ellipsiformis. The genome described here was obtained by combining high coverage short reads (65X genome coverage of Illumina paired-end and 11X genome coverage of mate-pairs sequences) with low coverage Pacific Biosciences long reads (0.3X genome coverage). Briefly, the final scaffold assembly accounted for a total size of 1.54Gb (366,926 scaffolds, N50 = 6.5Kb, with 2.3% of "N" nucleotides), representing 93% of the predicted genome size of 1.66Gb. Over one third of the genome (37.5%) consisted of repeated elements and more than 85% of the core eukaryotic genes were recovered. Finally, we reassembled the full mitochondrial genome and found six polymorphic sites with respect to the previously published reference. This resource opens the way to comparative genomics studies to identify genes related to the unique adaptations of freshwater mussels and their distinctive mitochondrial inheritance mechanism.
Introduction
Through their water filtration action, freshwater mussels (Bivalvia: Unionida) serve important roles as aquatic ecosystem engineers (Gutiérrez et al. 2003; Spooner & Vaughn 2006), and can greatly influence species composition (Aldridge et al. 2007). From a biological standpoint, they are also well known for producing obligate parasitic larvae that metamorphose on freshwater fishes (Lopes-Lima et al. 2014), for being slow-growing and long-lived, with several species reaching >30 years old and some species >100 years old (see Haag & Rypel 2011 for a review), and for exhibiting an unusual system of mitochondrial transmission called Doubly Uniparental Inheritance or DUI (see Breton et al. 2007; Passamonti & Ghiselli 2009; Zouros 2013 for reviews). From an economic perspective, freshwater mussels are also exploited to produce cultured pearls (Haag 2012). Regrettably, however, habitat loss and degradation, overexploitation, pollution, loss of fish hosts, introduction of non-native species, and climate change have resulted in massive freshwater mussel decline in the last decades (reviewed in Lopes-Lima et al. 2017; 2018). For example, more than 70% of the ∼300 North American species are considered endangered at some level (Lopes-Lima et al. 2017).
While efforts are currently underway to sequence and assemble the genome of the marine mussel Mytilus galloprovincialis (Murgarella et al. 2016), genomic resources for mussels in general are still extremely scarce. In addition to M. galloprovincialis, the genomes of two other marine mytilid mussel species, i.e. the deep-sea vent/seep mussel Bathymodiolus platifrons and the shallow-water mussel Modiolus philippinarum have recently been published (Sun et al. 2017). In all cases, genomes have proven challenging to assemble due to their large size (∼1.6 to 2.4Gb) and widespread presence of repeated elements (∼30% of the genome, and up to 62% of the genome for the shallow-water mussel Modiolus philippinarum, Sun et al. 2017).
For example, the Mytilus genome remains highly fragmented, with only 15% of the gene content estimated to be complete (Murgarella et al. 2016). With respect to freshwater mussels (order Unionida), no nuclear genome draft currently exists. An assembled and annotated genome for freshwater mussels has the potential to be a valuable resource for many researchers given the biological value and threatened status of these animals. Such a resource is needed to help identify genes essential for survival (and/or the genetic mechanisms that led to decline) and ultimately to develop monitoring tools for endangered biodiversity and plan sustainable recoveries (Pavey et al. 2016; Savolainen et al. 2013). In addition, a sequenced genome will help answer more fundamental questions of sex-determination (Breton et al. 2011; 2017) and genome evolution through comparative genomics approaches (e.g. Sun et al. 2017).
Given the challenges in assembling a reference genome for saltwater mussels (Sun et al. 2017; Murgarella et al. 2016), we used a combination of different sequencing strategies (Illumina paired-end and mate-pair libraries, Pacific Biosciences long reads, and a recently assembled reference transcriptome; Capt et al. 2018) to assemble the first genome draft in the family Unionidae. Hybrid sequencing strategies that combine low-coverage long reads with high-coverage short reads offer an affordable approach, with the advantage of assembling repeated regions of the genome (for which short reads are ineffective) while circumventing the relatively higher error rate of long reads (Koren et al. 2012; Miller et al. 2017). Here, we present a de novo assembly and annotation of the genome of the freshwater mussel Venustaconcha ellipsiformis.
Methods
To determine the expected sequencing effort to assemble the Venustaconcha ellipsiformis genome, i.e., the necessary software and computing resources required, we first searched for C-values from other related mussel species. C-values indicate the amount of DNA (in picograms) contained within a haploid nucleus and are roughly equivalent to genome size in gigabases (1 pg corresponds to approximately 978 megabases). Two closely related freshwater mussel species (Elliptio sp., C-value = 3; Uniomerus sp., C-value = 3.2), in addition to two other well studied mussel groups (Mytilus spp., C-value = 1.3-2.1; Dreissena polymorpha, C-value = 1.7) were identified on the Animal Genome Size Database (http://www.genomesize.com). As such, we estimated the Venustaconcha genome size to be around ∼1.5-3.0Gb. This originally served as a coarse guide to determine the sequencing effort required, given that no mussel genome had yet been published when the sequencing for Venustaconcha was planned.
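The conversion from C-value to approximate genome size can be sketched as follows. This is an illustration only: it assumes the commonly used factor of about 0.978 Gb per picogram of DNA, and the function and variable names are hypothetical rather than part of the study's pipeline.

```python
# Assumed conversion factor: 1 pg of DNA corresponds to roughly 0.978 Gb.
PG_TO_GB = 0.978


def c_value_to_gb(c_value_pg: float) -> float:
    """Convert a haploid C-value (pg) to an approximate genome size (Gb)."""
    return c_value_pg * PG_TO_GB


# C-values quoted in the text (Animal Genome Size Database).
c_values = {
    "Elliptio sp.": 3.0,
    "Uniomerus sp.": 3.2,
    "Mytilus spp. (lower bound)": 1.3,
    "Mytilus spp. (upper bound)": 2.1,
    "Dreissena polymorpha": 1.7,
}

sizes_gb = {name: round(c_value_to_gb(c), 2) for name, c in c_values.items()}
for name, size in sizes_gb.items():
    print(f"{name}: ~{size} Gb")
```

Under this assumption the related taxa span roughly 1.3 to 3.1 Gb, which is consistent with the coarse 1.5-3.0Gb planning range used here.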
Mussel specimen sampling, genomic DNA extraction and library preparation
Adult specimens of Venustaconcha ellipsiformis were collected from Straight River (Minnesota, USA; Lat 44.006509, Long -93.290899) and sexed by microscopic examination of gonad smears. Gills were dissected from a single female individual and genomic DNA was extracted using a Qiagen DNeasy Blood & Tissue Kit (QIAGEN Inc., Valencia, CA, USA) following the animal tissue protocol. The quality and quantity of DNA were assessed, respectively, by electrophoresis on 1% agarose gel and with a BioDrop mLITE spectrophotometer (a total of 15 µg of DNA was quantified using the spectrophotometer). For whole genome shotgun sequencing and draft genome assembly, we used two sequencing platforms: Illumina (San Diego, CA) HiSeq 2000 and Pacific Biosciences (Menlo Park, CA) PacBio RSII. First, three paired-end libraries with an insert size of 300b were constructed using the Illumina TruSeq DNA Sample Prep Kit. One mate-pair library with insert sizes of about 5Kb was constructed for the scaffolding process using the Illumina Nextera mate-pair library construction protocol. To improve the assembly, the Pacific Biosciences system was used to generate long reads for the final scaffolding process. Pacific Biosciences long reads (>10Kb) were generated using the SMRTbell library preparation protocol (ten SMRT cells were sequenced). Construction of sequencing libraries and sequencing were performed at the Genome Quebec Innovation Centre (McGill University, Qc, Canada).
Pre-processing of sequencing reads
We quality trimmed paired-end and mate-pair reads using TRIMMOMATIC 0.32 (Bolger et al. 2014) with the options ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:6:10 MINLEN:36. This removed bases below a threshold Phred score of three at the leading and trailing ends of each read, and trimmed reads based on a sliding window calculation of quality (minimum mean Phred score of ten over six bases). Finally, if a trimmed read fell below a threshold length (36b), both reads of the pair were removed. We visually verified read quality (including contamination with Illumina paired-end adaptors) before and after trimming using FASTQC (Andrews 2010). This allowed us to keep only high quality reads prior to the assembly steps.
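The logic of the LEADING, TRAILING, SLIDINGWINDOW and MINLEN steps can be illustrated with a simplified sketch. This is not Trimmomatic's implementation (adapter clipping and some details of window handling are omitted), and the function name and signature are hypothetical; it only mirrors the thresholds quoted above.

```python
def trim_read(quals, leading=3, trailing=3, window=6, min_avg=10, min_len=36):
    """Return (start, end) of the retained slice, or None if the read is dropped.

    `quals` is a list of per-base Phred scores for one read.
    """
    start, end = 0, len(quals)
    # LEADING: drop bases from the 5' end while quality is below the threshold.
    while start < end and quals[start] < leading:
        start += 1
    # TRAILING: drop bases from the 3' end while quality is below the threshold.
    while end > start and quals[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW: cut the read at the first window whose mean quality is too low.
    for i in range(start, end - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            end = i
            break
    # MINLEN: discard reads that end up shorter than the minimum length.
    return (start, end) if end - start >= min_len else None


good_read = trim_read([30] * 100)
degraded_read = trim_read([2, 2] + [30] * 60 + [5] * 38)
short_read = trim_read([30] * 30)
```

In a paired-end setting, a pair would be kept only if this function returns a slice for both mates, matching the rule that both reads are removed when either falls below 36b.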
Following quality trimming, we used BFC (Li & Durbin 2009) to perform error correction for the Illumina paired-end sequencing data. BFC suppresses systematic sequencing errors, which helps to improve the base accuracy of the assembly and reduce the complexity of the de Bruijn graph based assembly, described below.
Corrected paired-end reads were subsequently used to identify the optimal K value that provides the most distinct genomic k-mers using KMERGENIE v1.7016 (Chikhi & Medvedev 2014). We tested k = 10 to 100, in incremental steps of 10, and we then refined the interval from 20 to 40, in incremental steps of 2 to get a more precise estimate of K. Based on the best K value (k=42), KmerGenie was also used to estimate genome size.
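The idea of sweeping k and retaining the value that yields the most distinct k-mers can be sketched on toy data. This is a deliberately naive illustration: KmerGenie works from sampled k-mer abundance histograms and a statistical model to separate genomic from erroneous k-mers, whereas the sketch below simply counts k-mers seen at least `min_count` times. All names are hypothetical.

```python
from collections import Counter


def count_solid_kmers(reads, k, min_count=2):
    """Count distinct k-mers observed at least `min_count` times across reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return sum(1 for c in counts.values() if c >= min_count)


def sweep_k(reads, k_values, min_count=2):
    """Return the k with the most solid k-mers, plus the per-k totals."""
    totals = {k: count_solid_kmers(reads, k, min_count) for k in k_values}
    best = max(totals, key=totals.get)
    return best, totals


# Toy example: two identical reads, so every k-mer is seen exactly twice.
toy_reads = ["ACGTAC", "ACGTAC"]
best_k, totals = sweep_k(toy_reads, [3, 4, 6])
print(best_k, totals)
```

A coarse sweep followed by a finer sweep around the best value, as described above, amounts to calling `sweep_k` twice with different `k_values`.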
Genome assembly strategy
We used ABYSS 2.0 (Jackman et al. 2017), a modern genome assembler specifically built for large genomes and reads acquired by different sequencing strategies. The original ABYSS (Simpson et al. 2009) uses a distributed de Bruijn graph representation of the genome, which allows parallel computation of the assembly algorithm across a network of computers. In addition, the software makes use of long-range sequencing information (Illumina mate-pair libraries and Pacific Biosciences long reads) to bridge gaps and scaffold contigs. Yet, memory requirements and computing time increase sharply with genome size and, for large genomes (>1Gb), rapidly become very large (>100GB of RAM) and impractical. Consequently, Jackman et al. (2017) introduced ABYSS 2.0, which employs a probabilistic data structure called a Bloom filter (Bloom 1970) to store a de Bruijn graph representation of the genome and thereby greatly reduces memory requirements and computing time. The Bloom filter allows the majority of nearly identical k-mers likely caused by sequencing errors to be removed from memory, as k-mers with an occurrence count below a user-specified threshold are discarded. The caveat is that it can generate false positive extensions of contigs, but through optimization, the false positive rate can be kept well below 5%, and false positives can be corrected later in the assembly process (Jackman et al. 2017).
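How a Bloom filter can discard low-occurrence k-mers may be easier to see in a small sketch. The following is an illustrative cascade of Bloom filters, written under the assumption that a k-mer is promoted one level each time it is seen again; it is not the ABYSS 2.0 implementation, and the class names and hashing scheme are hypothetical.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive one bit position per hash function by seeding SHA-256.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


class CascadingBloomFilter:
    """A k-mer reaches the last level only after `levels` observations."""

    def __init__(self, levels, num_bits, num_hashes):
        self.filters = [BloomFilter(num_bits, num_hashes) for _ in range(levels)]

    def add(self, kmer):
        # Insert into the first level that does not yet report the k-mer.
        for level in self.filters:
            if kmer not in level:
                level.add(kmer)
                return

    def is_solid(self, kmer):
        return kmer in self.filters[-1]


cbf = CascadingBloomFilter(levels=3, num_bits=1 << 16, num_hashes=4)
for kmer in ["ACGTA", "ACGTA", "ACGTA", "TTTTT"]:
    cbf.add(kmer)
print(cbf.is_solid("ACGTA"), cbf.is_solid("TTTTT"))
```

Because membership tests can return false positives but never false negatives, a rare k-mer may occasionally be promoted too early; this is the source of the false positive contig extensions mentioned above.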
In the current study, we combined different types of high throughput sequencing to aid in assembling the genome (Table 1). ABYSS 2.0 (Jackman et al. 2017) performs a first genome assembly step without using the paired-end information, by extending unitigs until either they cannot be unambiguously extended or come to an end due to a lack of coverage (uncorrected unitigs). This first de Bruijn graph representation of the genome is further cleaned of vertices and edges created by sequencing errors (unitigs). Paired-end information is then used to resolve ambiguities and merge contigs. Following this, mate-pairs are mapped onto the assembly to create scaffolds, and finally long reads (Pacific Biosciences long reads) and the Venustaconcha reference transcriptome from Capt et al. (2018) were also mapped onto the assembly to create long-scaffolds. This reference transcriptome was assembled from a pool of sequences coming from four different male and female individuals and further details are provided in Capt et al. (2018). Although ideally sequencing information would all come from a single individual, the current study design did not allow for this. In addition, given that coding sequences are conserved compared to non-coding regions, it remains highly valuable to use a transcriptome in a de novo genome assembly.
We ran the ABYSS 2.0 assembly stage (abyss-bloom-dbg) with a k-mer size of 41 (ABYSS requires an odd number k-mer), a Bloom filter size of 24GB, 4 hash functions and a threshold of k-mer occurrence set at 3. These parameters were chosen after performing several test assemblies, in order to minimize the false positive rate (<5%), maximize the N50 of the assembly and keep the virtual memory (95GB) and CPU (24 CPUs) requirements within a reasonable computational limit for our resources. In addition, we adjusted parameters at the mapping stage to create contigs, scaffolds and long-scaffolds to maximize N50 (overlap required in re-alignments, distance between mate-pairs, nb reads aligned to support assembly, see pipeline available at https://github.com/seb951/venustaconcha_ellipsiformis_genome).
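The trade-off between Bloom filter size, number of hash functions and false positive rate can be approximated with the standard formula FPR ≈ (1 - exp(-h·n/m))^h, where m is the number of bits, h the number of hash functions and n the number of distinct items inserted. The sketch below is a rough guide only: it treats the filter as a single bit array, and the number of distinct k-mers used in the example is a hypothetical placeholder, not a value from this study.

```python
import math


def bloom_fpr(num_bits: float, num_hashes: int, num_items: float) -> float:
    """Approximate false positive rate of a Bloom filter."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Parameters from the text: 24 GB of memory and 4 hash functions.
FILTER_BITS = 24e9 * 8
NUM_HASHES = 4

# Hypothetical number of distinct k-mers inserted (assumption for illustration).
assumed_distinct_kmers = 3e9

estimated_fpr = bloom_fpr(FILTER_BITS, NUM_HASHES, assumed_distinct_kmers)
print(f"Approximate FPR: {estimated_fpr:.2e}")
```

The formula makes clear why test assemblies are useful: for a fixed memory budget, the false positive rate rises quickly as more distinct k-mers are stored.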
Genome completeness was assessed using BUSCO 3.0.2 (Benchmarking Universal Single-Copy Orthologs, Simao et al. 2015). Briefly, BUSCO uses curated lists of known core single copy orthologs to produce evolutionarily-informed quantitative measures of genome completeness (Simao et al. 2015). Here, we tested both the eukaryotic (303 single copy orthologs) and metazoan (978 single copy orthologs) gene lists to assess the completeness of our genome assembly.
Characterization of repetitive elements
Given that repetitive elements can occupy large proportions of a genome, the characterization of their proportion and composition is an essential step during genome annotation. RepeatModeler open-1.0.10 (Smit & Hubley 2015) was used to create an annotated library of repetitive elements contained in the Venustaconcha genome assembly (excluding sequences <1Kb). Then, with RepeatMasker open-4.0.7 (Smit et al. 2015), we extracted libraries of repetitive elements for the taxa “Bivalvia” and “Mollusca” from the RepeatMasker combined database (comprising the databases Dfam_consensus-20170127 and RepBase-20170127) using built-in tools. Sequences classified as “artefact” were removed from the last two libraries before the subsequent steps. The three libraries were used alone and/or in combination (except for the Mollusca+Bivalvia combination) to mask the cut-down assembly again with RepeatMasker, specifying the following options: -nolow (to avoid masking low complexity sequences, which may enhance subsequent exon annotation), -gccalc (to calculate the overall GC percentage of the input assembly), -excln (to exclude runs of ≥20 Ns in the assembly sequences from the masking percentage calculations). Option -species was used to specify the taxon for the runs with the Bivalvia and Mollusca libraries, while option -lib was used to specify the Venustaconcha library and the combined ones. Result summaries for the latter three runs were refined with the RepeatMasker built-in tools. A linear model of repeat content as a function of genome size across all available bivalve genomes was fitted with R version 3.1.0 (R Core Team 2012), using the highest masking value found for Venustaconcha.
Genome annotation
We used QUAST (Gurevich et al. 2013) to calculate summary statistics on the genome assembly. In addition, QUAST uses a Hidden Markov Model to identify putative genes in the final assembly (GLIMMERHMM, Majoros et al. 2004). Following this, we translated Open Reading Frames identified in the annotation files into protein sequences using BEDTOOLS V2.27.1 (Quinlan & Hall 2010) and EMBOSS TRANSEQ V6.6.0 (Rice et al. 2000). These were then compared against the manually curated UniProt database (556,388 reference proteins, downloaded January 11th 2018) using BLASTp (Altschul et al. 1990) with an e-value cut-off of 1e-5. These steps were done on the long-scaffolds assembly, the masked long-scaffolds assembly (with low complexity regions replaced with N), and the broken long-scaffolds assembly (scaffolds broken into smaller contigs by QUAST, based on long stretches of N nucleotides).
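Applying the e-value cut-off to BLASTp results can be sketched as follows, assuming the hits were written in the standard 12-column tabular format (BLAST+ `-outfmt 6`), where the e-value and bit score are the 11th and 12th columns. The function name, the choice to keep one best hit per query, and the identifiers in the example are hypothetical and are not taken from the study's scripts.

```python
def best_hits(blast_tab_lines, max_evalue=1e-5):
    """Keep, for each query, the highest-scoring hit passing the e-value cut-off."""
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # Skip blank lines and comments.
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # Hit is not significant enough.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up example records (query, subject, ..., evalue, bitscore).
example_lines = [
    "orf1\tprotA\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-30\t200",
    "orf1\tprotB\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-40\t250",
    "orf2\tprotC\t35.0\t50\t32\t1\t1\t50\t1\t50\t1e-3\t40",
]
annotated = best_hits(example_lines)
print(annotated)
```

Counting the keys of the returned dictionary gives the number of Open Reading Frames with at least one significant match.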
Mitochondrial genome
Given the rare mode of mitochondrial inheritance of freshwater mussels and therefore its evolutionary importance, we first aimed to check if the female mitochondrial (mt) genome had been properly assembled. Using BLASTn (Altschul et al. 1990) with high stringency (E value <1e-50), we identified a fragmented mitochondrial genome. We then created a mt specific dataset containing 1,396,004 sequence reads by aligning paired-end reads to the reference mt genome of Breton et al. (2009) (GenBank Acc. No. FJ809753) using SAMTOOLS V1.3.1 and BEDTOOLS V2.27.1 (Li et al. 2009; Quinlan & Hall 2010). We then rebuilt the mt genome de novo using ABYSS 2.0, testing different k-mers (17-45). In addition, we aligned reads to the reference mt genome using BWA V0.7.12-R1039 (Li & Durbin 2009) and identified Single Nucleotide Polymorphisms (SNPs) with respect to this reference using SAMTOOLS and BCFTOOLS v1.3.1 (Li et al. 2009).
Results and Discussion
We generated 564M paired-end reads (2 X 100b) representing an average 65X coverage of the genome (Table 1). This was complemented by 98M mate-pairs (5Kb insert, 11X average genome coverage), 103,000 Pacific Biosciences long reads (0.3X average genome coverage), and a recently published reference transcriptome comprising 285,000 contigs (Capt et al. 2018). Filtering and trimming the raw paired-end and mate-pair sequences removed about 5% of the total base pairs from further analyses, indicating that the quality of the raw sequences was high (Table 1). K-mer analysis indicated that the number of unique k-mers peaked at k = 42 and predicted a genome size of 1.66Gb (Figure 1). This is smaller than the genome size predicted from C-values of other Unionida, but in general agreement with the recent draft genomes of the marine mussel Mytilus galloprovincialis (1.6Gb) and the deep-sea vent/seep mussel Bathymodiolus platifrons (1.64Gb).
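The reported paired-end coverage can be checked with a back-of-the-envelope calculation: coverage is the number of sequenced bases divided by the genome size. The sketch below assumes that 564M refers to read pairs (two 100b reads each) and that the roughly 5% of bases removed by trimming applies uniformly to the paired-end data; both are assumptions made for illustration, and Table 1 remains the authoritative source.

```python
def coverage(num_reads: float, read_length_b: float, genome_size_b: float) -> float:
    """Average depth of coverage: sequenced bases divided by genome size."""
    return num_reads * read_length_b / genome_size_b


GENOME_SIZE_B = 1.66e9          # Genome size predicted by k-mer analysis.
NUM_PAIRS = 564e6               # Assumed to count read pairs, not single reads.
READ_LENGTH_B = 100
FRACTION_KEPT = 0.95            # About 5% of bases removed by trimming (assumed uniform).

raw_pe_coverage = coverage(NUM_PAIRS * 2, READ_LENGTH_B, GENOME_SIZE_B)
trimmed_pe_coverage = raw_pe_coverage * FRACTION_KEPT

print(f"Raw paired-end coverage: {raw_pe_coverage:.1f}X")
print(f"After trimming: {trimmed_pe_coverage:.1f}X")
```

Under these assumptions the trimmed paired-end data give a depth close to the 65X reported, whereas counting 564M as single reads would give roughly half that value.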
Running the ABySS 2.0 assembly stage (abyss-bloom-dbg) led to a low False Positive Rate (<0.05%). The N50 for the contig assembly was 3.2Kb with 551,875 contigs (discarding contigs <1Kb, given that small contigs likely represent artefacts and provide little information for the overall genome assembly; Pavey et al. 2016; Murgarella et al. 2016; see Table 2). Once these were corrected and paired-end and mate-pair information were added, the scaffolds N50 increased to 5.5Kb, with 2.3% of nucleotides represented as “N” (see Table 2 for the summary statistics and Table 3 for overall genome assembly statistics acquired from QUAST analysis). Adding the Pacific Biosciences long reads only slightly improved the scaffolds N50 (from 5.5 to 5.7Kb, Table 2) and slightly decreased the number of long-scaffolds >1Kb (from 423,853 to 410,237), likely because our long read coverage was quite low (0.3X, Table 1). It is also possible that the higher error rate of Pacific Biosciences sequences, compared to Illumina paired-end reads, reduced their usability (Miller et al. 2017). Adding the reference transcriptome improved the N50 to 6.5Kb and substantially decreased the number of long-scaffolds to 366,926. This final long-scaffold assembly accounted for a total size of 1.54Gb (with 2.3% of "N" nucleotides) and represented 93% of the predicted genome size of 1.66Gb. Yet, it remained highly fragmented (366,926 scaffolds, Table 2). Genome annotation statistics can also be viewed in html format and downloaded here: https://github.com/seb951/venustaconcha_ellipsiformis_genome/tree/master/annotation_quast_v3
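For readers less familiar with the N50 statistic used throughout this section, the following minimal sketch computes Nx and Lx values from a list of contig or scaffold lengths. The function name is hypothetical and the lengths in the example are toy values, not data from this assembly; QUAST was the tool actually used to compute these statistics.

```python
def nx_lx(lengths, x=50):
    """Return (Nx, Lx) for a collection of sequence lengths.

    Nx is the length of the shortest sequence in the smallest set of longest
    sequences that together cover at least x% of the assembly; Lx is the
    number of sequences in that set.
    """
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Integer comparison avoids floating point rounding at the threshold.
        if running * 100 >= total * x:
            return length, count
    return 0, 0


toy_lengths = [2, 3, 4, 5, 6]  # Toy values; total assembly length is 20.
n50, l50 = nx_lx(toy_lengths, 50)
n80, l80 = nx_lx(toy_lengths, 80)
print(n50, l50, n80, l80)
```

Because N50 depends on which sequences are included, discarding contigs shorter than 1Kb before computing it, as done here, raises the value relative to an unfiltered assembly.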
While assembly numbers (N50, number of scaffolds, etc.) are not directly comparable with other recently published genomes given the diversity of sequencing approaches (Illumina, 454, Sanger, PacBio), library types, sequencing depth and unique nature of the genomes themselves, they can give a broad perspective of the inherent difficulties of assembling large genomes. The best comparison is probably with the saltwater mussel, Mytilus galloprovincialis, given their similar genome size (1.6Gb for Mytilus vs 1.66Gb for Venustaconcha) and Illumina paired-end sequencing approaches (32X for Mytilus vs 65X for Venustaconcha). While the Mytilus genome project (Murgarella et al. 2016) did not utilize mate-pair libraries or Pacific Biosciences long reads, it did make use of sequencing libraries with varying insert sizes (180, 500 and 800b). As such, the authors obtained a genome assembly quality relatively similar to ours, consisting of 393 thousand scaffolds (>1Kb), with however a substantially lower N50 (2.6Kb compared to 6.5Kb for Venustaconcha). The recently reported genome for the deep-sea vent/seep mussel Bathymodiolus platifrons (1.64Gb) made use of nine Illumina sequencing libraries with varying insert sizes (180b to 16Kb) and an overall coverage of >300X. With this very thorough sequencing approach, the scaffold N50 obtained was substantially higher (343.4Kb), but again the genome remained highly fragmented, into >65 thousand scaffolds. As exemplified here, high coverage sequencing libraries with varying insert sizes have become a broadly used approach for large and complex genomes. In fact, it is implemented by default in many genome assembly platforms (e.g. SOAPdenovo2, Luo et al. 2012; ALLPATHS-LG, Gnerre et al. 2011). In the future, these libraries will likely be useful to further assemble the Venustaconcha genome, at least until these approaches are superseded by affordable, error free, single molecule long read sequencing (Gordon et al. 2016; Badouin 2017) or mapping approaches that allow reaching chromosome level assemblies such as optical mapping (e.g. Bionano Genomics, San Diego, CA).
Results of the BUSCO (Simao et al. 2015) analyses showed that 664 (68%) of the 978 core metazoan genes were considered complete in our assembly. When the BUSCO analysis was extended to also include fragmented matches, 871 (89%) proteins aligned. Results were similar when compared against the 303 core eukaryotic genes (61% complete, 86% complete or fragmented, Table 4). When compared to the previously published reference transcriptome for Venustaconcha ellipsiformis (Capt et al. 2018), we found fewer complete genes, but also fewer duplicated genes (97.5% complete and 24% duplicated in the reference transcriptome, compared to 68.1% complete and 1% duplicated here). This likely reflects the fact that the reference transcriptome is nearly complete, while the current reference genome is still fragmented. However, the reference transcriptome also likely contains multiple isoforms of the same genes, in addition to possible contaminating nematode sequences, despite the authors’ best efforts to minimize these problems. Previous analyses of molluscan genomes of similar size (Murgarella et al. 2016; Sun et al. 2017) found that 16% (Mytilus galloprovincialis, 1.6Gb), 25% (pearl oyster Pinctada fucata, 1.15Gb) and 36% (California sea hare Aplysia californica, 1.8Gb) of the core eukaryotic genes were complete. For their part, Sun and collaborators (2017) identified 96% of the core metazoan genes to be partial or complete in the deep-sea vent/seep mussel Bathymodiolus platifrons (1.6Gb), again reflecting that the depth and type of sequencing, in addition to the idiosyncrasies of each genome, can have considerable influence on the end results.
The custom Venustaconcha repeat library created de novo with RepeatModeler contained 2,068 families, the majority of them (1,498, 72.44% of the total) classified as “unknown”. The genome masking performed with the Bivalvia and Mollusca libraries performed poorly (masking 2.38% and 2.59%, respectively; details in Supplementary Table RM1), possibly because of the phylogenetic distance between V. ellipsiformis, which belongs to the early-branching bivalve lineage of Palaeoheterodonta, and the other bivalve and mollusk species represented in the database, as well as the relatively small number of sequences available for these taxa. The custom Venustaconcha library masked 37.17% of the genome, while the combined Venustaconcha+Bivalvia library masked 37.69% and the Venustaconcha+Mollusca library reached 37.81%, the highest masking percentage (Supplementary Table RM2). After refining, these raw values slightly decreased to 36.29%, 36.80%, and 36.91%, respectively (Supplementary Table RM3). All these latter values of repeat content fall in the 32-39% range (the median for all species is 37%) where six out of the nine sequenced bivalve species lie, irrespective of their genome size (M. philippinarum and R. philippinarum are the furthest from this interval) (Table 5 and Supplementary Figure 1). Although the number of species sequenced up to now is still low, this observation indicates that repetitive elements may contribute differently to the total genome size among the different bivalve taxa: indeed, the correlation between genome size and repeat content is weak (Supplementary Figure 1). In both the ab initio masking with the Venustaconcha library and the two combined ones, most of the identified repeats are categorized as “unknown” (22.8% of the assembly), followed by retroelements (LINEs 2.9%, LTR elements 2.3-2.4%, and SINEs 1.7%, for a total of 6.9% of the assembly) and DNA elements (5.4-5.6% of the assembly) (Supplementary Table RM3).
Direct comparisons of these values with other species should be performed with caution, as the usually large “unclassified” portion of repeats might contain species-specific variants of known elements (Murgarella et al. 2016) that may therefore change the relative weight of each category on the total.
QUAST was used to calculate summary statistics and identify putative genes in the final assembly using a hidden Markov model (Table 3). Following this, 29,031, 14,195 and 25,544 Open Reading Frames were annotated using BLASTp against the UniProt database in the long-scaffolds, broken and masked long-scaffolds assemblies, respectively.
Freshwater mussels, marine mussels, as well as marine clams are the only known exceptions in the animal kingdom with respect to the maternal inheritance of mitochondrial DNA (see Breton et al. 2007 for a review). Their unique system, characterized by the presence of two gender-associated mitochondrial DNA lineages, has therefore attracted studies to better understand mitochondrial inheritance and the evolution of mtDNA in general. Using BLASTn, we recovered 53 contigs matching the 15,975b female reference mt genome from Breton et al. (2009), indicating that the mt genome was highly fragmented and likely improperly assembled with our current approach, much like what was found in the Mytilus galloprovincialis genome draft (Murgarella et al. 2016). As such, we created a dataset of mt specific sequences that could be aligned to the mt genome (1,396,004 reads). This mt specific dataset was then re-assembled de novo, using different k-mers (17-45). Using a k-mer similar to or larger than the one used in the overall assembly (k≥41) resulted in a failed assembly (no contigs created, data not shown), while using a k-mer <21 generated a highly fragmented mt genome (data not shown). Using a k-mer between 21 and 39 generated one large contig of 16,024b comprising the entire mitogenome, with a 42b insertion in the 16S ribosomal RNA. Given the different rate of evolution of mtDNAs, it is likely that the assembly parameters we used for the whole genome were not appropriate for the V. ellipsiformis female mt genome. Finally, we also re-aligned the mt specific dataset to the original mt genome of Breton et al. (2009) and found high coverage (mean = 7,256X, SD = 682) for most positions, while for three regions coverage dropped below 300X (Figure 2). Six SNPs with respect to the reference were also identified, indicating possible polymorphism, or sequencing error in the original mt reference genome (Figure 2).
Conclusion
High throughput sequencing has the power to produce draft genomes that were reserved for model systems only ten years ago. Here we report the first de novo draft assembly of the Venustaconcha ellipsiformis genome, a freshwater mussel from the bivalve order Unionida. Our assembly covers about 93% of the predicted genome size and contains nearly 90% of the core orthologs (complete or fragmented), indicating that it is nearly complete. However, as for other mussel genomes recently published, our genome remains fragmented, showing the limits of high throughput sequencing and the necessity of combining different sequencing approaches to improve scaffolding and overall genome quality, especially when a large fraction of the genome is composed of repetitive elements. In the future, the Venustaconcha genome will benefit from a larger number of long read sequences, varying library sizes for paired-end sequencing, and the use of genetic, physical or optical maps to subsequently order scaffolded contigs into pseudomolecules or chromosomes.
Data availability
Supporting data for this Genome Report will be made available on datadryad.org. Raw sequences are available in the SRA database with number SRP132483 (submission SUB3624229, to be released upon publication) and Bioproject accession PRJNA433387. All scripts used in the analyses are available on GitHub (https://github.com/seb951/venustaconcha_ellipsiformis_genome).
Acknowledgments
Computations were made on the supercomputer briaree from Université de Montréal, managed by Calcul Québec and Compute Canada. The operation of this supercomputer is funded by the Canada Foundation for Innovation (CFI), the ministère de l'Économie, de la science et de l'innovation du Québec (MESI) and the Fonds de recherche du Québec - Nature et technologies (FRQ-NT).
Abbreviations
- BLAST: Basic Local Alignment Search Tool
- b: base pairs
- Kb: Kilobases
- M: Million
- Gb: Gigabases
- GB: gigabytes
- CPU: Central Processing Unit
- DNA: Deoxyribonucleic acid
- LINEs: Long interspersed elements
- LTR: Long terminal repeats
- ORF: Open Reading Frame
- N80/50/20: weighted median statistic such that 80/50/20% of the entire assembly is contained in contigs/scaffolds equal to or larger than this value
- L50: minimum number of sequences required to represent 50% of the entire assembly
- RAM: Random Access Memory
- SINEs: Short interspersed elements