The Genome of the Zebra Mussel, Dreissena polymorpha: A Resource for Invasive Species Research

The zebra mussel, Dreissena polymorpha, continues to spread from its native range in Eurasia to Europe and North America, causing billions of dollars in damage and dramatically altering invaded aquatic ecosystems. Despite these impacts, there are few genomic resources for Dreissena or related bivalves, with nearly 450 million years of divergence between zebra mussels and its closest sequenced relative. Although the D. polymorpha genome is highly repetitive, we have used a combination of long-read sequencing and Hi-C-based scaffolding to generate the highest quality molluscan assembly to date. Through comparative analysis and transcriptomics experiments we have gained insights into processes that likely control the invasive success of zebra mussels, including shell formation, synthesis of byssal threads, and thermal tolerance. We identified multiple intact Steamer-Like Elements, a retrotransposon that has been linked to transmissible cancer in marine clams. We also found that D. polymorpha have an unusual 67 kb mitochondrial genome containing numerous tandem repeats, making it the largest observed in Eumetazoa. Together these findings create a rich resource for invasive species research and control efforts.


Introduction
previously published 26 , but contained a gap which short-read sequencing and targeted PCR were unable to resolve. PacBio and Oxford Nanopore sequencing (Fig. 2c) reveals that this both mtDNAs, as the M-type is transmitted exclusively through male germline, while in somatic 175 tissues the F-type is predominant 53,54 .

191
characterized in an earlier study of HTT events 57 . Our phylogenetic analysis identified a 192 minimum of three HTTs leading to their spread to zebra mussels from marine bivalves (Fig. 3b),

193
including an event that is additional to and independent of the two HTTs identified previously 57 .

194
It is unknown whether SLEs are currently undergoing active transposition within zebra mussels.

195
However, the high levels of sequence similarity between Gag-pol regions of different SLE loci,196 and between the two LTRs of each SLE, indicates that the latest wave of transposition in this 197 genome was recent. We also identified numerous degenerate copies that are missing portions 198 of Gag-Pol or LTR sequences, as well as isolated LTR scars on most chromosomes 199 (Supplemental Fig. 8).
enrichment analysis also showed that the chitin binding molecular function was significantly 241 enriched in the mantle-specific genes, along with a number of peptidase inhibitors (Fig. 4d).

242
Among the most specific and highly expressed mantle genes in D. polymorpha were two 243 genes with sequence similarity to temptin, a pheromone that serves as a chemoattractant for 244 mating in Aplysia 70 . Zebra mussels attach to one another in clusters known as druses.

245
Settlement of larvae near adults 71 and gregarious post-settlement behaviors 72 create massive 246 aggregations on lake and river bottom. These behaviors increase settlement success, enable 247 "habitat engineering" in mussel beds 72 , and may enhance feeding and fertilization success 73-75 . 248 Further investigation will be required to determine if the D. polymorpha temptin orthologs serve 249 as aggregation cues or play some other unknown sensory role.

252
The fibers that zebra and quagga mussels use to anchor themselves to hard surfaces are 253 known as byssal threads. These are key innovations (absent from native North American and 254 European freshwater mollusks) used to attach to conspecific mussels, and to native unionid 255 mussels and other benthic animals, which can be smothered and outcompeted. Byssal 256 attachment to boat hulls, docks, boat lifts and other recreational equipment allows rapid rates of 257 spread between water bodies [76][77][78] . Expression of genes during byssogenesis has been studied 258 in zebra mussels 27 but a majority of mRNAs that are up or down-regulated could not be 259 identified.

260
Previous work identified a full byssal protein cDNA sequence (named Dpfp1) 79,80 and peptide 261 fragments from a second foot byssal protein 81 . More recent proteomic work also identified 262 peptide tags associated with several D. polymorpha foot proteins (Dpfps) that are secreted by 263 the foot and together form the stem, threads and attachment plaques (Fig. 5) of the byssus 82,83 .

264
The zebra mussel genome provided significant additional information to identify full-length 265 proteins corresponding to the Dpfp peptide tags and to gain further insight into their function 266 (Fig. 5). BLAST was used to align known Dpfp polypeptides against the genome-predicted 267 protein sequences to determine the full Dpfp's (Supplemental File 12). Dpfp2 and Dpfp12 were 268 found to be the C-and N-terminal regions, respectively, of a single protein (Dpfp2). Similarly,

269
Dpfp6 was found to be the C-terminal region of Dpfp1. Surrounding genomic regions on 270 chromosome 8 were searched, and multiple exons were identified that are likely spliced to 271 create mature Dpfp1 and Dpfp 6 (Supplemental File 12). Complete coding sequences were resolved for Dpfp5 and Dpfp8, and Dpfp8 was found to lack a signal peptide. Dpfp8 shows 273 similarity to Staphylococcus saprophyticus cell wall associated fibronectin-binding protein negative regulator of TNF-signaling). The TNF pathway regulates inflammation and apoptosis, 288 suggesting that production of the byssal thread may induce stress in the surrounding tissues 289 and that this stress response may be actively suppressed. Consistent with this, both a cytokine 290 receptor and the pro-apoptotic Bcl2-like gene are down regulated at the day-four time point.

291
While earlier expression studies found otherwise 27,82,83 some byssal proteins were absent from 292 our differentially expressed gene set. And while some of these proteins are differentially 293 distributed across the byssus, localized expression in the foot has not been studied.

294
Nevertheless, one explanation is that our dissections missed the secretory cells more proximal 295 to the threads, a possibility that awaits testing.

297
Thermal tolerance and chronic heat stress

298
In Dreissena, broad thermal tolerance and ability to adjust to local conditions have clearly 299 played a role in invasion success. Zebra mussels have higher lethal temperature limits and 300 spawn at higher water temperatures in North America than in Europe 84,85 . In the Lower

301
Mississippi River zebra mussels are found south to Louisiana. There they lack cooler water 302 refuges, and persist near their lethal limit of 29-30°C for 3 months during the summer, while for months 87 . Seasonal scheduling of growth and reproductive effort appears to be responsible for 306 at least some of the adaptation or acclimation to conditions in the lower river, as populations in 307 Louisiana shift their shell and tissue growth to the early spring and stop growing in summer 86 308 while more northerly populations grow tissue and spawn in summer months 88,89 .

309
In order to identify genes involved in the response to thermal stress, we generated 310 transcriptomes from gill tissue in animals exposed to periods of low (24°C), moderate (27°C), 311 and high (30°C) chronic temperature stress (Fig. 6a). Moderate thermal stress led to the 312 induction of a number of genes involved in cellular adhesion or cytoskeletal remodeling,

324
Here we describe the genome of the zebra mussel. Consistent with the genomes of other 325 bivalves, the D. polymorpha genome is highly repetitive and encodes an expanded set of heat-326 shock and anti-apoptotic proteins, presumably to deal with the challenges of a sessile existence.

327
We examine the genetic underpinnings of several traits that have been linked to population 328 growth and invasiveness, including shell and byssal thread formation, and response to thermal 329 stress. While these analyses uncovered multiple genes and pathways that seem to function in a 330 conserved manner across multiple bivalve species, they also uncovered a large number of 331 genes of unknown function. In the future, it will be of considerable value to compare the zebra 332 mussel genome with that of its congener, the quagga mussel (D. rostriformis), in order to gain 333 further insights into ecological displacement of zebra by quagga mussels, and to investigate 334 genetic underpinning of their relative invasiveness, such as comparative work on byssogenesis 335 that may help account for the slower geographic spread of quagga mussels.

336
The existence of genomic resources for D. polymorpha and the catalog of genes we have 337 identified will enable multiple new lines of investigation, as well as provide researchers with an 12 improved tool for population genetic experiments, for instance, tracking the spread of mussels 339 using Genotyping-by-Sequencing (GBS) approaches, or designing new targeted assays for the 340 presence or activity of zebra mussels.

341
While it is clear that changes in transportation networks (e.g. canal building, opening of 342 shipping channels, ballast water discharge) were the events that initiated primary invasions of 343 European and North American waters 5,90 , several biological characteristics are responsible for 344 the rate of spread of zebra and quagga mussels across both continents, while other traits have 345 limited their suitable habitat range. Genomics offers a path to understanding these traits at the 346 genetic level, which may ultimately guide the development of control methods and management 347 strategies.

350
Zebra mussel individuals were collected by SCUBA from off the Duluth waterfront beach 351 (46.78671°N, -92.09114°W), in Lake Superior in June of 2017. Mature adults were dissected.

352
To sex the animals, gonad squashes were prepared and examined under a compound 353 microscope for gametes, and a set of large males (25-30 mm shell length) were selected for 354 genome sequencing and analysis. Genomic DNA was extracted using the Qiagen Genomic Tip 355 100/G kit, with all tissues (except gut) from each selected individual split across six total 356 extractions to prevent clogging of Genomic Tips. Pooled extractions from one chosen individual 357 yielded >100 ug genomic DNA as assessed by PicoGreen DNA quantification (ThermoFisher).

358
The Agilent TapeStation Genomic DNA assay indicated that the majority of gDNA extracted was 359 well over 20kb (not shown). Further analysis by Pulsed-Field Gel Electrophoresis indicated a 360 broad distribution from 20-120kb, with a modal size of 40kb (not shown).

362
Thirty µg of gDNA was sheared by passing a solution of 50 ng/uL DNA through a 26G blunt-363 tipped needle for a total of 20 passes. This sheared DNA was cleaned and concentrated using 364 AMPurePB beads with a 1X bead ratio, and further library preparation was performed following was checked twice daily with digital probes. Mussels were fed 1.8 ml per tank of liquid shellfish 417 diet (Reed Mariculture, Campbell CA) once daily, with water flow shut off for 1.5 hours for 418 feeding. Tank temperatures were raised to 24-25°C over three days by mixing in heated well 419 water; then temperatures were held constant over seven days for acclimation.

421
Experimental treatments followed, with each group of 4 tanks raised 1°C per day (using a 200 422 W aquarium heater in each tank) to target temperatures of 25, 27 and 30°C then maintained at 423 target for seven days. For gill transcriptomes, two mussels per each of four treatment tanks 424 were removed, then both ctenidia were dissected and preserved in 750 µL RNAlater per animal 425 at -20°C. For foot, mussels from Lake Waconia, attached firmly to rocks and maintained for 426 seven days in each of two of the 25°C tanks above were selected. Byssal threads were severed 427 where they enter the shell valves to induce byssus growth and reattachment. Immediately 428 thereafter, foot tissue (distal tip region) was dissected from each of eight animals (for a time-429 zero control) and preserved in RNAlater. Byssus-cut animals were painted with nail polish and 430 placed onto rocks in each of two tanks at 25°C. Mussels that firmly attached overnight were 431 observed for four days and eight days after reattachment, and four firmly attached mussels per 432 time point were selected and foot tissue was dissected and preserved as above.

486
Chromatin conformation capture data was generated using a Phase Genomics (Seattle, WA)

487
Proximo Hi-C Animal Kit v1.0, which is a commercially available version of the Hi-C protocol 91 .

488
Following the kit protocol, intact cells from two samples were crosslinked using a formaldehyde 489 solution, digested using the Sau3AI restriction enzyme, and proximity-ligated with biotinylated 490 nucleotides to create chimeric molecules composed of fragments from different regions of the 491 genome that were physically proximal in vivo, but not necessarily proximal in the genome.

492
Continuing with the manufacturer's protocol, molecules were pulled down with streptavidin

497
Reads were aligned to the draft assembly also following the manufacturer's recommendations 498 98 . Briefly, reads were aligned using BWA-MEM 99 with the -5SP and -t 8 options specified, and 499 all other options default. SAMBLASTER 100 was used to flag PCR duplicates, which were later 500 excluded from analysis. Alignments were then filtered with samtools 101 using the -F 2304 501 filtering flag to remove non-primary and secondary alignments and further filtered with matlock 502 102 (default options) to remove alignment errors, low-quality alignments, and other alignment 503 noise due to repetitiveness, heterozygosity, and other ambiguous assembled sequences.

505
Phase Genomics' Proximo Hi-C genome-scaffolding platform was used to create chromosome-506 scale scaffolds from the corrected assembly as described in Bickhart et al. 37 . As in the 507 LACHESIS method 103 , this process computes a contact frequency matrix from the aligned Hi-C 508 read pairs, normalized by the number of Sau3AI restriction sites (GATC) on each contig, and 509 constructs scaffolds in such a way as to optimize expected contact frequency and other 510 statistical patterns in Hi-C data. Approximately 140,000 separate Proximo runs were performed 511 to optimize the number of scaffolds and construction of to make them as concordant with the 512 observed Hi-C data as possible. This process resulted in a set of 16 chromosome-scale Mapping of PacBio reads to an initial Canu assembly for the mitochondrial genome indicated a 518 small region of very high coverage (Supplemental Fig. 5). An alternate assembly of the 519 mitochondrial genome was substituted which was generated in parallel in FALCON 0.5 520 (length_cutoff = -1, seed_coverage = 30, genome_size = 2.7G) and which did not collapse this 521 repeat sequence. This assembly was polished for indels via Pilon using Illumina reads as with 522 the nuclear genome, and a single substitution error in the coding region was manually edited 523 (c.14475 C>A, G184W) based on strong support from Illumina reads (data not shown). The 524 mitochondrial genome was annotated based a previously published partial mitochondrial 525 sequence 26 in Geneious using the "Annotate from Database" function with a 98% similarity 526 cutoff. The origin point was set to place the tRNA-Val annotation at base 48, matching the 527 previously published sequence.

529
PacBio and Nanopore reads were mapped against a reference file containing two concatenated 530 copies of the mitochondrial genome sequence, in order to allow reads to map across the origin.

531
Alignments were generated with minimap2 -ax using settings map-pb and map-ont, 532 respectively. Visualization of the resulting alignments ( Fig. 2c) was performed using a custom

537
Hi-C analysis of the mitochondrial contig 538 Ten contigs ranging in size from 50kb to 100kb were selected from each of the pseudo-539 chromosome scaffolds. The total number of Hi-C contacts between each selected contig and 540 each pseudo-chromosome was determined. The same analysis was performed using the 541 mitochondrial contig, then all Hi-C link counts were normalized by dividing the number of 542 contacts between a contig and pseudo-chromosome by the total number of Hi-C contacts 543 associated with the contig. The resulting normalized data was visualized using ggplot2 to 544 develop boxplots that compare the number of links for contigs based on their association with 545 each pseudo-chromosome. database of ribosomal RNA was downloaded from SILVA 104-106 , restricting the entries to 550 Bivalvia. The combined RNAseq reads were cleaned of putative ribosomal RNA sequences 551 using "BBDuk" from the BBTools suite of scripts 107 , treating the Bivalvia ribosomal RNA as 552 potential contaminants, using a k-mer size of 25bp and an edit distance of 1. Reads that passed 553 this filter were then assembled with Trinity 2.8.4 108 with a "RF" library type, in silico read 554 normalization, and a minimum contig length of 500bp. Assembled transcripts from Trinity were 555 then searched against the non-redundant nucleotide sequence database hosted by NCBI, 556 current as of 2018-10-09. A maximum of 20 target sequences were returned for each transcript, 557 restricted by a minimum of 10% identity and a maximum E-value of 1 x 10 -5 . Assembled 558 transcripts that matched sequences derived from non-eukaryotes or synthetic constructs were 559 discarded.

562
RNAseq reads were checked for quality issues, adapter content, and duplication with FastQC 563 0.11.7. Cleaning for sequencing adapters, trimming of low-quality bases, and filtering for length 564 were performed with Trimmomatic 3.3 109 . The adapter sequences that were targeted for 565 removal were the standard Illumina sequencing adapters. Quality trimming was performed with 566 a window size of 4bp and a minimum mean quality score of 15. Reads that were shorter than 567 18bp after trimming were discarded.

569
Reads were aligned to the HiC-scaffolded genome assembly draft with HISAT2 2.1.0 110 , with 570 putative intron-exon boundaries inferred with genes with functional annotation from the draft 571 annotation and a bundled Python script. Read pairs in which one read failed quality control were 572 not used in alignment and expression analysis. BAM files from HISAT2 were cleaned of reads 573 with a mapping quality score of less than 60 with samtools 1.7. Cleaned alignments were used 574 to generate expression counts with the featureCounts program in the Subread package v. 1.6.2 575 111 . Both reads in a pair were required to map to a feature and be in the proper orientation for 576 them to be counted. Raw read counts were imported into R 3.5.0 112 for analysis with edgeR 577 3.24.3 113 . Genes that were less than 200bp were removed from the counts matrix. Tests for 578 differential expression were performed between experimental conditions within tissue. For each 579 tissue, genes with low expression were filtered in the following way: genes in which at least X 580 samples with fewer than 10 were removed, where X is the size of the condition with the fewest 581 replicates. Tests for differential expression used a negative binomial model for dispersion 582 estimation, and genes showing significant levels of differential expression were identified with a quasi-likelihood F test implemented in edgeR 114 . Genes were identified as differentially 584 expressed if they had a nominal P-value of less than 0.01 in the output from the 'glmQLFTest'

602
A sequence amplified from D. polymorpha using SLE-targeting degenerate primers 57 was used 603 as the basis for an initial BLAST search of the genome assembly. Dotplots of the sequence 604 surrounding hits were analyzed to identify fifty putative LTR sequences, and these were aligned 605 to build a consensus LTR sequence specific to our assembly. A subsequent BLAST search with 606 this consensus sequence was performed, and surrounding sequence context was examined for selected based on the AIC option in SMS 117 , using the option to estimate amino acid 617 frequencies from the data. A maximum likelihood genealogy was built using PhyML 118 , using the 618 NNI tree topology search and the BIONJ starting tree options, and support for nodes was 619 evaluated based on 100 bootstrap replications.

621
Next, we used DNA sequence genealogies to further investigate whether HTT events led to 622 insertions of 20 SLEs that we found in the zebra mussel genome that contained 2 LTRs flanking 623 an intact Gag-Pol ORF, including the 8 elements above. From GenBank, we downloaded 624 sequences from multiple bivalve species, from the region located between the RNase H and 625 integrase domains of Gag-Pol that was amplified using degenerate primers 57 . We added 3 626 sequences of long ORFs from Gag-Pol that were cloned from neoplastic tissue 119 , 3 that were

Comparison to Eastern Oyster (Crassostrea virginica) Proteins
Zebra mussel genes with functional annotation information were used to identify groups of orthologous genes with Eastern Oyster (Crassostrea virginica). Annotated protein sequences 654 from C. virginica were downloaded from the C_virginica-3.0 assembly and annotation hosted on

663
First, from the preliminary set of scaffolds, annotations were generated in GFF format as with a custom Python script that compares the sequences that correspond to the gene regions 670 on each assembly. Gene sequences were extracted from the assemblies by strand-aware 671 retrieval of the "gene" features in the GFF3 file; that is, if a gene is annotated on the "minus" 672 strand, then the sequence is reverse-complemented. The resulting sequences were then 673 compared between assembly versions to ensure that the re-calculated annotations 674 corresponded to exactly the same sequence. In cases of sequence mismatch, the sequence 675 based on re-calculated annotations was reverse-complemented to diagnose strand conversion 676 issues. Annotations which spanned more than one contig in the preliminary set of scaffolds 677 were discarded during this process; it is likely such annotations, which spanned placeholder 678 unestimated gaps between contigs, were spurious calls 134 . The script used for porting the 679 annotations is available here: https://github.com/phasegenomics/annotation_mover.

747
Arrows label HTT events 1 and 2, identified previously 57 and HTT 3, which we identified based 748 on the same criteria. Together these account for two independent insertions of SLEs into zebra 749 mussels. Clade D contains SLE sequences from the zebra mussel genome; "D. polymorpha C" 750 = chromosomal location of the SLE, with letters to order multiple insertion sites. Taxon labels 751 include NCBI Accession number, taxon, followed by isolate number or code. * = Sequence is

897
Mantle_DEGs.csv 898 899 Supplemental File 9. Differentially expressed genes in the gill in mussels exposed to 900 different thermal stress conditions.

976
Differences in growth and survivorship of zebra and quagga mussels: size matters.          Tenascin-R-like DPMN_149963 Chitinase-3 DPMN_104042 Tyrosinase-like protein DPMN_030364 Pacifastin-related precursor DPMN_128636 Deleted in malignant brain tumors 1 DPMN_038422 Leucine-rich repeat-containing protein Figure 5 a b a Figure 6 b c