Pathways to polar adaptation in fishes revealed by long-read sequencing

Long-read sequencing is driving a new reality for genome science where highly contiguous assemblies can be produced efficiently with modest resources. Genome assemblies from long-read sequences are particularly exciting for understanding the evolution of complex genomic regions that are often difficult to assemble. In this study, we leveraged long-read sequencing data to generate a high-quality genome assembly for an Antarctic eelpout, Opthalmolycus amberensis, the first for the globally distributed family Zoarcidae. We used this assembly to understand how O. amberensis has adapted to the harsh Southern Ocean and compared it to another group of Antarctic fishes: the notothenioids. We showed that selection has largely acted on different targets in eelpouts relative to notothenioids. However, we did find some overlap; in both groups, genes involved in membrane structure, thermal tolerance, and vision have evidence of selection. We found evidence for historical shifts of transposable element activity in O. amberensis and other polar fishes, perhaps reflecting a response to environmental change. We were specifically interested in the evolution of two complex genomic loci known to underlie key adaptations to polar seas: hemoglobin and antifreeze proteins (AFPs). We observed unique evolution of the hemoglobin MN cluster in eelpouts and related fishes in the suborder Zoarcoidei relative to other Perciformes. For AFPs, we identified the first species in the suborder with no evidence of afpIII sequences (Cebidichthys violaceus) in the genomic region where they are found in all other Zoarcoidei, potentially reflecting a lineage-specific loss of this cluster. Beyond polar fishes, our results highlight the power of long-read sequencing to understand genome evolution.

data to generate a high-quality genome assembly for an Antarctic eelpout, Opthalmolycus 30 amberensis, the first for the globally distributed family Zoarcidae. We used this assembly to 31 understand how O. amberensis has adapted to the harsh Southern Ocean and compared it to 32 another group of Antarctic fishes: the notothenioids. We showed that selection has largely acted 33 on different targets in eelpouts relative to notothenioids. However, we did find some overlap; in 34 both groups, genes involved in membrane structure, thermal tolerance, and vision have 35 evidence of selection. We found evidence for historical shifts of transposable element activity in 36 O. amberensis and other polar fishes, perhaps reflecting a response to environmental change. 37 We were specifically interested in the evolution of two complex genomic loci known to underlie 38 key adaptations to polar seas: hemoglobin and antifreeze proteins (AFPs). We particularly valuable for understanding phenotypes that stem from complex, difficult to assemble 51 regions of the genome (e.g., highly repetitive regions). For instance, the evolution of antifreeze 52 proteins (AFPs)-proteins that adsorb to ice and limit its growth (Davies,Baardsnes,Kuiper,& 53 Walker, 2002; X. Li et al., 1985)-has led to a wide array of teleost fishes thriving in marine 54 habitats that are commonly below freezing and ice-laden (Hobbs,Hall,Graham,Davies,& 55 cooling of the Southern Ocean-making these groups an ideal system for studying potential 107 convergent evolution to the contemporary, subfreezing conditions of polar seas. 108

109
In this study, we used long-read sequencing to generate a high-quality genome assembly for an 110 Antarctic eelpout, Ophthalmolycus amberensis, the first for the family Zoarcidae. We then 111 leveraged our new genomic resource to better understand how fishes have adapted to the 112 Southern Ocean through whole-genome perspectives on genome evolution and fine-scale 113 investigation of synteny and copy number variation in key adaptive regions (i.e., hemoglobin 114 and AFPs). Collectively, our results highlight convergent and species-specific mechanisms 115 underlying the adaptation of fishes to the harsh conditions of the Southern Ocean. Our study 116 also provides practical evidence for the power and promise of long-read sequencing for 117 understanding genome biology, particularly for highly repetitive and/or complex genomic 118 regions.  v2.14-r883 (Li 2018) and the options: -ax -map-pb. Next, we sorted the resulting BAM file of 192 aligned reads with samtools v1.9 (Li et al. 2009) and default settings. We used the 'readhist' 193 function of Purge Haplotigs to generate a coverage histogram of the mapped PacBio reads. We 194 set low (-l), middle (-m), and high (-h) in the coverage distribution of 10, 140, 195, respectively, 195 and ran the second step in the Purge Haplotigs pipeline, 'contigcov', to analyze coverage on a 196 contig-by-contig basis. Finally, we purged duplicate contigs using the function 'purge' to create a 197 de-duplicated assembly. The resulting assembly (v2) was substantially smaller at 679 Mb with 198 84.5% fewer duplicated genes (n = 183). 199

200
We then polished the v2 assembly using our raw PacBio reads with Arrow, an algorithm in the 201 variantCaller program within the Falcon software (Chin et al. 2013(Chin et al. , 2016. Since this approach is 202 not deterministic (i.e., repeated runs will produce slightly different results), we performed two 203 rounds of polishing with Arrow (resulting in assemblies v3 and v4). For each round, we first re-204 mapped our raw PacBio reads to the latest assembly version (e.g., to make v3, we mapped raw 205 PacBio reads to the v2 assembly) with Minimap2 v2.14-r883 (Li 2018) and the same flags as 206 above. Next, we converted the BAM file of mapped reads to PacBio format with 'pbbamify' (part 207 of the PacBio toolkit) with default flags. We sorted the re-formatted BAM file with samtools v1.9 208 as above (Li et al. 2009). We created an index of the mapped, sorted BAM with another PacBio 209 tool, 'pbindex'. with options est2genome = 1 and protein2genome = 1 in the Maker2 control file. Next, we used 227 the gene finding program, SNAP (Korf, 2004), to generate ab initio gene predictions for our 228 assembly. We first merged all of our gene models from our de novo Maker2 together into a 229 single GFF3 using the Maker2 tool "gff3_merge" and converted them to ZFF format with 230 "maker2zff." We trained SNAP on these data by first breaking up our sequences into one gene 231 per locus (with 1000 bp on either side of the predicted gene) using "fathom" (part of SNAP). We 232 estimated parameters for our genes using "forge" (also included with SNAP) before building a 233 Hidden Markov Model for our annotations with the script "hmm-assembler.pl". After SNAP 234 training, we re-ran Maker2 with our trained SNAP models by adding them in the control file 235 ("snaphmm" option) and setting est2genome and protein2genome both to 0. We performed a 236 second round of SNAP training using the same approach followed by running Maker2 for a third 237 time. Next, we prepared to train another gene predictor, Augustus v2.5.5 (Stanke & Waack,238 2003), on our gene models by first merging the models produced in the second round the SNAP 239 training using "gff3_merge", converting them to ZFF with "maker2zff", and then to GBK format 240 with a custom perl script (https://github.com/hyphaltip/genome-241 scripts/blob/master/gene_prediction/zff2augustus_gbk.pl). As Augustus is a machine learning 242 approach, we first split our data set into a training and test set using the script "randomSplit.pl." 243 We then created a new "species" for Augustus with the script "new_species.pl" then performed 244 training by running the "etraining" function on our training data set followed by evaluating its 245 performance with the "augustus" tool. We optimized our prediction parameters with the script 246 "optimize_augustus.pl." After Augustus training, we re-ran Maker2 as above with the addition of 247 the control file option "augustus_species." Finally, we ran GeneMark for the O. amberensis 248 genome assembly with the self-training algorithm using flag -ES. We then performed a final run 249 of Maker2 with all of the gene model evidence included by adding the path to our GeneMark 250 trained predictions using "gmhmm" to our control file along with the above evidence for 251 Augustus ("augustus_species") and SNAP ("snaphmm"). We kept est2genome and 252 protein2genome as 0 for this final run. Actinopterygii and the custom library from RepeatModeler for each species. We summarized 272 repeat abundance by parsing output from RepeatMasker and generating plots in R v3.5.1 (R 273 Core Team, 2021) using custom scripts. To investigate patterns of transposable element (TE) 274 activity through time we generated TE copy number divergence plots. We calculated Kimura 275 substitution level (CpG adjusted) for transposable elements using the RepeatMasker utility 276 script "calcDivergenceFromAlign.pl", removed non-transposable element and unclassified 277 repeats from the resulting ".divisum" file, and generated age distribution landscapes using 278 RepeatMasker's "createRepeatLandscape.pl" script. These plots are generated by comparing 279 each TE copy to a reference sequence for that element. write those isoforms to a FASTA file (agat_sp_extract_sequences.pl). We then translated CDS 294 to protein using the function "transeq" in EMBOSS v.6.6.0 (Rice, Longden, & Bleasby, 2000). 295 We identified orthologous groups of single-copy coding sequences using the OrthoVenn2 web For the CAFE analysis, we specified significance-i.e., the level at which a gene family change 372 is considered rapid-at P < 0.01. We visualized gene ontology terms for rapidly evolving 373 families using the online server of REVIGO orthologous genes (BUSCOs) in the 4,584 gene reference set for Actinopterygii (Fig. 1D). With 394 the same BUSCO gene set used for assessing the genome assembly completeness, the O.

438
Transposable element (TE) copy divergence plots in O. amberensis suggested evidence of 439 dynamic shifts in transposable activity through time (Fig. 2). LINEs dominated recent and 440 ongoing TE activity (Fig. 2C), which was driven by recent expansion of L2 and Rex/Babar 441 elements, accounting for much of the LINE abundance (blue peak between K = 0-5; Fig. 2C, 442  Fig. 2C, Fig. S1). Three other polar species, C. aceratus, P. charcoti, and G. morhua, 453 all had evidence for ancient TE expansions similar to the Helitron expansion in O. amberensis, 454 however in these species LINEs and LTRs drove the expansion (Fig. 2C, Fig. S1).  In O. amberensis and across Zoarcoidei, the hemoglobin LA cluster has remained stable with 485 the typical organization of two alpha-globin and one beta-globin genes (Fig. 4). However, 486 among perciformes, the beta-gene of the LA cluster, hbbla, was repeatedly lost or 487 pseudogenized in some species, including in the Zorcoidei monkeyface prickleback 488 Cebidichthys violaceus (Fig. 4b). In contrast, the MN cluster in perciformes underwent dynamic 489 changes in a lineage specific fashion and harbored up to 13 alpha-globin and 15 beta-globin 490 genes in the Longspined bullhead sculpin (Taurulus bubalis) and numerous instances of 491 pseudogenization, especially frequent in the Orangethroat darter Etheostoma spectabile (Fig.  492 4c). The MN cluster organization could not be analyzed in the wolf-eel A. ocellatus due to the 493 lack of contiguity in the assembly at this locus. Within Zoarcoidei, however, the MN cluster 494 composition remained relatively stable among the three studied species with three alpha-globin 495 genes and two or three functional beta-genes. A unique feature of the Zoarcoidei relative to the 496 other studied species is the presence of a hbbmn2 pseudogene in C. violaceus and P. 497 gunnellus and the loss of this gene in O. amberensis (Fig. 4c). 498 499 500 501

512
Basic metrics for the genome assemblies included in this analysis are provided in Table S1. Complete 513 annotations for the LA and MN clusters are provided in Tables S2 and S3, respectively.

515
For the afpIII region in Zoarcoidei between sncgb and ldb3b, we found no evidence of afpIII 516 genes in the outgroup, G. aculeatus, nor in one of our in-group species, C. violaceus (family 517 Stichaeidae; Fig. 5). Among the four species with copies of some portion of an afpIII gene 518 (either exon 1, exon 2, or complete copies of both exons 1 and 2 in order), copy number varied 519 widely. Lycodicthys dearborni has at least 23 complete copies of afpIII with some copies of exon 520 1 in isolation scattered throughout the same region (Fig. 5). Lycodicthys dearborni also has two 521 psuedogenized afpIII copies (Fig. 5). Similar to L. dearborni, O. amberensis also had a high 522 number of complete afpIII copies (at least 13) with some copies of exon 1 in the same region. 523 Pholis gunnellus has at least 15 complete afpIII copies as well as partial copies. Finally, A. 524 ocellatus has evidence of afpIII genes but with four complete afpIII copies spread over four of 525 the six scaffolds including components of the afpIII gene, it is unclear how many copies are truly 526 present (Fig. 5). This is primarily because, unlike L. dearborni for instance, no A. ocellatus 527 scaffold has evidence of both afpIII genes and flanking regions. With our short-read, DNA-seq 528 assembly for O. amberensis, we confirmed the useful but coarse nature of inference stemming 529 from short-read data for the study of complex genomic regions. Indeed, similar to A. ocellatus, 530 we found evidence for five partial copies of afpIII spread across five different scaffolds. The

543
dearborni which is not a genome assembly) are provided in Table S1. Complete annotations for all afpIII 544 copies are provided in Table S4 of long-read sequencing has improved the contiguity of difficult to assemble, repeat-rich portions 562 of genomes. Indeed, genome assemblies that integrate long-reads are on average 48x more 563 contiguous than those that do not (Hotaling, Sproul, et al., 2021). In this study, we generated 564 and compared a new, high-quality genome assembly for O. amberensis to other polar and non-565 polar fishes with an emphasis on aspects that are difficult or impossible to assess with short-566 reads alone (repetitive elements and repeat-rich, dynamic loci). Collectively, our results highlight 567 mawsoni, but not in other polar fish lineages, highlighting the dynamic evolution of these 588 elements in notothenoids (Fig. S1). Heating and cooling cycles in polar seas have also been 589 hypothesized to induce environmental stress which in turn drives bursts of TE activity (Auvinet 590 et al., 2018). We found that across polar-adapted lineages, four of six species (O. amberensis, 591 C. aceratus, P. charcoti, and G. morhua) exhibited evidence consistent with ancient TE bursts of 592 similar relative timing (i.e., similar in timing to the peak of Helitron activity in O. amberensis). 593 This modest, but distinct pattern is observed independently in three lineages that lack of 594 phylogenetic (and spatial) overlap (Fig. 2, Fig. S1). While it is premature to assume that these 595 signatures reflect historical, climate-induced shifts that affected fishes at both poles, we view 596 this pattern an interesting preliminary finding that merits additional investigation. Future efforts 597 to integrate dense taxonomic sampling will allow for contextualizing links between TE activity, 598 environmental stress, and genome evolution, particularly as they relate to ancient and 599 contemporary environmental change. other key processes and structures (Rothfield, 2014). Moreover, because environmental 615 conditions often play a role in determining how lipids-a component of biological membranes-616 arrange themselves (Rothfield, 2014), it is unsurprising that membranes are important for 617 thermal adaptation in a wide variety of organisms (Hazel, 1995). Fishes living in the most 618 consistently cold marine habitats on Earth have experienced positive selection on genes related 619 to membrane structure and function (e. g., esrp1, ntn4, snx18). Generally speaking, membrane 620 fluidity is maintained as temperature changes, a phenomenon known as homeoviscous 621 adaptation (Hazel, 1995). However, the extreme, but stable, cold temperatures of polar seas 622 means that maintenance of a homeoviscous adaptation response to changing conditions should 623 not hold any fitness benefit (Cossins & Prosser, 1978). Still, polar fish membranes must function 624 at low temperature and directional selection may have acted on this trait. For instance, one 625 target of selection in polar fishes, snx18, in concert with fip5, is involved in the morphogenesis 626 of the epithelial lumen including the formation of membrane tubules (Willenborg et al., 2011). 627 While the exact role of selection on membrane-associated genes in polar fish remains to be 628 determined, it is plausible that at least one driver is functioning in the cold. where they reside in the water column. Highlighting the power of long-reads for genomic inquiry, 648 we identified rapidly evolving gene families that were also linked to vision. For instance, three 649 additional copies of gja3 are present in O. amberensis relative to other species in the analysis 650 and all five copies are arrayed in a ~135-Kb region of the same scaffold (tig00024373). In 651 humans, GJA3 has been implicated in vision defects (Graw, 2003). In addition to gene copy 652 expansions in vision-related genes, we also detected a shared signature of selection on vision-653 associated genes (iars2, nid2a, and atxn7) in two disparate lineages of polar fish. In mammals, 654

Evolution of hemoglobin genes in Zoarcoidei 663
Hemoglobin is a major trait for adaptation to different environments in fishes (Powers, 1980;664 Verde, Giordano, di Prisco, & Andersen, 2012). Because the capacity of hemoglobin to bind 665 oxygen is sensitive to environmental conditions such as temperature and pH, the molecular 666 evolution of hemoglobin genes has been proposed to be necessary for fish to maintain 667 respiratory homeostasis in new or changing environments (Powers, 1980). Vertebrate 668 hemoglobin is a tetrameric protein composed of two alpha-globin and two beta-globin subunits. 669 Combinations of different alpha and beta subunits can result in multiple hemoglobin tetramers 670 with different functions. This "hemoglobin multiplicity" may also be a source of adaptive potential 671 (Brix, Clements, & Wells, 1999;Houston & Gingras-Bedard, 1994;Verde et al., 2012). 672 Hemoglobin multiplicity may be, at least in part, linked to hemoglobin gene copy number. For 673 instance, cod fishes living in shallow and unstable environments have more hemoglobin gene 674 copies than species living in more stable deep-sea environments (Baalsrud, Voje, et al., 2017). 675 The Southern Ocean is a chronically cold and hyper-oxygenated environment, yet stable marine 676 habitat. In this environment, the suborder Notothenioidei, which includes the white-blooded 677 Antarctic icefishes lacking functional hemoglobin (Kim et al., 2019;Near et al., 2006), are the 678 most common fish species. A reduction of hemoglobin multiplicity prior to cold adaptation 679 followed by a decline in oxygen affinity has been documented in cold-adapted notothenioids 680 (Verde, Parisi, & di Prisco, 2006 followed dramatically different evolutionary paths (Fig. 4). For the most part, the LA cluster has 690 remained stable across species, with two alpha and one beta-globin genes (Opazo et al. 2013), 691 and only a few instances of pseudogenization or loss of the beta-globin gene (Fig. 4a). In 692 contrast, the MN cluster showed dramatic lineage-specific evolution, in line with other teleosts 693 (Opazo et al., 2013). In most species in our comparison, the MN cluster expanded by multiple 694 tandem duplications with frequent pseudogenization. In contrast, the genetic organization of the 695 MN cluster has been relatively stable within Zoarcoidei, with three alpha-globin and two or three 696 beta-globin genes (Fig. 4b) cluster is also remarkable as the studied species inhabit similar environments to other species 703 with highly variable MN clusters. This suggests that adaptation in Zoarcoidei hemoglobin may 704 occur at the gene sequence level rather than duplication as seen in these other perciformes. 705 The study of hemoglobin should be pursued in more Zoarcoidei species, including in the 706 globally distributed eelpout genera (Hotaling, Borowiec, et al., 2021), to better understand the 707 molecular mechanisms underlying adaptation to environmental conditions. 708 Ocean. However, understanding of AFP evolution within the suborder Zoarcoidei-for which 719 eelpouts comprise the most speciose family (Zoarcidae)-remains incomplete, largely due to 720 the complexity of its history with many species and isoforms evolving at both poles and the 721 complexity of the genomic region underlying it. Recent phylogenetic evidence suggests that 722 afpIII arose only once, early in the evolution of the Zoarcoidei at least 18 million years ago 723 (Hobbs et al., 2020). The lack of afpIII evidence in the typical region of the genome for 724 Zoarcoidei for monkeyface prickleback C. violaceus (family Stichaeidae) poses an additional 725 question about the evolution of AFPs in Zoarcoidei. To our understanding, every previously 726 surveyed species in the suborder has exhibited apfIII sequence evidence (at least 14 species 727 from six families, Hobbs et al., 2020). Given that C. violaceus is part of the Zoarcoidei "in group" 728 (Hotaling, Borowiec, et al., 2021), its lack of similar apfIII evidence suggests that most likely the 729 afpIII gene has been secondarily lost in C. violaceus. Other less likely possibilities are that the 730 locus did not arise only once at the base of the clade or it has been horizontally transferred 731 within the clade (e.g., Graham, Lougheed, Ewart, & Davies, 2008). It is also possible that this 732 finding could be due to assembly error, but this is unlikely given the assembly's high contiguity 733 (scaffold N50 = 6.72 Mb, Table S1), lack of N's in the predicted afpIII region, and contiguity of 734 flanking regions (Fig. 5). 735 736 Gene duplication, rearrangement, and modification has long obscured the study of AFP 737 evolution (e.g., Deng et al., 2010). afpIII copy number varies considerably between closely 738 related species (Scott, Fletcher, & Davies, 1986) and even within species across their 739 geographic range (Hew et al., 1988). Short-read genome assemblies limit afpIII copy number 740 inference. For instance, a BLAST search of a short-read assembly for O. amberensis identified 741 five partial afpIII copies spread across five different scaffolds. While there are more robust ways 742 to detect gene copy numbers from short-read genome assemblies (e.g., through liberal BLAST 743 searches and phylogenetics, Baalsrud, Tørresen, et al., 2017), the reality that genomic 744 organization cannot be resolved remains. Long-read assemblies can largely ameliorate this 745 issue. Indeed, following the massive improvements of long-read sequencing for genome 746 assembly, a complete sequence of the complex afpIII locus has now been produced (for P. We have entered an exciting time for genome biology. It is now possible to characterize 755 complex regions of genomes to better understand evolution across virtually any group of 756 interest. In this study, we used long-read sequencing to generate a high-quality reference 757 assembly for an Antarctic eelpout, O. amberensis, the first for the group. We used this new 758 genomic resource to better understand the evolution of O. amberensis in the Southern Ocean, 759 particularly as it relates to other Antarctic fish specialists (i.e., notothenioids) and the rest of the 760 suborder Zoarcoidei. We showed evidence of historical shifts in TE activity in O. amberensis 761 and other polar fishes, perhaps reflecting an ancient burst of TE activity in response to 762 environmental change. Moreover, we found little overlap between the targets of natural 763 selection in O. amberensis and in notothenioids inhabiting the same habitats, indicating that 764 adaptation of both groups to the Southern Ocean appear to have followed largely different 765 trajectories. However, we did find overlap in a few targets of selection, many of which were 766 genes involved in vision or membrane structure. We also observed the relative evolutionary 767 stability of the hemoglobin MN cluster in the Zoarcoidei relative to other teleosts despite these 768 species inhabiting similar environments to species with much more diverse MN clusters. For 769 AFPs, we identified the first species in the suborder Zoarcoidei with no evidence of afpIII genes 770 in the genomic region where they occur for the group (C. violaceus), likely reflecting a lineage-771 specific loss of the cluster. To further understand the evolution of polar fishes, both within and 772 across groups, more dense taxonomic sampling is needed. Snailfishes (order Scorpaeniformes; 773 family Liparidae) occur around the world, including at hadal depths, and are the most speciose 774 deep-water family in the Southern Ocean, making them another exceptional candidate for 775 comparisons to understand the general rules of Antarctic adaptation. 776 777 Beyond polar fish adaptation, the utility of long-read sequencing for understanding complex 778 regions of genomes is clear. The power of these data will increase as accuracy continues to 779 improve (e.g., circular consensus sequencing (Wenger et al., 2019) and reads lengthen. The 780 application of long-read technologies to eelpouts and other zoarcids will yield additional high-781 quality, contiguous sequences of the afpIII locus. The limiting factor for understanding questions 782 similar to those posed in this study is shifting from generating high-quality assemblies to 783 obtaining samples compatible with high-molecular weight DNA extraction and long-read 784 sequencing. We echo previous studies (e.g., Buckner