The pink salmon genome: uncovering the genomic consequences of a strict two-year life-cycle

Pink salmon (Oncorhynchus gorbuscha) adults are the smallest of the five Pacific salmon native to the western Pacific Ocean. Pink salmon are also the most abundant of these species and account for a large proportion of the commercial value of the salmon fishery worldwide. A strict two-year life-history of most pink salmon generates temporally isolated populations that spawn either in even-years or odd-years. To uncover the influence of this genetic isolation, reference genome assemblies were generated for each year-class and whole genome re-sequencing data was collected from salmon of both year-classes. The salmon were sampled from six Canadian rivers and one Japanese river. At multiple centromeres we identified peaks of Fst between year-classes that were millions of base-pairs long. The largest Fst peak was also associated with a million base-pair chromosomal polymorphism found in the odd-year genome near a centromere. These Fst peaks may be the result of centromere drive or a combination or reduced recombination and genetic drift, and they could influence speciation. Other regions of the genome influenced by odd-year and even-year temporal isolation and tentatively under selection were mostly associated with genes related to immune function, organ development/maintenance, and behaviour.

227 some analyses, which are sensitive to linkage disequilibrium. Variants were filtered if heterozygous 228 allele counts were not evenly represented -also known as allele balance (minor allele count < 20% of 229 the major allele count, see (77) for python script). Variants in linkage disequilibrium were thinned 230 using BCFtools (79) (parameters: +prune, -w 20kb, -l 0.4, and -n 2). Custom scripts, bwa mem, and 231 Samtools index were used to map the variants to different genome assemblies (80). As clustering techniques are sensitive to linkage disequilibrium, we used variants that were 248 hard-filtered (including for allele balance) and filtered for linkage disequilibrium for all population 249 structure analyses. A DAPC analysis (81) was used to cluster individuals in R (82) using the following 250 packages: adegenet (83), vcfR (84), and ggplot2 (85). The number of DAPC clusters was determined 251 using the find.clusters function and choosing the cluster count with the lowest Bayesian information 252 criterion. Thirty principal components were retained with the dapc function. The variants used for the 253 DAPC analysis were not yet mapped to chromosomes.

254
To complement the DAPC analysis, we also performed an Admixture (version 1. 3.0) analysis 255 (86) to identify clusters of individuals and quantify the admixture between the identified groups. To 256 format the linkage disequilibrium thinned .vcf file, we used a custom Python script to rename the 257 chromosomes to numbers (77) and PLINK (version 1.90b6.15) (87,88) was used to generate .bed files 258 (parameters: --chr-set 26 no-xy, --double-id). PLINK was also used to generate a principal components 259 analysis. The Admixture software was then used to identify the optimal cluster number based on the 260 lowest cross-validation error value. The admixture values from this analysis were plotted in R.

261
To examine population structure based on the mitochondrion sequence, we generated a 262 phylogenetic tree based on full mitochondria sequences. The genome assembly included a 263 mitochondrion sequence, and this region of the genome was subset from the variant file using vcftools.  294 To identify regions of the genome associated with population structure identified in the DAPC 295 analysis, we performed an eigenGWAS analysis (102). The format of the hard-filtered variants was 296 converted to the appropriate format in PLINK, and the GEAR (103) software was used to run the 297 eigenGWAS analysis (this was performed on a slightly different version of the genome assembly 298 available on the NCBI, but only positions on chromosome 9 were minimally affected). Significance 299 was corrected for using the genomic inflation factor to better identify markers potentially under 300 selection rather than a result of genetic drift between populations. The genomic inflation factor 301 corrected p-values were then plotted in R using the qqman (104) and stringr (105) 308 We utilized a genome-wide association (GWA) of phenotypic sex to identify the region of the 309 genome associated with sex for all pink salmon (individual year-classes were checked as well). This 310 analysis was also used to identify where the contig from the genome assembly with the sdY gene 311 should be placed. This was confirmed with synteny from the rainbow trout Y-chromosome 312 (NC_048593.1) and manual inspection of the Hi-C data (it was placed in the even-year genome 313 assembly). The GWA analyses were performed using PLINK (parameters: --logistic --perm). Synteny 314 was identified from alignments to the rainbow trout genome assembly (GCF_013265735.2, (63)) using 315 CHROMEISTER (106)

333
The odd-year assembly had 26 linkage groups and extensive homeologous regions between 334 chromosomes (Fig 1). The odd-year genome assembly contained similar levels of repetitive DNA and 335 duplicated regions compared to other salmonids (Fig 1, (70,77,109)). Like other salmon species, 336 increased sequence similarity was also observed at telomeres between duplicated chromosomal arms 337 (Fig 1). Peaks of increased Fst between odd and even-year lineages were commonly found at putative 338 centromere locations (Fig 1, Table 1).   356 A shared allele analysis (Fig 2) and both Admixture and DAPC analyses (Fig 3) revealed a clear 357 delineation between odd and even-year lineages. Parent-progeny and sibling relationships 358 (relationships known during sampling) are highlighted by increased levels of shared alleles, but the 359 majority of clustering appears to be related to geographical distance (Fig 2, S1 File). No apparent 360 admixture was observed in the even-year class (Fig 3B). In the odd-year lineage, estimated ancestry 361 from the even-year group varied from zero to over forty percent (Fig 3B).  (Figs 4A and B). The even-year salmon had two major haplotypes (Fig 4B). Mitochondrial 378 sequence analyses revealed 21 unique haplotypes from the 61 mitochondria sequences with 1-19 steps 379 between haplotypes (Fig 4). Based on the length of the sequence analyzed (16,822 bp) this represents a 380 mutation frequency between 0.006% to 0.1%. One haplotype was shared between lineages and the 381 closest haplotype that was not shared had 5 steps between year-classes (Fig 4). The mitochondrial 382 analyses illustrate divergence between the odd and even-year lineages, but also raises questions 383 regarding possible recent admixture based on a shared haplotype and an odd-year haplotype most 384 closely related to an even-year haplotype.  (Table 2). These values 401 varied based on sample inclusion and filtering parameters used for filtering nucleotide variants (Table  402 2). The average percent of shared alleles among odd-year fish was 76.13%, 74.42% among even-year 403 individuals, and 71.04% between year-classes (S1 File). Most analyses revealed increased genetic 404 diversity among odd-year pink salmon than among even-year pink salmon and fewer shared alleles 405 between odd and even-year populations than within year-class.
406  411 We identified regions of the genome with divergence between odd and even-year lineages using 412 an eigenGWAS and Fst analysis. Seventeen significant regions of the genome were discovered to 413 contribute to the divergence between odd and even-year lineages (Fig 5, Table 3). These regions are 414 putatively under selection as genetic drift is partially accounted for through the genomic inflation 415 factor. Multiple candidate genes under selection were identified in these regions (Table 3). 422 Table 3. Top eigenGWAS peaks identified between lineages.

H2-Aa
LG19_El20. LG25_El23.1-24.1 37126251-37202999 37137812 uncharacterized gene (ncRNA) 423 All positions are relative to the odd-year genome. 424 *associated with an Fst peak 425 426 Seven of the eight largest Fst peaks between year-classes were located in the vicinity of a 427 centromere (Fig 1, Table 1). More detail is presented on one of the largest Fst peaks. This peak is also 428 associated with a large deletion or fusion. The Fst peak on LG15_El12.1-15.1 (Fig 6A) is in Hardy-429 Weinberg equilibrium in the odd-year lineage (p=0.984 with a chi-square test), but fixed in the even-430 year lineage (Figs 6B and 6C). When Oxford Nanopore reads from the two year-classes were aligned 431 to the genome assembly, a heterozygous deletion or fusion from 51,670,144 -52,926,328 was found in 432 this region of the odd-year male used for genome assembly (S1 Fig.). The ~1.2 Mbp deletion/fusion 433 may explain why the LG15_El12.1-15.1 Fst peak was one of the largest and widest (Fig 1, S1 Fig.).  (66), but further research will be 446 needed to confirm this hypothesis. From these analyses, many highly divergent regions of the genome 447 were identified, either from selection or from genetic isolation/population dynamics. The largest 448 reservoirs of divergence between odd and even-year classes appears to be associated with centromeres, 449 but not exclusively and uncommonly for regions putatively under selection (Table 3). The sex-determination gene in salmonids, sdY (111), was located on a ~110 kbp contig in the 453 pink salmon odd-year genome assembly (NCBI accession: JADWMN010014055.1) and on a contig 454~367 kbp that was placed onto a chromosome in the even-year genome assembly. The sdY gene can 455 be placed at one of the ends of LG20_El14.2 by using genome-wide association with sex as the trait of 456 interest, Hi-C contact data (even-year genome), and synteny with the rainbow trout Y-chromosome and 457 chromosome 29 (an autosome) of the coho salmon (Fig 7A, S1 and S3 Figs.).
LG20_El14.2, has the 458 reverse orientation in the odd-year assembly compared to the genetic map (Fig 1), but was corrected to 459 have the same orientation in the even-year assembly.  Fig.). Genotypes were called for the majority of this region for males and females, 474 and the main difference related to sex was that all females had large runs of homozygosity while many 475 males had large runs of heterozygosity (S4 Fig, S1 File).

476
From previous research (112,113), a pseudo growth hormone 2 gene was shown to be tightly 477 linked to sex-determination in pink salmon. Four tandem duplicates of this gene (NCBI: DQ460711.1) 478 were identified on the same contig in the even-year genome assembly, but only two copies were found 479 in the odd-year genome assembly on separate contigs (S1 File). As these contigs were not mapped to a 480 chromosomal position, it is likely that parts of the Y-chromosome specific region remain incomplete in 481 these two assemblies.

482
There were two sdY haplotypes (variants found in non-coding DNA) observed in both odd and 483 even male pink salmon (Fig 7B, S1 File). Additionally, some males possessed multiple copies of the 484 sdY gene (Fig 7B). The CGGA sdY haplotype was only identified in a single odd-year male pink 485 salmon, while the TTAC haplotype was evenly distributed between year-classes (S1 File).

486
Based on manual inspection of the genotypes, long stretches of heterozygosity were observed 487 near the sdY gene in some males, but not in others. In males with the TTAC sdY haplotype, there were 488 extended or short runs of heterozygosity evenly distributed between year-classes (S1 File). Even-year 489 males with the TTAC sdY haplotype and a short run of heterozygosity were more likely to have 490 multiple copies of the sdY gene (n=4, average=2.7) than the same group with long runs of 491 heterozygosity (n=4, average=0.9, p=0.017 with one-tailed, unpaired t-test  (23,54), pink salmon population structure divergence was found at 499 the whole-genome level to be greater between year-classes rather than based on geography. Shared

514
The low nucleotide diversity of mitochondrial haplotype networks and the increase of rare 515 haplotypes have led previous studies to conclude that pink salmon (with some local exceptions) have 516 undergone a bottleneck during the Pleistocene interglacial period and rapid expansion since the last 517 glacial maximum or earlier (33,34,114). The interconnected mitochondrial networks in these studies 518 have inner shared haplotypes between year-classes. Churikov and Gharrett (2002) suggested that these 519 observations supported a model where a year-class might go extinct and an alternate year-class would 520 then replace that population rather than continued gene flow between year-classes that would be 521 necessary to otherwise explain the shared haplotypes (incomplete lineage sorting was tested) (33). The 522 mitochondrial network seen in this study is consistent with that hypothesis. An alternative hypothesis 523 is that environmental factors influence maturation timing and the strict two year life-cycle of pink 524 salmon, and gene flow between year-classes only occurs when environmental conditions favour 525 changes to the two year life-cycle, as that seen in the introduction of pink salmon to the Great Lakes 526 (7,35,36).

527
Estimates of divergence based on mitochondrial sequences suggest that odd and even-year 528 lineages (from East Asia and Alaska) are relatively recent for pink salmon as a species (generally less 529 than 1 million years ago) and divergence likely began during the Pleistocene interglacial period or later 530 (22,33,34). If the two-year life-cycle is environmentally influenced, these estimates could be distorted 531 by phases of gene-flow and would suggest that the interglaciation period was the last major period of 532 gene-flow between odd and even-year classes (but does not necessarily mean that was when the strict 533 two year life-cycle evolved).

534
It has previously been reported that the odd-year lineage of pink salmon has higher levels of 535 heterozygosity, private alleles, and allelic richness (23,54). 541 Further sampling will be required to understand if this phenomenon is seen in all even-year populations 542 (especially as lower heterozygosity in the even-year lineage is not universally supported, e.g. (20)

553
Interestingly, pink salmon runs of homozygosity/heterozygosity were common at centromeres 554 rather than an effect of chromosomal polymorphisms. Six other major runs of 555 homozygosity/heterozygosity were also located near centromeres and they differed between lineages.
556 All of these Fst peaks extend for at least 1 Mbp. It is expected that regions with reduced 557 recombination, such as centromeres, will have increased runs of homozygosity and reduced genetic 558 diversity (reviewed in (115)). This may help explain why there are long runs of homozygosity at 559 centromeres, but not why there are differences between lineages at these loci. Genetic drift or selection 560 such as centromere drive would also need to be considered.

561
The centromere drive hypothesis posits that a centromere can be retained in a female gamete 562 more often than an alternative centromere during meiosis due to an advantageous DNA sequence 563 mutation at the centromere or from mutations in centromere associated proteins (reviewed in (116-564 118)). In populations that become isolated, the competition between centromere sequences can quickly 565 drive differentiation at these regions between the populations and result in hybrid defects should they 566 come into contact again (117). These observations reveal that the pink salmon lineages may be at a 567 point where speciation is a likely outcome as these large centromere differences could cause hybrid 568 defects. In medaka, genomic diversity at non-acrocentric repeats in centromeres were associated with 569 speciation (119).