Origin and fate of supergenes in Atlantic cod

Supergenes are sets of genes that are inherited as a single marker and encode complex phenotypes through their joint action. They are identified in an increasing number of organisms, yet their origins and evolution remain enigmatic. In Atlantic cod, four large supergenes have been identified and linked to migratory lifestyle and environmental adaptations. Here, we investigate the origin and fate of these four supergenes through analysis of whole-genome-sequencing data, including a new long-read-based genome assembly for a non-migratory Atlantic cod individual. We corroborate that chromosomal inversions underlie all four supergenes, and show that they originated separately, between 0.40 and 1.66 million years ago. While introgression was not involved in the origin of the four supergenes, we reveal gene flow between inverted and noninverted supergene haplotypes, occurring both through gene conversion and double crossover. Moreover, the presence of genes linked to salinity adaptations in a sequence transferred through double crossover indicates that these sequences exchanged between the haplotypes are subject to selection. Our results suggest that the fate of supergenes is comparable to that of hybridizing species, by depending on the degree to which separation is maintained through purging of introduced genetic variation.

20M 30M 10M 20M 10M 20M LG 1 LG 2 LG 7 LG 12  Fig. 1 Four supergenes associated with large chromosomal inversions in Gadus morhua. a Migratory and stationary Gadus morhua seasonally co-occur along the coast of northern Norway and differ in total length and otolith measurements [42,59]. The distribution of stationary Gadus morhua is shaded in gray whereas the seasonal movements of migratory Gadus morhua are indicated with dark gray arrows. While the gadMor2 genome assembly [35] comes from a migratory individual, the gadMor Stat assembly presented here comes from a stationary individual. Both of these individuals were sampled at the Lofoten islands. b Pairwise genetic distance between the gadMor2 and gadMor Stat assemblies, relative to the genetic distance to an assembly for Melanogrammus aeglefinus (melAeg)[60] in a three-way whole-genome alignment. The alignment coordinates are according to the gadMor2 assembly. The four LGs 1, 2, 7, and 12 are shown as rounded horizontal bars, on which circles indicate the approximate centromere positions [48]. Supergene regions are shaded in gray, and the beginning and end of each of these regions are shown in more detail in the insets below each linkage group. Each of these insets focuses on a section of 100 kbp around a supergene's beginning or end. Shown in black above the bar representing that section is a per-SNP measure of linkage disequilibrium (LD), based on which the gray shading on the bar illustrates the beginning or the end of high LD. Drawn below the scale bar are contigs of the gadMor Stat and melAeg assemblies, in light gray and dark gray, respectively, that align well to the shown sections. The arrows indicate the alignment orientation of contigs (forward or reverse complement), and contigs are labelled with numbers as in Supplementary Table 2. In the first insets for LGs 1 and 7, vertical bars indicate inferred inversion breakpoints, which are found up to 45 kbp (Table 1) Figure 8) suggests that the actual end of the inversion region region is misplaced in the gadMor2 assembly, between positions 18,890,477 and 18,900,044, and that the region from position ∼16,800,000 and ∼18,900,000 in the gadMor2 assembly is instead located after the current position ∼26,100,000.
4 Due to repetitive sequences at the beginning of the inversion region, contigs mapping inside and outside of the region overlap between positions 18,487,151 and 18,494,225. 5 This is the end of the linkage group in the gadMor2 assembly; however, comparison with the gadMor3 assembly suggests that the region from position ∼22,600,000 to ∼23,700,000 in the gadMor2 assembly is incorrectly placed and instead located at the end of the linkage group.
the common ancestor of the genus Gadus to Arctogadus in the latter case (     Fig. 2 Divergence times and introgression among Gadinae. a Distribution ranges of sampled species of the genera Gadus, Arctogadus, and Boreogadus. Partially overlapping distribution ranges are shown in dark gray. b Species tree of the six species and three outgroups (in beige; P. virens, M. aeglefinus, and M. merlangus), estimated under the isolation-with-migration model from 109 alignments with a total length 383,727 bp. The Bayesian analysis assigned 99.7% of the posterior probability to two tree topologies that differ in the position of Arctogadus glacialis and were supported with Bayesian posterior probabilities (BPP) of 0.763 and 0.234, respectively. Rates of introgression estimated in the Bayesian analysis are marked with arrows. Thin gray and beige lines show individual trees sampled from the posterior distribution; the black line indicates the maximum-clade-credibility summary tree, separately calculated for each of the two configurations. Of G. morhua, both migratory and stationary individuals were included. c Pairwise introgression among species of the genera Gadus, Arctogadus, and Boreogadus. Introgression was quantified with the D-statistic. The heatmap shows two versions of the D-statistic, D fix and DBBAA, below and above the diagonal, respectively. d Introgression across the genome. The D-statistic is shown for sliding windows in comparisons of three species. The top and bottom rows show support for introgression between B. saida and A. glacialis and between G. morhua and G. ogac, respectively. Results are shown separately for the stationary and migratory G. morhua genomes. The mean D-statistic across the genome is marked with a thin solid line. Fish drawings by Alexandra Viertler. of introgression between Boreogadus and Arctogadus was corroborated by a tree-based equivalent 234 to Patterson's D-statistic that does not rely on the molecular-clock assumption, the D tree -statistic Newfoundland, Iceland, Lofoten, and Møre, we discriminated between "migratory" and "stationary"  gene flow after divergence (Fig. 3b, Supplementary Figure 4).

257
The genetic diversity, quantified by π [76], was comparable among the populations of both groups, LG 1 LG 2 LG 7 LG 12 0.5 myr 0.5 myr LG 1 LG 2 LG 7 LG 12 boundaries of the supergenes and stronger towards their centers, which could generate U-shaped  The region covered by the introduced sequence contains 24 predicted genes (Supplementary Table   384 20), amongst them three yolk precursor genes (vitellogenin) whose gene ontology classification  and all other SNPs with which it was found to be linked with R 2 > 0.8. We found this measure to 520 illustrate well the sharp decline of linkage at the boundaries of the four supergenes (Fig. 1). and setting all sites to missing at which one or more of the three alignment types differed. Alignment 553 sites that opened gaps in the gadMor2 sequence were deleted so that the resulting strict consensus 554 alignment retained the coordinate system of the gadMor2 assembly.

555
Based on the threeway whole-genome alignment, we calculated the genetic distance between 556 the gadMor Stat and gadMor2 assemblies, relative to the genetic distance between the melAeg and 557 gadMor2 assemblies, in sliding windows of 100,000 bp. We also used the threeway whole-genome 558 alignment to generate a mask of unreliable alignment sites, including all sites that had been set to 559 missing in the alignment.

560
Estimating divergence times of Gadinae. We estimated the divergence times among species 561 of the subfamily Gadinae with a phylogenomic approach on the basis of published age estimates for 562 two divergence events. The phylogenomic dataset used for these analyses comprised genome assem- greater than 0.2 or an entropy-like score greater than 0.5, and we excluded exon alignments with a 578 length shorter than 150 bp, more than two missing sequences, or a GC-content standard deviation 579 greater than 0.04. We then grouped exon alignments by gene and excluded all genes that 1) were 580 represented by less than three exons, 2) had one or more completely missing sequences, 3) were   [110]. We time-  for these Relate analyses was identical to the one used for the analysis with genome-wide SNPs.

759
Estimating population divergence times across the genome. In addition to the genome-wide 760 and supergene-specific SNAPP analyses that used biallelic SNPs from the entire genome or the entire 761 length of supergene regions, we also performed sliding-window SNAPP analyses across all linkage 762 groups to quantify differences in population divergence times across the genome. Our motivation for 763 these analyses was primarily to assess whether or not divergence times were homogeneous over the 764 lengths of supergenes, as differences in these divergence times within supergenes could be informative 765 both about the presence of separate inversion within these regions and about their erosion processes.

766
Additionally, we expected that these analyses could reveal further putative inversions elsewhere in 767 the genome if they should exist.

768
From the set of 20,402,423 biallelic SNPs, we extracted subsets of SNPs for each non-overlapping 769 window of a length of 250,000 bp, with a minimum distance between SNPs of 50 bp. We discarded 770 windows with less than 500 remaining biallelic SNPs and used a maximum of 1,000 biallelic SNPs 771 per window; these were selected at random if more biallelic SNPs were available per window. Input 772 files for SNAPP were then prepared as for the genome-wide and supergene-specific SNAPP analyses.

773
Per window, we performed two replicate SNAPP analyses with an initial length of 100,000 MCMC 774 iterations, and these analyses were resumed up to a maximum of 500,000 MCMC iterations as long 775 as the lowest ESS value was below 100. Windows with less than 300 sufficiently complete SNPs for or with a mean BPP node support value below 0.5 were discarded after the analysis. Per remaining

797
We applied Fisher's exact test with the algorithm "weight01" and adjusted for multiple testing, 798 considering values with a false discovery rate (FDR) below 0.05 as significant.