Abstract
Sea turtles represent an ancient lineage of marine vertebrates that evolved from terrestrial ancestors over 100 MYA, yet the genomic basis of the unique physiological and ecological traits enabling these species to thrive in diverse marine habitats remains largely unknown. Additionally, many populations have declined drastically due to anthropogenic activities over the past two centuries, and their recovery is a high global conservation priority. We generated and analyzed high-quality reference genomes for green (Chelonia mydas) and leatherback (Dermochelys coriacea) turtles, representing the two extant sea turtle families (MRCA ∼60 MYA). These genomes are highly syntenic and homologous, but localized non-collinearity was associated with higher copy numbers of immune, zinc-finger, or olfactory receptor (OR) genes in green turtles, and ORs related to waterborne odorants were greatly expanded in green turtles. These findings suggest that divergent evolution of these key gene families may underlie immunological and sensory adaptations assisting navigation, occupancy of neritic versus pelagic environments, and diet specialization. Reduced collinearity was especially prevalent in microchromosomes, with greater gene content, heterozygosity, and genetic distances between species, supporting their critical role in vertebrate evolutionary adaptation. Finally, diversity and demographic histories starkly contrasted between species, indicating that leatherback turtles have had a low yet stable effective population size and extremely low diversity compared to other reptiles, and a higher proportion of deleterious variants, reinforcing concern over their persistence under future climate scenarios. These genomes provide invaluable resources for advancing our understanding of evolution and conservation best practices in an imperiled vertebrate lineage.
Statement of significance
Sea turtles represent a clade whose populations have undergone recent global declines. We analyzed de novo genomes for both extant sea turtle families through the Vertebrate Genomes Project to inform their conservation and evolutionary biology. The highly conserved genomes were largely differentiated by localized gene-rich regions of divergence, particularly in microchromosomes, suggesting that these overlooked genomic elements may play key functional roles in sea turtle evolution. We further demonstrate that dissimilar evolutionary histories impact standing genomic diversity and genetic load, and are critical to consider when using these metrics to assess adaptive potential and extinction risk. Examination of these relationships may be important to reveal drivers of adaptation and diversity in sea turtles and other vertebrates with conserved genome synteny.
Introduction
Sea turtles recolonized marine environments over 100 MYA (1, 2) and are now one of the most widely distributed vertebrate groups on the planet (3). Leatherback turtles (Dermochelys coriacea) represent the only remaining species of the family Dermochelyidae, which diverged from the Cheloniidae (hard-shelled sea turtles) about 60 MYA (4). Unique morphological (Fig. 1a) and physiological traits allow leatherback turtles to exploit cool, highly productive pelagic habitats (5, 6), while green turtles (Chelonia mydas) and other hard-shelled chelonid species largely inhabit warmer nearshore habitats following an early pelagic life stage. Most previous research in this group has focused on organismal and ecological adaptations (7), but the genomic basis of traits that differentiate or unite these species is not well understood.
Anthropogenic pressures have caused substantial population declines in sea turtles, with contemporary populations currently representing mere fractions of their historical abundances (8, 9). Although sea turtles spend most of their life in the ocean, they also exhibit long-distance migrations to natal rookeries for terrestrial reproduction (7, 10, 11). Consequently, they are threatened by human activities in both terrestrial and marine environments, including direct harvest of meat and eggs (12), fisheries bycatch (13), coastal development (14, 15), pollution (16), disease (17), and climate change (18, 19), which is exacerbated by their temperature-dependent mechanism of sex determination (TSD) altering population dynamics (20, 21). The IUCN lists most sea turtle species as vulnerable or endangered, and while decades of conservation efforts have fueled positive trends for some populations (22), others continue to decline (23). In particular, leatherback turtles have undergone extensive declines (>95% in some populations) over the last century (24-27), including the extirpation of the Malaysian nesting population (28). Leatherback turtle recovery is also impeded by relatively low hatching success compared to other sea turtle species (29). In contrast, many green turtle populations have recently increased following conservation actions (22), but their continued recovery remains threatened by anthropogenic activities and high incidence of the neoplastic disease fibropapillomatosis (FP) (30).
Genomic data have been instrumental in advancing understanding of species’ evolutionary histories and ecological adaptations (31-33), and providing critical information for conservation management (34-37). However, this research has been hampered in taxa where genomic resources remain limited. In particular, the lack of high-quality reference genomes, which are essential for accurate comparative evolutionary analyses (38, 39) and estimates of a wide range of metrics to inform conservation biology (36, 40), impedes this work in many threatened species. An early draft genome for the green turtle was assembled almost a decade ago (41), and provided important insights into turtle evolution. However, errors, gaps, and misassemblies in draft genomes can lead to spurious inferences, potentially masking signals of interest (38, 42). Well-annotated, chromosomal-level reference genomes can resolve these issues, improving our understanding of genomic underpinnings of ecological and evolutionary adaptations (39, 43). For example, high-quality genomes with accurate annotations have enabled examination of gene changes associated with recolonization of the marine environment by terrestrial vertebrates, including the loss of olfactory receptor (OR) gene families (32, 44). Comparative genomic analyses also demonstrated adaptive diversity in genes underlying reptilian immunity (45), with high-quality genomes providing insights into disease susceptibility (33, 46, 47). This is critical for sea turtles, with diseases such as FP adversely impacting populations across the globe (30), and information on immune genes key for devising effective conservation strategies (48). The contiguity of high-quality genomes is also invaluable for conservation-focused analyses, especially runs of homozygosity (ROH) and genetic load that provide insights into population demography and inbreeding depression, and are difficult to accurately quantify with fragmented, poorly-annotated genomes (49).
Here, we assembled chromosome-level reference genomes with species-specific annotations for leatherback and green turtles as part of the Vertebrate Genomes Project (VGP) to facilitate critical research centered around sea turtle evolutionary history and conservation. We conducted comparative analyses to explore the genomic basis of their shared and unique phenotypic traits, and compared their chromosomal organization, genomic diversity, and demographic histories. These genomes represent two of the most contiguous reptilian genomes assembled to date, providing invaluable resources for ongoing investigations into conservation and adaptation for this imperiled vertebrate lineage.
Results
Genome quality
The reference genomes of the leatherback and green turtles were generated using four genomic technologies following the VGP pipeline v1.6 (39), with minor modifications (see Methods). A total of 100% of the leatherback and 99.8% of the green turtle assembled sequences were placeable within chromosomes. The assembled genomes were near full-length (∼2.1 Gb), with annotations of all 28 known chromosomes for both species, composed of 11 macrochromosomes (>50 Mb) and 17 microchromosomes (<50 Mb) (Tables 1 & S1, Fig. S1). These genomes are among the highest quality genomes assembled for non-avian reptiles to date in terms of both contiguity and completeness (Table S2), with the leatherback turtle assembly representing the first reptile genome where all scaffolds have been assigned to chromosomes. Scaffold N50s were high for both genomes (Table 1). We annotated 18,775 protein-coding genes in the leatherback turtle genome and 19,752 in the green turtle genome (see below for analysis of gene number differences). Of these, 96.9% and 97.5%, for leatherback and green turtles respectively, were supported over 95% of their length from experimental evidence and/or high-quality protein models from related species (see Methods). The number of protein-coding genes falls within the range for other reptilian genomes (Table S2) and includes 97.7% and 98.2% complete BUSCO copies when using Sauropsida models for leatherback and green turtles respectively (50), which are similar or higher proportions than all other assembled reptilian genomes to date (Fig. S2).
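For readers less familiar with the contiguity metric cited above, scaffold N50 is the length L such that scaffolds of length L or longer together account for at least half of the assembly. The Python sketch below illustrates the standard definition only; it is not part of the VGP pipeline, the function name is ours, and the scaffold lengths are toy values rather than those in Table 1.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    cover at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths in Mb (hypothetical, not taken from Table 1).
toy_lengths = [350, 270, 210, 140, 130, 60, 40]
print(n50(toy_lengths))
```

Because the metric depends only on the sorted length distribution, it can be computed from a list of scaffold sizes without access to the sequences themselves.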
Genome architecture
Despite diverging over 60 million years ago (4), leatherback and green turtles have extremely high genome synteny and collinearity (Figs. 1b,c, S3, S4). After multiple rounds of manual curation to correct any artifacts of misassemblies, only a few larger structural rearrangements remained, including inversions of up to 7 Mb on chromosomes 12, 13, 24 and 28 (Fig. S3). The high collinearity between the two genomes included near complete end-to-end contiguous synteny for nine out of 28 chromosomes (Fig. S3). The remaining 19 chromosomes exhibited at least one small region of reduced collinearity (RRC) between the species, with RRCs representing a total of ∼83.4 Mb (∼3.9%) and ∼110.5 Mb (∼5.2%) of the leatherback and green turtle genome lengths, respectively. Eight chromosomes exhibited small RRCs (between 0.1–3 Mb), and 11 contained RRCs that were between 3–18 Mb in length (Figs. 2a-d & Table S3). Analyses of coding regions revealed a similar pattern of high collinearity between the two species at the gene level (Figs. 1c & S3), particularly within the macrochromosomes that contain more than 80% of the total length of the genomes.
The two genomes displayed similar percentages of repetitive elements (REs; 45.8% and 44.4%, respectively; Fig. S5 & Table S4), which were almost exclusively transposable elements (TEs; 30.5% and 27.4%) and unclassified repeats (14.6% and 16.5%, respectively). While both genomes carry similar proportions of REs, the leatherback turtle genome exhibited relatively longer TEs across all but two chromosomes when compared to the green turtle (Fig. S6a). The landscape of TE superfamily composition over evolutionary time is generally similar between the two species (Fig. S5), and consistent with other reptiles (51, 52). One striking difference, however, is seen in REs with low Kimura values (<5%), which appeared at much higher frequency in the leatherback turtle genome (Fig. S5), representing either relatively recent insertions or reflecting lower mutation rates in this species.
Gene families and gene functional analysis
Gene function analysis of localized RRCs revealed that most contain genes with higher copy numbers in the green turtle compared to the leatherback (Fig. 2a-d, Table S3). Among the RRCs in the 19 chromosomes that had higher gene copy numbers in the green turtle, ten contained genes associated with the immune system, olfactory reception, and/or zinc-finger proteins. In addition to localized RRCs, higher gene copy numbers in the green turtle occurred in many gene orthologous groups (orthogroups) across the entire genome, and generally in variable multicopy genes (Fig. 2f, g). Copy number variation accounted for most of the nearly one thousand more genes annotated in the green turtle genome relative to the leatherback (Fig. 2f, g; Table 1). We detected no evidence of collapsed multicopy genes in the leatherback turtle assembly across multiple analyses (see Methods), supporting this as a biological signal rather than a technical artifact of assembly.
Olfactory receptors (ORs) represented the largest orthogroups in both genomes, and differences in copy numbers were tightly connected to observed RRCs. All OR Class I genes were clustered at the beginning of chromosome 1, and the green turtle had higher copy numbers in this region (Fig. 2a-d). This area also contains a cluster of OR Class I genes in at least three additional testudinid species, and is the only divergent region across the very large chromosome 1 in the turtles analyzed (Fig. S7). In contrast, OR Class II genes were spread across several chromosomes in both sea turtle species, but again higher copy numbers in the green turtle were all found within RRCs (Fig. 2b-d). The instability and rapid evolution of OR gene numbers in turtles is further illustrated in the expansion-contraction analysis of orthogroups (Fig. 2e, Table S6a-d), which showed that OR Class I genes underwent a modest contraction in the ancestral sea turtle lineage, followed by an expansion in the green turtle but a further contraction in the leatherback turtle. Similar trends were detected for OR Class II genes, but with a greater magnitude of contraction in the ancestral sea turtle lineage followed by a further contraction for the leatherback turtle and only a small expansion for the green turtle.
A second important RRC encompassed the major histocompatibility complex (MHC), which plays a critical role in vertebrate immunity and has particularly strong relevance to sea turtle conservation due to the threat of FP and other diseases (32). In addition to the MHC genes, this RRC (RRC14) includes several copies of OR Class II genes, zinc-finger protein genes and other genes involved with immunity, such as butyrophilin subfamily members and killer cell lectin-like receptors (Fig. 2d, Table S12).
Invariably, the green turtle carried higher numbers of all the multicopy genes present in RRC14. RRCs on other chromosomes also showed increased levels of zinc-finger protein genes in the green turtle, including the RRCs labeled 6A, 11A, 14A, and 28 (Table S3). In particular, zinc-finger protein genes were highly prevalent on chromosomes 14 and 28 in both sea turtles, representing more than 50% of all the protein domains present on these chromosomes (Fig. S8).
Finally, given the critical importance of understanding the underlying mechanism of temperature-dependent sex determination (TSD) in the face of climate change, we analyzed genes known to be associated with TSD across reptiles. Almost all 216 genes previously implicated in male- or female-producing pathways in reptilian species with TSD were single-copy genes in both sea turtle species (Table S7; 210 genes per species). Only three genes (MAP3K3, EP300, and HSPA8) were duplicated in both genomes, with the copies located on different chromosomes in all cases. Moreover, homologous genes were generally located in the same region of the genomes for both species (Fig. S9), and missing genes were typically absent in both species, with only four genes found in one species but not the other (Table S7).
Macro and microchromosomes
Microchromosomes contained higher proportions of genes than macrochromosomes (Fig. 3a,b), and gene content was strongly positively correlated with GC content (Fig. S10). These patterns were particularly apparent in small (<20 Mb) microchromosomes, where GC content reached 50%, compared to the 43–44% genome-wide averages. Within chromosome groups, larger proportions of multicopy genes were generally associated with higher total gene counts, and chromosomes with the highest numbers of multicopy genes had increased proportions of RRCs (Fig. 3a,b).
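The GC content values compared above are simple base-composition fractions. As a minimal sketch, assuming GC content is computed over unambiguous bases only (so that N-filled gaps do not dilute the estimate), it could be calculated as follows. The function name and the toy sequences are illustrative and do not come from our pipeline.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous (A, C, G, T) bases.

    Ambiguous bases such as N are ignored, which is an assumption of
    this sketch rather than a documented choice in the Methods.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return float("nan")  # no callable bases in this sequence
    return gc / (gc + at)


# Toy sequences (hypothetical): a GC-balanced one and a gap-containing one.
print(gc_content("GGCCAATT"))
print(gc_content("ggccNNNN"))
```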
Mean genetic distances for single-copy regions between the two sea turtles were also higher in small microchromosomes (0.053) compared to both intermediate (>20 Mb) microchromosomes (0.047), and macrochromosomes (0.045) (Fig. 3c). However, examination of intermediate microchromosome and macrochromosome RRCs revealed elevated genetic distances in these regions that approached values observed in small microchromosomes (Table S8). Genetic distances were also positively correlated with heterozygosity, which was higher in small microchromosomes for both species (Figs. 3d & S11-13).
Genome diversity
Although both species displayed similar patterns of higher heterozygosity in microchromosomes than macrochromosomes (Figs. 3d & S11-13), they differed in the genome-wide nucleotide diversity level by almost an order of magnitude (repeat masked π = 3.19 × 10⁻⁴ for the leatherback and 22.2 × 10⁻⁴ for the green turtle; Fig. S12, Table S9). Exonic regions exhibited lower levels of heterozygosity than non-coding regions (Fig. 4a, Table S9), with a greater reduction in heterozygosity within green turtle exons (∼20%) than leatherback turtle exons (∼10%) compared to genome-wide levels. In addition, the percentage of 100 Kb windows containing zero heterozygous sites was higher in the green turtle (6.60%) than the leatherback turtle (2.87%), suggesting that although diversity was lower overall in the leatherback turtle, it was more evenly spread across the genome than in the green turtle. Using a standardized heterozygosity pipeline (see Methods; Fig. 4b), we found the genomic diversity of the leatherback turtle was substantially lower than almost all other reptiles examined, including Chelonoidis abingdonii, where low diversity has been considered a contributing factor to its extinction (53), and only the critically endangered Chinese alligator (Alligator sinensis) showed lower diversity (54). In contrast, the genomic diversity of the green turtle fell in the mid-range for reptiles, and in the mid-range of values from similar analyses conducted on mammals (55, 56).
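The windowed statistic reported above (the share of 100 Kb windows with no heterozygous sites) can be illustrated with a short Python sketch. It assumes non-overlapping windows and 0-based positions of heterozygous sites on a single chromosome; the function names and the toy coordinates are hypothetical, and the actual analysis is described in the Methods.

```python
def windowed_het_counts(het_positions, chrom_length, window=100_000):
    """Count heterozygous sites in non-overlapping windows of one chromosome.

    het_positions: 0-based coordinates of heterozygous sites.
    The final window may be shorter than `window`.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in het_positions:
        if not 0 <= pos < chrom_length:
            raise ValueError(f"position {pos} is outside the chromosome")
        counts[pos // window] += 1
    return counts


def fraction_zero_windows(counts):
    """Proportion of windows that contain no heterozygous sites."""
    return sum(1 for c in counts if c == 0) / len(counts)


# Toy example (hypothetical): a 500 Kb chromosome with four heterozygous sites.
toy_counts = windowed_het_counts([10, 50_000, 120_000, 450_000], 500_000)
print(toy_counts, fraction_zero_windows(toy_counts))
```

A higher fraction of empty windows at a similar or higher genome-wide diversity, as seen in the green turtle, indicates that heterozygosity is distributed less evenly along the genome.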
Finally, we identified high-diversity exonic regions using multiple approaches (see Methods), and found that many contained immune, OR, and zinc-finger protein genes in both species, but especially for the green turtle which showed a greater number of high-diversity windows (Fig. 4c; Table S10). Given the striking similarity to the RRC analysis results above, these findings independently reinforced the importance of these gene families in the divergent evolution of these species.
Runs of homozygosity (ROH)
The leatherback turtle had a greater number of ROHs (>100 Kb) compared to the green turtle (NROH = 2,045 and 873, respectively), as well as higher accumulated length and proportion of the genome in ROH (SROH = 400.61 Mb (18.51% of genome) and 327.06 Mb (15.53%), respectively). The average length of ROHs was generally shorter in the leatherback turtle (LROH = 196 Kb and 375 Kb for the leatherback and green turtles, respectively; Fig. 4d), with the accumulated length of short (<500 Kb) ROHs highest when compared to medium (500 Kb-1 Mb) and long ROHs (>1 Mb) (Fig. 4d). The leatherback turtle genome only showed one ROH that was greater than 1 Mb in length, suggesting that recent bottlenecks or inbreeding are unlikely and that this species has instead maintained long-term low diversity. In contrast, the green turtle had 54 ROHs longer than 1 Mb, suggestive of a possible more recent population bottleneck or inbreeding events. The average lengths of ROHs were also higher in macrochromosomes than microchromosomes (Fig. S15).
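The three ROH summaries above are related by LROH = SROH / NROH, and the reported values are consistent with this (400.61 Mb / 2,045 ≈ 196 Kb; 327.06 Mb / 873 ≈ 375 Kb). The Python sketch below shows how these summaries and the short, medium, and long length classes could be derived from a list of ROH lengths. It is an illustration under stated assumptions rather than the analysis code: the function name, the FROH key (proportion of the genome in ROH), the handling of the class boundaries, and the toy lengths are ours.

```python
def summarize_roh(lengths_bp, genome_bp, min_len=100_000):
    """Summarize runs of homozygosity from their lengths in base pairs.

    Only ROHs longer than `min_len` are kept. Length classes follow the
    text: short (<500 Kb), medium (500 Kb-1 Mb), long (>1 Mb); placing
    the exact boundary values in 'medium' is an assumption.
    """
    kept = [length for length in lengths_bp if length > min_len]
    class_bp = {"short": 0, "medium": 0, "long": 0}
    for length in kept:
        if length < 500_000:
            class_bp["short"] += length
        elif length <= 1_000_000:
            class_bp["medium"] += length
        else:
            class_bp["long"] += length
    total = sum(kept)
    return {
        "NROH": len(kept),                             # number of ROHs
        "SROH": total,                                 # accumulated length
        "LROH": total / len(kept) if kept else 0.0,    # mean length
        "FROH": total / genome_bp,                     # genome proportion
        "class_bp": class_bp,
    }


# Toy ROH lengths (hypothetical) on a 10 Mb genome; the 80 Kb run is filtered.
toy = summarize_roh([150_000, 300_000, 600_000, 1_200_000, 80_000], 10_000_000)
print(toy)
```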
Genetic load
Coding region variants of the leatherback turtle genome were found more likely to be impactful, with 0.10% and 0.07% of variants predicted to cause ‘high impacts’ (e.g., stop-codon gain or loss) for the leatherback and green turtles, respectively, and with ‘moderate’ and ‘low’ impact variants also higher in the leatherback turtle (Fig. 4e). Additionally, the missense to silent mutation ratio was higher in the leatherback (0.89) than green turtle (0.65), again indicating that genetic load is higher for the leatherback turtle. High-impact variants predicted by snpEff only occurred in one species for any given gene. The 103 and 357 nucleotide variants characterized as ‘high’ impact in the leatherback and green turtle were found within 59 and 171 unique genes, respectively. The functions of these genes were variable (Fig. S16). For the leatherback turtle, many of the genes impacted were linked to cell transport and demethylation, with DEAH-box helicase 40 and a Hsp40 family member also impacted (Table S13). In contrast, for the green turtle, many of the genes were linked to immunity, including an MHC class I alpha chain gene, as well as B-cell receptors and killer-cell receptors. The green turtle also showed putative high impact variants within several OR genes.
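To make the two genetic load summaries above concrete, the following Python sketch tallies the fraction of high-impact variants and the missense to silent ratio from a list of annotated variants. It assumes each variant has already been reduced to an (impact, effect) pair using snpEff-style impact categories and Sequence Ontology effect terms; the function name, the choice of denominator (all variants supplied), and the toy data are assumptions of this sketch, and no snpEff output is parsed here.

```python
from collections import Counter


def genetic_load_summary(variants):
    """Summarize annotated variants given as (impact, effect) pairs.

    impact: e.g. "HIGH", "MODERATE", "LOW", "MODIFIER".
    effect: e.g. "missense_variant", "synonymous_variant".
    """
    impacts = Counter(impact for impact, _ in variants)
    effects = Counter(effect for _, effect in variants)
    total = sum(impacts.values())
    silent = effects["synonymous_variant"]
    nan = float("nan")
    return {
        # Share of all supplied variants annotated as high impact.
        "high_fraction": impacts["HIGH"] / total if total else nan,
        # Missense (amino acid changing) over silent (synonymous) variants.
        "missense_to_silent": effects["missense_variant"] / silent if silent else nan,
    }


# Toy annotations (hypothetical): 1 high, 3 missense, 4 silent, 2 modifier.
toy_variants = (
    [("HIGH", "stop_gained")]
    + [("MODERATE", "missense_variant")] * 3
    + [("LOW", "synonymous_variant")] * 4
    + [("MODIFIER", "intron_variant")] * 2
)
print(genetic_load_summary(toy_variants))
```

A ratio closer to 1, as reported for the leatherback turtle, means that amino acid changing variants are nearly as common as silent ones among segregating coding variants.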
Demographic history
Pairwise Sequentially Markovian Coalescent (PSMC) analyses indicated different historical effective population sizes (Ne) between the two sea turtle species (Fig. 4f). The results indicate that the Ne for the leatherback turtle has been relatively small and sustained, ranging in size from approximately 2,000 to 21,000 over the last 10 million years, and at the lower end of this range for the last 5 million years. In comparison, the green turtle has experienced wider population fluctuations and a relatively higher overall Ne, which fluctuated between approximately 44,000 and 83,000. While the Ne for the leatherback turtle is relatively low, it showed signs of increasing abundance prior to the Eemian warming period (Fig. 4f [H]), with a subsequent decrease during this period until the last glacial maximum (LGM). In contrast, the green turtle had three distinct peaks in Ne (Fig. 4f), potentially associated with ocean connectivity changes related to the closure of the Tethys Sea [A], the Pleistocene period [B], and a more pronounced peak that aligns with later marked temperature fluctuations [C]. We observed similar patterns for PSMC analyses conducted on additional individuals for both species (see Supplementary Appendix I).
Discussion
Divergence in localized RRCs and microchromosomes amidst high global genome synteny
The lineages leading to leatherback and green turtles diverged over 60 MYA (4), giving rise to species that are adapted to dissimilar habitats, diets, and modes of life. Despite high overall levels of genome synteny across both the macro- and microchromosomes between the sea turtle families, RRCs and small microchromosomes were particularly associated with high concentrations of multicopy gene families, as well as heightened genomic diversity and genetic distances between species, suggesting that these genomic elements may be important sources of variation underlying phenotypic differentiation. Though our results here do not demonstrate direct causality, we have identified candidate regions and gene families that can be targeted in further studies quantifying evidence for positive selection and their roles in sea turtle adaptation and speciation.
The high global stability of both macro- and microchromosomes between sea turtle families also aligns with recent work showing similar patterns across reptiles including birds, emphasizing the important roles of microchromosomes in vertebrate evolution (57). However, it is not yet clear if the characteristics of microchromosomes and RRCs we observed are unique to sea turtles, or, more likely, commonly observed in other vertebrates. Our detailed analyses of RRCs, microchromosomes, and their associated genes were only possible due to the high quality of the assembled sea turtle genomes because these analyses can be sensitive to genome fragmentation and misassemblies (39). The prevalence or importance of such localized genomic differentiation among other closely or more distantly related reptiles or other vertebrate groups has not been evaluated due to a lack of equivalent genomic resources, but this is rapidly changing. As chromosomal-level genomes across all vertebrate lineages soon become available, our work provides a roadmap for identifying genomic regions harboring contrasting expansion/contractions of gene families and diversity levels in different vertebrate lineages. For taxa with highly conserved genomes like sea turtles, analyses of RRCs and microchromosomes are likely important to understand their divergent evolutionary histories and the phenotypic connections of the genes within them.
Contrasting sensory and immune gene evolution between sea turtle families
Sea turtles have complex sensory systems and can detect both volatile and water-soluble odorants, which are imperative for migration, reproduction and identification of prey, conspecifics, and predators (58-62). However, leatherback and green turtles occupy dissimilar ecological niches that depend on different sensory cues. While leatherback turtles inhabit the pelagic environment their entire lives post-hatching, performing large horizontal and vertical migrations to seek out prey patches of jellyfish and ctenophores (63), green turtles recruit as juveniles to neritic coastal and estuarine habitats and can have highly variable diets (64, 65). Substantial differences have been detected in the morphology of sea turtle nasal cavities, with leatherback turtle cavities relatively shorter, wider, and more voluminous than chelonids (66-68), suggesting reduced requirements for olfactory reception. OR genes encode proteins used to detect chemical cues, with the number of OR genes present in a species’ genome strongly correlated to the number of odorants that it can detect (69), and linked to the chemical complexity of its environment (70). The two major groups of ORs in amniote vertebrates are separated by their affinities with hydrophilic molecules (Class I) or hydrophobic molecules (Class II) (71). Class I OR genes may be particularly important in aquatic adaptation (32), and expansions of Class I ORs in testudines, including green turtles, have been previously reported, although with some uncertainty due to the use of short-read assemblies (32, 41, 72). Our reconstruction of both Class I and Class II OR gene evolution throughout the sea turtle lineage revealed that after ancestral contractions, gene copy evolution diverged in opposite directions between the sea turtle families. 
The greater loss of Class II compared to Class I ORs in the ancestral sea turtle lineage likely reflects relaxed selection for detection of airborne odorants, as has been observed in other lineages that recolonized marine environments, including marine mammals (73). However, as sea turtles continue to use terrestrial habitats for reproduction, they need to retain some of these capabilities, which could explain why the contraction was weaker than observed in fully marine species (e.g., the vaquita Phocoena sinus; Fig. 2e).
The strong Class I OR expansion in the green turtle may be related to its distribution in complex neritic habitats and variable diet, requiring detection of a high diversity of waterborne odorants, while the continued loss of ORs in the leatherback turtle could be a consequence of its more specialized diet and lower complexity of pelagic habitats. Although leatherback turtles can detect jellyfish chemical cues, sensory experiments have indicated that visual cues are more important for food recognition in this species (74). Additionally, while the precise mechanisms underpinning philopatry in sea turtles still remain unclear, green turtles are thought to use olfactory cues to reach natal nesting beaches following long-distance navigation guided by magnetoreception (60, 62). Leatherback turtles exhibit more ‘straying’ from natal rookeries than other species, and such relaxed philopatry may be related to reduced capabilities or reliance on olfactory cues to home in on specific beaches.
The diversity of the highly-complex MHC region is a key component in the vertebrate immune response to novel pathogens, with greater gene copy numbers and heterozygosity linked to lower disease susceptibility (75). While both sea turtle species contained most of the core MHC-related genes, the green turtle had more copies of genes involved in adaptive as well as innate immunity. Pathogen prevalence and persistence are often greater in neritic habitats than open ocean habitats (76), so green turtles may be exposed to higher pathogen loads and diversity than leatherback turtles (77). However, reptilian immune systems are understudied compared to other vertebrates, and very few studies of MHC genes have been conducted in turtles (78). Thus, it is not yet understood how immune gene diversity translates into disease susceptibility or ecological adaptation in sea turtles, which is particularly critical for their conservation as FP continues to threaten the recovery of populations around the globe (30). Although this viral-mediated tumor disease occurs in all sea turtle species, there is high variation between species and populations in disease prevalence and recovery, making it plausible that harboring certain genes, copy numbers, or specific alleles may play important roles in disease dynamics. Despite decades of research, there have been no studies of the immunogenomic factors governing FP susceptibility or resilience, in part due to difficulty in accurately quantifying hypervariable and complex MHC loci with short-read sequencing technologies (79). Our reference genomes now enable studies accurately interrogating MHC and other immune loci to close this critical research gap, and advance our fundamental understanding of immune gene evolution in sea turtles.
Conservation of reproductive genes and repetitive elements
In contrast to olfactory and immune genes, almost all genes with a priori linkages to TSD pathways (80-82) occurred as single copy orthologs with highly conserved chromosomal locations between the two species. This is likely indicative of strong selection for conservation of this reproductive pathway, but our understanding of the specific roles these genes play in sea turtle TSD remains limited. Resolving whether inter- (83) and intra-specific (84) variations in thermal thresholds are due to the few genes that diverged from the general pattern we observed, functional sequence variation between orthologs, or other factors (e.g., epigenetic processes) is of high conservation concern for sea turtles (85), as climate warming is expected to skew sex ratios and alter population demographics (86) in the absence of substantial plasticity or adaptation. Our results serve as the foundation for these much-needed studies to quantify genomic mechanisms of TSD in sea turtles and determine their adaptive capacity to persist under climate change.
While REs in turtles have been investigated for over 30 years (87, 88), few studies have directly addressed the distribution and diversity of REs within testudine genomes (89). Both sea turtle genomes have substantially larger RE compositions (>40%) than previous estimates for other turtle species (41, 89, 90), including the draft genome of the green turtle (10% of the genome (41)). Interestingly, more recent reptile genome assemblies show higher proportions of REs (90, 91), with results similar to our estimates. The benefits of whole-genome approaches are further highlighted in the tuatara, where initial RE estimates suggested <10% of the genome was composed of REs (92), yet a subsequent whole-genome assembly increased this estimate to 64% (45). Collectively, these results support the notion that RE patterns could be more conserved across non-avian reptiles than previously believed, and the continued application of recent advances in genome sequencing, assembly methods, and analyses is needed to better understand the RE patterns and the processes that generate them (39, 43).
Differential genomic diversity and demographic histories
Genomic diversity is a critical metric for evaluating extinction risk and adaptive potential to environmental perturbation (93-95), with heterozygosity positively correlated with individual fitness (see reviews by (96, 97)). Understanding the causes and consequences of genomic diversity is imperative for sea turtles, and for leatherback turtles in particular, where contemporary populations have experienced recent sharp declines due to human activities (25). The leatherback turtle genome exhibited exceptionally low diversity relative to the green turtle, other reptiles and mammals, broadly aligning with previous estimates (98, 99). However, factors influencing genomic diversity can vary among species (100), and our PSMC and ROH results indicate that low diversity in the leatherback turtle is likely a consequence of long-term low effective population sizes rather than recent population reductions. This is consistent with mitochondrial analyses suggesting that contemporary populations radiated from a small number of matriarchal lineages within a single refugium following the Pleistocene (99). The low, relatively evenly spread heterozygosity is also congruent with sustained low population sizes with frequent outbreeding, similar to that observed in several mammal species (101, 102). Encouragingly, although the reference genome was generated from an individual from the West Pacific leatherback turtle population that has suffered precipitous declines (103), we did not detect patterns consistent with recent inbreeding. This suggests that if ongoing anthropogenic threats are mitigated, the population may still be large enough to avoid complications arising through inbreeding depression during recovery. However, the possibility that population declines have occurred too rapidly for the impacts of inbreeding to yet be detected warrants cautious optimism and continued genomic monitoring of the population.
In contrast, the higher genomic diversity with some long ROHs observed in the green turtle likely reflects their radiation from many refugia (104), as well as relatively recent inbreeding events. This is potentially because the green turtle genome was generated from an individual from the Mediterranean population where nesting populations are relatively small (105) which, combined with strong natal philopatry (106), may increase the chance of inbreeding.
Regardless of the causes of current genomic diversity levels in sea turtles, the amount of standing variation may have important implications for their future persistence (107), especially given the adaptive capacity likely required to keep pace with rapid anthropogenic global change. With extremely low genomic diversity and a higher genetic load compared to the green turtle, the risks are presumably of greater concern for leatherback turtles. Additionally, leatherback turtles have substantially lower hatching success compared to other sea turtle species (29), which is potentially related to the heightened genetic load and low heterozygosity (108, 109) and may combine with other factors to slow population recoveries following conservation measures. However, recent studies have documented low genome diversity in a number of species with wide geographic distributions and relatively large census population sizes, including some long-lived marine vertebrates (101, 110–113). Other species with low diversity have rebounded following population declines and/or appear to have purged deleterious alleles through long-term low population sizes (111, 114, 115), thereby limiting the impact of low genomic diversity on viability (56, 111, 116). Although our results of a greater genetic load despite long-term low Ne suggest this is not the scenario for leatherback turtles, further assessments of more individuals over greater spatial and temporal scales are needed. Studies enabled by the reference genomes presented here quantifying diversity and genetic load within and among global populations will clarify these relationships for leatherback turtles and other sea turtle species to guide conservation recommendations.
In contrast to the long-term low Ne of leatherback turtles, our demographic reconstructions showed the Ne of the green turtle has fluctuated widely over the same period. These fluctuations appear correlated with climatic events, beginning with the closure of the Tethys Sea, which altered ocean connectivity and represented a period of increasing temperatures that may have opened more suitable habitat. As temperatures subsequently decreased, Ne also decreased; however, temperature fluctuations during the Pleistocene were associated with an additional increase in Ne, and a subsequent increase in Ne was associated with warmer temperatures following the Eemian period. While warmer temperatures presumably allowed for larger population sizes of green turtles, spikes in Ne (e.g., ∼100 KYA) are most likely associated with mixing of previously isolated populations due to warm-water corridors allowing movement between populations and ocean basins (117). The lower, long-term Ne of leatherback turtles may reflect a reduced census size associated with this species’ greater mass and trophic position.
Following an initial decrease in Ne associated with declining temperatures, the Ne of leatherback turtles remained relatively constant throughout the fluctuations of the Pleistocene. Although reptiles are generally sensitive to climatic thermal fluctuations, leatherback turtles exhibit unique physiological adaptations that produce regional endothermy and facilitate exploitation of cold-water habitats (6), potentially making them less susceptible to periods of cooler temperatures. While our overall estimates and trends for both species were broadly concordant with previous studies (99, 118, 119), a recent study using MSMC found steep declines in Ne for green turtles >100,000 years before present (119), which were not detected in our PSMC analyses. Since these declines were also not detected in a prior study using PSMC on the draft green turtle genome (118), this is likely a consequence of the different methods, with MSMC analyses inferring large Ne for more ancient time scales (120).
Enabling future research and conservation applications
In addition to the insights reported here, the reference genomes for both extant sea turtle families provide invaluable resources to enable a wide breadth of previously unattainable fundamental and applied research. Combined with other forthcoming chromosomal-level vertebrate genomes, in-depth comparative genomics analyses can further investigate ecological adaptation related to immune and sensory gene evolution, as well as the genomic basis for traits of interest such as adaptation to saltwater, diving capacity, and long-distance natal homing. Studies leveraging these reference genomes alongside whole-genome sequencing of archival sample collections can assess how genomic erosion, inbreeding and mutational load are linked to population size, trajectories, and conservation measures in global sea turtle populations. For instance, the fact that leatherback turtles have persisted with low diversity and Ne for long time periods offers hope for their recovery, but given that some populations have now been reduced to only a few hundred individuals (103), research quantifying purging of deleterious alleles, inbreeding depression and adaptive capacity within populations is urgently needed (121). Many conservation applications that may not require whole-genome data can also benefit from the utility of these reference genomes, including the development of amplicon panels and molecular assays to investigate TSD mechanisms and adaptive capacity under climate change, and assessing linkages between immune genes and disease risk. Finally, with global distributions and long-distance migratory connectivity, sea turtle conservation requires international collaboration that has been previously hampered by difficulty comparing datasets between laboratories. 
Existing anonymous markers can now be anchored to these genomes, and new ones can be optimized for conservation-focused questions and shared across the global research community, facilitating large-scale syntheses and equitable capacity building for genomics research. While ongoing anthropogenic impacts continue to threaten the viability of sea turtles over the coming century, these genomes, combined with the important work of reducing major threats such as fisheries bycatch and habitat loss, will enable research that makes critical contributions to recovering imperiled populations.
Methods
Sample collection, genome assembly and annotation
Blood was collected from leatherback and green turtles using minimally invasive techniques for isolation of ultra-high molecular weight DNA, and tissue samples of internal organs for RNA were collected opportunistically from recently deceased or euthanized animals. Full details of sample collection, storage, and laboratory processing prior to sequencing can be found in Supplementary Appendix I. Resulting raw data were deposited into the VGP Genome Ark and NCBI Short-Read Archive (SRA) (see Data Accessibility Statement). We assembled both genomes using four genomic technologies following the VGP pipeline v1.6 (39) with a few modifications detailed in Supplementary Appendix I. Briefly, PacBio Continuous Long Reads were assembled into haplotype phased contigs, with contigs scaffolded into chromosome-level super scaffolds using a combination of 10X Genomics linked reads, Bionano Genomics optical maps, and Arima Genomics Hi-C 3D chromosomal interaction linked reads.
Base call errors were corrected to achieve high quality (>Q40). The assemblies were manually curated, with structural errors corrected according to the Hi-C maps (Fig. S1), and the 28 super scaffolds (hereinafter referred to as chromosomes) numbered in both species according to sequence lengths in the leatherback turtle assembly, and synteny between the two species. A manual inspection comparing the sequence collinearity between the first curated versions of the genomes revealed a small number of artefactual sequence rearrangements that were corrected in a second round of manual curation (see Supplementary Appendix I).
To enable accurate, species-specific annotations for each genome, both short and long-read transcriptomic data (RNA-Seq and Iso-Seq) were generated from tissues known for their high transcript diversity in each species. These data, plus homology-based mapping from other species, were used to annotate the genomes using the standardized NCBI pipeline (122). Briefly, we performed annotation as previously described (39, 123), using the same RNA-Seq, Iso-Seq, and protein input evidence for the prediction of genes in the leatherback and green turtles. We aligned 3.5 billion RNA-Seq reads from eight green turtle tissues (blood, brain, gonads, heart, kidney, lung, spleen and thymus) and 427 million reads from four leatherback turtle tissues (blood, brain, lung and ovary) to both genomes, in addition to 144,000 leatherback turtle and 1.9 million green turtle PacBio IsoSeq reads, and all Sauropsida and Xenopus GenBank proteins, known RefSeq Sauropsida, Xenopus, and human RefSeq proteins, and RefSeq model proteins for Gopherus evgoodei and Mauremys reevesii.
Genome quality analysis
We used the assembly-stats pipeline (https://github.com/sanger-pathogens/assembly-stats) to estimate the scaffold N50, size distributions and assembly size. BUSCO analysis (115) and QV value estimations (116) were conducted to assess the overall completion, duplication, and relative quality of the assemblies. We used D-GENIES (118) with default parameters to conduct dot plot mapping of the entire genomes and of each individual chromosome to evaluate the synteny between leatherback and green turtle genomes, and Haibao Tang's JCVI utility libraries following the MCScan pipeline (119) to verify the contiguity of the genomes. Incongruences in gene synteny blocks were manually investigated using the Artemis Comparison Tool (120), identifying possible regions of inversion that could be caused by artifacts during assembly. These regions were then corrected in the latest version of the assembly for both species. Only a few structural rearrangements between the two species remained after two rounds of manual curation with support of sequencing data. The final curated assemblies were analyzed using the Genome Evaluation Pipeline (https://git.imp.fu-berlin.de/cmazzoni/GEP) to obtain all final QC plots and summary statistics.
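As an illustration of the scaffold N50 statistic mentioned above, the following minimal Python sketch computes it from a list of sequence lengths. It is not the assembly-stats implementation; the function names `n50` and `assembly_summary` are hypothetical and the example lengths are made up.

```python
def n50(lengths):
    """Return the N50: the length L of the sequence at which the running
    total, taken from longest to shortest, first reaches half the assembly."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


def assembly_summary(lengths):
    """Collect a few basic contiguity statistics (hypothetical helper)."""
    return {
        "n_sequences": len(lengths),
        "total_length": sum(lengths),
        "longest": max(lengths),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Toy scaffold lengths, not values from the turtle assemblies.
    print(assembly_summary([10, 8, 6, 4, 2]))
```

The same function applies unchanged to contig lengths, which gives the contig N50.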
Identification and analysis of RRCs and REs
Leatherback and green turtle genomes were mapped to each other using Minimap2, with a dot plot of the mapping generated using D-GENIES (124). Using windows of 20 Mb, the dot plot was screened visually, and regions larger than 1 Mb showing reduced collinearity (i.e., one or more breaks in the diagonal indicating homology), as well as smaller regions with obvious signals of genomic rearrangements (e.g., inversions), were cataloged as regions of reduced collinearity (RRCs). Several genomic features were examined within these regions and compared to regions of the same length directly up- and down-stream of the RRCs (Table S3). We identified the functions of the genes present in RRCs using genome annotations and identified protein domains using InterProScan (125). The proportion of GO terms in each chromosome was estimated for each species using PANTHER (126; Fig. S16). To examine if RRCs presented differential patterns of sequence and/or gene duplication between the species, we aligned the genomes of the sea turtles against each other using Progressive Cactus (127, 128), and all homologous genes that presented more than one copy for one of the two species were isolated using an in-house script (IdentifyDupsReciprocalBlast.sh) to retrieve duplicated genes (see Supplementary Appendix I for further details on Cactus alignments). Repetitive elements (REs) were identified by creating a de novo database of transposable elements using RepeatModeler2 (129), followed by running RepeatMasker (130, 131) to calculate Kimura values for all REs (see full analysis details in Supplementary Appendix I).
Gene families and gene functional analysis
To estimate the timing of gene family evolution for the OR gene families in sea turtles, we used Computational Analysis of gene Family Evolution v5 (CAFE5) (132). Briefly, CAFE5 uses phylogenomics and gene family sizes to identify gene family expansions and contractions. We used a dataset containing 8 species of turtle, 4 non-turtle reptiles, 3 mammals and 1 amphibian, with orthogroups identified using OrthoFinder (133, 134). OR orthogroups were grouped based on subfamily (Class I and Class II; see (72)), and an ultrametric phylogeny was generated by gathering 1:1 orthologs. We then aligned amino acid sequences for each orthogroup and generated a phylogenetic tree (see Supplementary Appendix I for details).
Like many reptile species, sea turtles possess TSD. We compiled a list of 217 genes that have been implicated in TSD in reptiles (see Table S7). To determine if these genes were present in our assembled genomes, we employed two methods of investigation. We first searched the genome annotations for gene identifiers and protein names, and then ran a BLAST search of homologous sequences to account for variations in gene identifiers between taxonomic groups (see Supplementary Appendix I for details). Resultant locations on both genomes were plotted on a Circos plot using CIRCA (http://omgenomics.com/circa).
To identify genes related to immunity, and the MHC in particular, we searched the genome for the list of core MHC genes provided in Gemmell et al. (2020) (45). Genes were searched for in a similar way to the method used for the TSD genes, with initial searches of gene identifications, followed by a search of protein identifiers. As genes associated with the MHC are diverse, and vary substantially among species, we did not use a BLAST search for these genes. Locations of the genes were then compared between species to determine which genes were annotated, and where the core MHC region is located within the genomes.
Genetic distance, genome diversity, runs of homozygosity, and historical demography
In order to estimate the genetic distance between the leatherback and green turtle genomes, we used the halSnps pipeline (135) to compute interspecific single variants based on genome alignments obtained with Progressive Cactus (127, 128) using the leatherback turtle genome as the reference. Genetic distances were calculated for windows across the genome where each window included exactly 10,000 positions presenting single alignments against the green turtle genome in the Cactus output. Positions with zero, or more than one alignment were ignored, and if this occurred over more than 50% of a given window, it was skipped entirely (i.e., each window analyzed covered between 10 and 20 Kb of the genome). Interspecific distances per bp were calculated by dividing the number of variants found within a window by 10,000.
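The windowing rule described above can be sketched as follows. This is a minimal illustration rather than the halSnps-based pipeline itself: the per-position alignment counts and the set of variant positions are assumed to have been extracted beforehand, and the function name `windowed_distance` and its parameters are hypothetical.

```python
def windowed_distance(align_counts, variant_positions,
                      sites_per_window=10_000, max_ignored_fraction=0.5):
    """Per-bp interspecific distance in windows of singly aligned positions.

    align_counts[i] is the number of alignments covering reference
    position i (assumed input); only positions with exactly one alignment
    are counted. A window closes once it holds `sites_per_window` such
    positions and is skipped if more than `max_ignored_fraction` of the
    reference span it covers had to be ignored.
    Returns (start, end_exclusive, distance) tuples.
    """
    variants = set(variant_positions)
    results = []
    start = 0
    n_single = 0
    n_variants = 0
    for pos, count in enumerate(align_counts):
        if count != 1:
            continue  # zero or multiple alignments: position ignored
        n_single += 1
        if pos in variants:
            n_variants += 1
        if n_single == sites_per_window:
            span = pos - start + 1
            ignored_fraction = 1 - n_single / span
            if ignored_fraction <= max_ignored_fraction:
                results.append((start, pos + 1, n_variants / sites_per_window))
            # Start the next window right after this one.
            start = pos + 1
            n_single = 0
            n_variants = 0
    # A trailing, incomplete window is dropped in this sketch.
    return results


if __name__ == "__main__":
    # Toy example with 4 sites per window instead of 10,000.
    counts = [1, 1, 0, 1, 1] + [0] * 6 + [1] * 4 + [1] * 4
    print(windowed_distance(counts, {1, 12, 16, 17}, sites_per_window=4))
```

With the default settings, a retained window spans at most 20,000 reference positions, which matches the 10 to 20 Kb range stated in the text.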
We calculated genome-wide heterozygosity using a method adapted from Robinson et al. (2019) (102). Briefly, we used the Genome Analysis Toolkit (GATK) (136) to call genotypes at every site across the genome using the 10X reads sourced from the reference individual mapped back to the reference genomes using BWA-mem (137). Heterozygosity was calculated within 100 Kb non-overlapping windows, with only sites that had a depth of between ⅓× and 2× mean coverage retained for genotype scoring. Heterozygosity was calculated within these windows for (1) the entire genome, (2) the genome with repeat regions masked, (3) only exon regions, and (4) regions that were classified as ‘non-exons.’ We also adapted this pipeline to generate genome-wide heterozygosity for a number of additional reptilian and outgroup species with sequences sourced from the NCBI SRA where species-specific reference genomes were available (see details in Supplementary Appendix I).
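The depth filter and windowing described above can be illustrated with a short Python sketch. It assumes that genotypes have already been called and reduced to (position, depth, is_heterozygous) tuples for one scaffold, and that per-window heterozygosity is the number of heterozygous sites divided by the number of retained called sites; both the input format and the function name `windowed_heterozygosity` are assumptions made for illustration, not the authors' pipeline.

```python
from collections import defaultdict


def windowed_heterozygosity(sites, mean_coverage, window_size=100_000):
    """Heterozygosity per non-overlapping window on one scaffold.

    sites: iterable of (position, depth, is_het) tuples (assumed format).
    Only sites with depth between 1/3x and 2x the mean coverage are kept.
    Returns {window_start: het_sites / retained_called_sites}.
    """
    min_depth = mean_coverage / 3
    max_depth = 2 * mean_coverage
    called = defaultdict(int)
    hets = defaultdict(int)
    for pos, depth, is_het in sites:
        if not (min_depth <= depth <= max_depth):
            continue  # fails the depth filter
        window = pos // window_size
        called[window] += 1
        if is_het:
            hets[window] += 1
    return {w * window_size: hets[w] / called[w] for w in sorted(called)}


if __name__ == "__main__":
    # Toy data with 10 bp windows and a mean coverage of 30x.
    toy = [(0, 30, False), (1, 30, True), (2, 5, True), (3, 61, True),
           (12, 10, True), (13, 60, False), (25, 29, False)]
    print(windowed_heterozygosity(toy, mean_coverage=30, window_size=10))
```

Windows with no retained sites are simply absent from the result, so downstream summaries can decide how to treat them. The masked, exon-only and non-exon variants would restrict `sites` to the relevant intervals before calling the function.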
ROHs were calculated by initially generating an additional SNP-list using whole-genome resequenced information from five additional individuals for each species (Table S11). This SNP-list was generated through the Analysis of Next Generation Sequencing Data (ANGSD; (138)) pipeline due to the low to moderate coverage of the additional samples (∼2-13×). ANGSD was parameterized to output files that were configured for use as input for the ROH analysis incorporated in PLINK (139). ROHs were then further characterized as ‘short’ (100-500 Kb), ‘medium’ (500 Kb-1 Mb), or ‘long’ (>1 Mb) based on their length. ROHs for only the reference individuals are presented.
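The length classes above can be expressed as a small Python sketch. The stated ranges share their endpoints, so the boundary handling here (500 Kb counted as 'medium', exactly 1 Mb counted as 'medium') is an assumption, as are the function names `classify_roh` and `summarize_rohs`.

```python
from collections import defaultdict


def classify_roh(length_bp):
    """Assign an ROH to a length class; returns None below 100 Kb."""
    if length_bp < 100_000:
        return None
    if length_bp < 500_000:
        return "short"
    if length_bp <= 1_000_000:
        return "medium"
    return "long"


def summarize_rohs(lengths_bp):
    """Count ROHs and sum their lengths per class (hypothetical helper)."""
    summary = defaultdict(lambda: {"count": 0, "total_bp": 0})
    for length in lengths_bp:
        label = classify_roh(length)
        if label is None:
            continue
        summary[label]["count"] += 1
        summary[label]["total_bp"] += length
    return dict(summary)


if __name__ == "__main__":
    # Toy ROH lengths in bp, not values from the turtle genomes.
    print(summarize_rohs([120_000, 450_000, 700_000, 2_500_000, 50_000]))
```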
The alignments of the 10X reads for the reference individuals were also used as input for Pairwise Sequentially Markovian Coalescent (PSMC; (140)) analysis of demographic history for both species. We used SAMtools (141) and BCFtools (142) to call genotypes with base and mapping quality filters of >Q30. We also filtered for insert size (50-5,000 bp) and allele balance (AB) by retaining only biallelic sites with an AB of <0.25 and >0.75. We then ran PSMC analysis using the first 10 scaffolds, which constituted over 84% of the total length of the genome, using a generation time of 30 years (mid-way between reported generation times for both species; see Supplementary Appendix I), and a mutation rate of 1.2 × 10⁻⁸ (118).
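To show how the generation time and mutation rate enter the results, the sketch below applies the standard PSMC scaling relations (N0 = θ0 / (4μs), time in years = 2·N0·t·g, Ne = N0·λ) to raw PSMC output values. The bin size s = 100 is the PSMC default and is an assumption here, as is treating the mutation rate as per site per generation; the function name `scale_psmc` and the toy numbers are hypothetical.

```python
def scale_psmc(theta0, intervals, mutation_rate=1.2e-8,
               generation_time=30.0, bin_size=100):
    """Convert raw PSMC output to years before present and Ne.

    theta0: scaled mutation rate per bin reported by PSMC.
    intervals: list of (t_k, lambda_k) pairs, with time in units of 2*N0
        generations and population size relative to N0.
    bin_size: bases per bin in the PSMC input (assumed default of 100).
    """
    n0 = theta0 / (4 * mutation_rate * bin_size)
    scaled = []
    for t_k, lambda_k in intervals:
        years = 2 * n0 * t_k * generation_time
        ne = n0 * lambda_k
        scaled.append((years, ne))
    return scaled


if __name__ == "__main__":
    # Toy values chosen so that N0 is about 10,000.
    for years, ne in scale_psmc(0.048, [(0.0, 1.0), (0.5, 2.0)]):
        print(f"{years:12.0f} years ago  Ne = {ne:10.0f}")
```

Because both axes scale with 1/μ, and the time axis also with g, a different mutation rate or generation time shifts and stretches the inferred curve without changing its shape.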
Genetic load
Estimates of deleterious allele accumulation were conducted using the snpEff variant annotation software (143). We estimated the impacts of variants from coding regions using the species-specific genome annotations generated for both species, with a total of 18,775 genes for the leatherback turtle genome, and 19,752 genes for the green turtle genome used in the analysis. Variants were only included in the analyses if they met stringent quality requirements, with loci filtered during genotyping based on depth of coverage (⅓ × - 2× mean coverage) and base quality metrics (Q < 20). The snpEff program predicts variant impacts and bins them into ‘high’, ‘moderate’, or ‘low’ impact categories, and outputs a list of genes that have predicted variant effects.
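As a rough illustration of how the impact categories can be tallied, the Python sketch below counts variants by their most severe predicted impact from a snpEff-annotated VCF. It relies on the snpEff `ANN` INFO field, in which comma-separated annotations carry the impact as the third pipe-delimited subfield; the function names and the toy records are hypothetical, and this is not the authors' script.

```python
from collections import Counter

# snpEff impact levels, most to least severe.
SEVERITY = ["HIGH", "MODERATE", "LOW", "MODIFIER"]


def most_severe_impact(info_field):
    """Return the most severe impact in a VCF INFO string, or None."""
    for entry in info_field.split(";"):
        if not entry.startswith("ANN="):
            continue
        impacts = set()
        for annotation in entry[4:].split(","):
            fields = annotation.split("|")
            if len(fields) > 2:
                impacts.add(fields[2])  # third subfield is the impact
        for level in SEVERITY:
            if level in impacts:
                return level
    return None


def count_impacts(vcf_lines):
    """Count variants per most severe impact across VCF data lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 8:
            continue  # not a complete VCF record
        impact = most_severe_impact(columns[7])
        if impact is not None:
            counts[impact] += 1
    return counts


if __name__ == "__main__":
    toy = [
        "##fileformat=VCFv4.2",
        "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30;ANN=G|missense_variant|MODERATE|GENE1,G|upstream_gene_variant|MODIFIER|GENE2",
        "chr1\t200\t.\tC\tT\t50\tPASS\tANN=T|stop_gained|HIGH|GENE3",
    ]
    print(dict(count_impacts(toy)))
```

Counting each variant once by its most severe annotation is one possible convention; counting every annotation separately would give larger totals.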
Funding
Funding was provided by the University of Massachusetts Amherst, NSF-IOS (grant #1904439 to LMK), NOAA-Fisheries, Vertebrate Genomes Project, Rockefeller University to EDJ, HHMI to EDJ, the Sanger Institute, Max-Planck-Gesellschaft, as well as grant contributions from Tom Gilbert, Paul Flicek, Robert Murphy, Karen A. Bjorndal, Alan B. Bolten, Ed Braun, Neil Gemmell, Tomas Marques-Bonet, and Alan Scott. We also acknowledge CONICYT-DAAD for scholarship support to TCV, and EKSR was supported by São Paulo Research Foundation - FAPESP (grant #2020/10372-6). BeGenDiv is partially funded by the German Federal Ministry of Education and Research (BMBF, Förderkennzeichen 033W034A). The work of FT-N and PM was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. The work of MP was partially funded through the Federal Ministry of Education and Research (grant 01IS18026C). HP was supported by a Formació de Personal Investigador fellowship from Generalitat de Catalunya (FI_B100131). MK was supported by “la Caixa” Foundation (ID 100010434), fellowship code LCF/BQ/PR19/11700002 and the Vienna Science and Technology Fund (WWTF) and the City of Vienna through project VRG20-001.
Data Accessibility Statement
Assemblies for both species have been deposited on NCBI GenBank. The NCBI GenBank accession numbers for the leatherback turtle genome assembly (rDerCor1) are GCF_009764565.3 and GCA_009762595.2 for the annotated primary and original alternate haplotypes in BioProject PRJNA561993, and for the green turtle assembly (rCheMyd1) are GCF_015237465.2 and GCA_015220195.2 for primary and alternate haplotypes respectively in BioProject PRJNA561941. The raw data used for assemblies are available on the Vertebrate Genome Ark (https://vgp.github.io/genomeark/). The leatherback turtle RNA-Seq data generated for the purpose of assembly annotation was deposited in the SRA under accession numbers SRX8787564-SRX8787566 (RNA-Seq) and SRX6360706-SRX6360708 (ISO-Seq). Green turtle RNA-Seq data generated for annotation were deposited in SRA under accessions SRX10863130-SRX10863133 (RNA-Seq) and as SRX11164043-SRX11164046 (ISO-Seq). All scripts used for downstream analyses following genome assembly and annotation have been deposited on GitHub under repository https://github.com/bpbentley/sea_turtle_genomes.
Acknowledgments
We are grateful for the assistance with the (1) leatherback turtle sample collection from the St. Croix Sea Turtle Program and the US Fish and Wildlife Service, the NOAA-SWFSC California in-water leatherback research team, and the New England Aquarium; (2) green turtle sample collection from the Israel National Sea Turtle Rescue Centre, the NOAA PIFSC-MTBAP team, and Thierry Work (USGS). We also thank Estefany Argueta and Jamie Stoll for assistance with literature searches for TSD and immune genes, and Phillip Morin, Andrew Foote, Anna Brüniche-Olsen, Annabel Beichman, Morgan McCarthy, David L. Adelson, and Yuanyuan Cheng for their invaluable discussions surrounding analysis approaches and comments on the manuscript. Sequencing of the green turtle has been performed by the Long Read Team of the DRESDEN-concept Genome Center, DFG NGS Competence Center, part of the Center for Molecular and Cellular Bioengineering (CMCB), Technische Universität Dresden and MPI-CBG.
Footnotes
Updated manuscript and supplementary information uploaded after the identification of some errors.
Abbreviations
- TE: transposable element
- RE: repetitive element
- RRC: region of reduced collinearity
- FP: fibropapillomatosis
- ROH: runs of homozygosity