Abstract
The harbour porpoise (Phocoena phocoena) is a highly mobile cetacean found in waters across the Northern hemisphere. It occurs in coastal waters and inhabits basins that vary broadly in salinity, temperature, and food availability. These diverse habitats could drive differentiation among populations; population structure within the North Atlantic (north of 51° latitude) is not fully resolved, particularly in relation to Baltic Sea populations. Here we report the first harbour porpoise genome, assembled de novo from a Swedish Kattegat individual. The genome is one of the most complete cetacean genomes currently available, with a total size of 2.7 Gb and 50% of the total length found in just 34 scaffolds. Using the largest scaffolds, we were able to examine chromosome-level rearrangements relative to the genome of the closest related species with a chromosome-level assembly available, domestic cattle (Bos taurus). The draft annotation comprises 22,154 predicted gene models, which we further annotated through matches to the NCBI nucleotide database, GO categorization, and motif prediction. To infer the adaptive abilities of this species, as well as its population history, we performed Bayesian skyline analysis of the genome, which is concordant with the demographic history of this species, including expansion and fragmentation events. Overall, this genome assembly, together with the draft annotation, represents a crucial addition to the limited genetic markers currently available for the study of porpoise and cetacean conservation, phylogeny, and evolution.
Introduction
As an apex predator, the harbour porpoise (Phocoena phocoena) is a key indicator for conservation and biodiversity measurements in the Nordic Seas (Hooker & Gerber, 2004; Lawrence et al., 2016; Sergio et al., 2008). Marine mammals in particular face many threats from their environment (Fietz et al., 2013; Godard-Codding et al., 2011), including noise pollution (Dyndo et al., 2015; Nabe-Nielsen et al., 2014), marine debris and by-catch (Scheidat et al., 2008; Unger et al., 2017), predation by grey seals (Leopold et al., 2014), and infectious diseases (Siebert et al., 2001; van Beurden et al., 2017). These threats affect the structure, boundaries, and stability of populations. This is especially true in the Kattegat/Baltic Sea area, where broad ecological shifts have occurred on a relatively short time scale. Since forming 15,000 years ago, the Baltic has undergone brackish, marine, and completely freshwater periods, and has encountered increasing and continuous human impacts including eutrophication, pollution, and overharvesting (Korpinen et al., 2012; Paasche et al., 2015; Ukkonen et al., 2014; Varjopuro et al., 2014). This geological history has created a series of challenges for marine species, and has likely fostered local adaptation and population differentiation.
Harbour porpoises are the most abundant coastal cetaceans across their wide distribution from sub-polar to temperate waters in the Northern hemisphere (Fontaine et al., 2017; Gaskin, 1984). As one of the smallest marine mammals, they belong to the Delphinoidea and are the sister group to the Monodontidae (Gatesy et al., 2013; Geisler et al., 2011; Hassanin et al., 2012). Three subspecies of harbour porpoise, P. p. vomerina (North Pacific), P. p. relicta (Black Sea) and P. p. phocoena (North Atlantic), can be differentiated genetically (Rosel et al., 1999), but also by morphological traits including body size and diet (Fontaine et al., 2017; Galatius et al., 2012).
The population size of harbour porpoises in European Atlantic Shelf waters is estimated to be 375,000, with shifts across the last decade in the exact regions they occupy (e.g. in the North Sea; Hammond et al., 2013). Estimates of population size in the western Baltic Sea are smaller, approximately 40,000 animals (Benke et al., 2014; Scheidat et al., 2008; Viquerat et al., 2014). The Baltic Sea proper population, which is not included in the former surveys, has very low estimates (below 500 individuals; Amundin, 2016) and is considered critically endangered (Benke et al., 2014; Hammond et al., 2008; Scheidat et al., 2008).
As with other marine mammals in the North Atlantic, e.g. grey and harbour seals, subpopulations of the harbour porpoise arose at the end of the last glacial period as North Sea populations recolonized the Baltic Sea (Fietz et al., 2016). These populations now show shifts in habitat use based largely on food availability (Hammond et al., 2013) and activity patterns (Nuuttila et al., 2017), and display fine-scale morphological and genetic differences (Fontaine et al., 2012, 2014; Wiemann et al., 2010) and significant isolation by distance (Lah et al., 2016). Recent studies based on morphometric and genetic data suggest that different ecotypes of harbour porpoise exist in the North Atlantic and Baltic Sea and may need further conservation measures (Fontaine et al., 2014, 2017; Galatius et al., 2012).
These fine-scale differences in morphology and behavior may constitute local adaptation, yet the genes underlying such potentially adaptive differentiation are still unknown and would be best investigated on a whole-genome scale. This requires high-quality genomic resources for the species. A genome will also allow for a broader investigation of population structure, demographic history, and functional and evolutionary questions, as has been shown for other cetacean species in recent studies (Foote et al., 2016; Keane et al., 2015; Nery et al., 2013; Sun et al., 2013; Yim et al., 2013; Zhou et al., 2013). In addition, a full genome will enable mapping of so far anonymous nuclear microsatellite (Wiemann et al. 2010) and SNP (Lah et al. 2016) loci, thus facilitating population genomic inference.
We present here the first de novo assembly of the full genome of the harbour porpoise, scaffolded with in vitro proximity ligation data (hereafter “Chicago” library), and draft-annotated to predict its protein-coding genes and their functions (deposited at NCBI as BioProject: PRJNA417595 with BioSample-ID: SAMN08000480). We demonstrate chromosome-level homology with other Cetartiodactyla (Gatesy et al., 2013), and provide insight into past population dynamics using a Bayesian skyline plot (Li & Durbin, 2011).
Materials and Methods
DNA sampling
Tissue for whole genome sequencing came from a single individual from the Kattegat (Glommen - Falkenberg), Sweden (ID: C2009/02665). Muscle tissue was sampled in July 2009 from a by-caught female, probably a young animal (22.4 kg, 110.5 cm), frozen, and transported to Potsdam, Germany for DNA extraction. Sample preparation and genomic DNA isolation were performed following the protocol of the Qiagen DNeasy Blood & Tissue Kit (Cat 69506, Hilden). Successful isolation of high molecular weight DNA was confirmed by Sanger sequencing of the mitochondrial control region and by visualization of fragment sizes of the entire extraction using the TapeStation (Agilent 2200, Santa Clara, CA 95051). By mtDNA sequencing, we verified that the analyzed specimen carried haplotype PHO7 (Tiedemann et al., 1996), indicative of the separate Beltsea population of the Kattegat/Western Baltic Sea region (Lah et al., 2016; Wiemann et al., 2010).
Genome sequencing and assembly
The draft de novo assembly was constructed from two libraries (insert sizes ca. 300 and ca. 500 bp), sequenced as 125 bp paired-end (PE) reads on the Illumina HiSeq 2500 at EUROFINS Genomics. Reads were trimmed using CUTADAPT v1.10 (Martin, 2011) and an initial assembly was made using SOAPDENOVO2 (Luo et al., 2015). DNA from the same sample was used by Dovetail Genomics for construction of a Chicago library (Putnam et al., 2016), which was sequenced as 150 bp PE reads on an Illumina NextSeq500 at the University of Potsdam. The draft assembly was then scaffolded with the Chicago library data to produce the final HiRise assembly, performed by Dovetail Genomics.
Presence of core, single copy, and orthologous genes was measured using CEGMA and BUSCO, run in the genome mode for the Laurasiatheria database (Simão et al., 2015). BLOBTOOLS was run to examine potential contaminants, based on divergence in GC-content and read coverage variation across the assembly (Laetsch & Blaxter, 2017).
Genome annotation
Genome annotation was performed with MAKER2 (Holt & Yandell, 2011) in two steps. MAKER2 makes use of different programs and draws from several lines of evidence. Prior to annotation, repetitive elements were soft-masked with REPEATMASKER (Smit et al., 2013-2015) using the te_protein repeat database (Smith et al., 2007). In the first MAKER2 run, three gene predictors were used: SNAP (Bromberg et al., 2008) was trained ab initio with the CEGMA results (Parra et al., 2007), GENEMARK-ES (Ter-Hovhannisyan et al., 2008) was run using an HMM produced by ab initio training on the whole P. phocoena genome, and AUGUSTUS was run using the presets for human, as is recommended for vertebrates (Stanke et al., 2004). Protein sequences supplied as evidence were obtained from the complete SwissProt database (553,941 proteins) plus NCBI entries of 184,527 proteins from eight different cetacean groups (all hits on 20 March 2017 to the following keywords: “Balaenopteridae”, “Lipotes vexillifer”, “Neophocaena”, “Orcinus orca”, “Phocoena”, “Physeter catodon”, “Pontoporia blainvillei”, “Tursiops truncatus”).
For the second MAKER2 run, we created a new SNAP HMM based on the first MAKER2 output, and ran MAKER2 with the same parameters as in the first run, exchanging only the SNAP HMM and excluding the protein evidence. The resulting CDS predictions were extracted from the final gff file, which was created by fathom, implemented in SNAP (Bromberg et al., 2008). These gene predictions were further verified by a BLASTN search against the entire GENBANK non-redundant nucleotide sequence database (downloaded 21.07.2017). Summary statistics were generated using GENOME ANNOTATION GENERATOR (Hall et al., 2014). We then used all CDS and their BLAST results in BLAST2GO (Goetz et al., 2008) to identify conserved protein domains with INTERPROSCAN (including a Pfam comparison). We functionally annotated the CDS with GO terms, a controlled vocabulary for describing gene function that is continually updated by the Gene Ontology Consortium (Ashburner et al., 2000; Carbon et al., 2017).
Comparative genomics
The closest relative with a chromosome-level assembly currently available is domestic cattle, Bos taurus. To validate our assembly, we compared our scaffolds to the B. taurus chromosomes (assembly UMD 3.1.1 downloaded from NCBI, ACCESSION DAAA00000000). Specifically, the 122 P. phocoena scaffolds of at least 1 Mbp were aligned to the B. taurus chromosomes using the nucmer software of the MUMMER package v. 3.23 (Kurtz et al., 2004). From the coordinates of these alignments, runs of ten or more consecutive matches, each of at least 250 bp, between a given P. phocoena scaffold and a B. taurus chromosome were extracted using custom perl scripts. Their start and end positions were used to generate a CIRCOS (http://circos.ca/) plot that shows regions of collinearity as well as rearrangements. For the CIRCOS plot, separate ribbons were displayed between a B. taurus chromosome and a P. phocoena scaffold for consecutive hits that were each no more than 20,000 bp apart. If a hit was more than 20,000 bp from the next run of consecutive hits, a new ribbon was started; in total, 24,394 separate ribbons were constructed (Figure 1).
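The ribbon construction described above can be illustrated with a minimal Python sketch. It is not the custom perl code used in this study; the function name build_ribbons, the tuple layout, and the decision to measure gaps on the cattle chromosome only are assumptions made for illustration. Hits are sorted per chromosome/scaffold pair, which simplifies the notion of "consecutive" matches.

```python
from itertools import groupby


def build_ribbons(hits, min_len=250, max_gap=20_000, min_hits=10):
    """Merge alignment hits into ribbons (illustrative sketch only).

    hits: iterable of (chromosome, scaffold, start, end) tuples, with
    start/end given on the chromosome (a simplifying assumption).
    Returns a list of (chromosome, scaffold, start, end, n_hits).
    """
    # Keep matches of at least min_len bp, sorted by pair and position.
    kept = sorted(h for h in hits if h[3] - h[2] + 1 >= min_len)
    ribbons = []

    def close(chrom, scaf, run):
        # A run only becomes a ribbon if it has enough consecutive hits.
        if len(run) >= min_hits:
            end = max(h[3] for h in run)
            ribbons.append((chrom, scaf, run[0][2], end, len(run)))

    for (chrom, scaf), group in groupby(kept, key=lambda h: (h[0], h[1])):
        run, run_end = [], None
        for hit in group:
            # Start a new run when the gap to the current run is too large.
            if run and hit[2] - run_end > max_gap:
                close(chrom, scaf, run)
                run, run_end = [], None
            run.append(hit)
            run_end = hit[3] if run_end is None else max(run_end, hit[3])
        close(chrom, scaf, run)
    return ribbons


# Toy data: ten 300 bp hits spaced 1 kb apart, one short hit (filtered out),
# and one isolated distant hit (its run is too short to form a ribbon).
toy_hits = [("25", "Sc1", i * 1000 + 1, i * 1000 + 300) for i in range(10)]
toy_hits += [("25", "Sc1", 9400, 9450), ("25", "Sc1", 50_000, 50_400)]
print(build_ribbons(toy_hits))
```

On the toy data, the ten closely spaced hits form a single ribbon, while the short hit and the distant singleton are discarded.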
Population genomics
Genome-wide diploid sequence data make it possible to reconstruct population history by estimating population sizes through the past (Li et al., 2011). To estimate the demographic history of the individual sequenced, we used the SNP frequency spectra based on our genome assembly, which is a haploid sequence, and the PE reads used to construct the de novo assembly prior to Chicago scaffolding (described above; we used both insert sizes). These reads were first mapped back to the final assembly using BWA (Li & Durbin, 2009). Variants were extracted from the resulting bam files using SAMTOOLS v1.6 and BCFTOOLS, and converted with the script vcfutils.pl (Li, Handsaker, et al., 2009). This generated a final *.fq.gz file, which was then used to generate the final Bayesian skyline plot in the PSMC package, using the perl scripts psmc2history.pl and psmc_plot.pl (Li et al., 2011). The parameters of the PSMC analysis were set following the recommendation of the authors (Li & Durbin, 2011, https://github.com/lh3/psmc), and we applied a generation time of 10 years (Birkun Jr. & Frantzis, 2008) and a mutation rate of 2.2 × 10^-9 per site per year (Taylor et al., 2007).
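To make the role of the generation time and mutation rate explicit, the following Python sketch shows how scaled PSMC output is commonly converted to years and effective population size, following the scaling described in the PSMC documentation (N0 = θ0 / (4μs), time = 2·N0·t, Ne = N0·λ). The function name scale_psmc, the bin size of 100 bp (the PSMC input default), and the example θ0 value are assumptions for illustration and are not values reported in this study.

```python
def scale_psmc(theta0, t_k, lambda_k,
               mu_per_year=2.2e-9, generation_years=10, bin_size=100):
    """Convert scaled PSMC output to (years before present, Ne).

    theta0   : scaled mutation rate per bin estimated by PSMC
    t_k      : scaled time of an interval (units of 2*N0 generations)
    lambda_k : relative population size in that interval
    bin_size : assumed bin size of the PSMC input (100 bp by default)
    """
    # Per-generation rate from the per-year rate and generation time.
    mu_per_generation = mu_per_year * generation_years
    # Reference effective population size N0.
    n0 = theta0 / (4 * mu_per_generation * bin_size)
    years = 2 * n0 * t_k * generation_years
    ne = lambda_k * n0
    return years, ne


# Hypothetical values, chosen only so that N0 is close to 10,000.
years, ne = scale_psmc(theta0=0.088, t_k=1.0, lambda_k=4.5)
print(f"{years:.0f} years ago, Ne = {ne:.0f}")
```

With the generation time and mutation rate given above, the per-generation mutation rate used for scaling is 2.2 × 10^-8 per site.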
Results
De novo assembly of the P. phocoena genome
Shotgun sequencing produced a total of 1,268M reads (Table 1), which were used to generate a draft assembly with 2.4M scaffolds and an N50 of 33.1 kb. This assembly was combined with the Chicago library data (556M reads) for final scaffolding by Dovetail Genomics (Putnam et al., 2016). The final HiRise assembly from Dovetail contains ca. 2M scaffolds (Table 2) and has a total length of 2.7 Gb (N50 of 23.8 Mb). The greatest improvement from the addition of the Chicago libraries is in building up the 34 longest scaffolds, which make up approximately half of the entire assembly (Table 2). The CIRCOS plot illustrates the near-completeness of these long scaffolds: we observe almost complete coverage of the cow chromosomes by scaffolds larger than 1 Mb in our assembly (Figure 1). The BUSCO and CEGMA analyses also suggest that we have largely reconstructed the entire genome; they identified 96.9% (91.3% complete) of the 2,586 Eukaryotic and 94.2% (88.7% complete) of the 6,253 Laurasiatheria BUSCO core genes, and 90% of the 248 ultra-conserved CEGs (54% complete).
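For readers unfamiliar with the contiguity statistics reported here, the following Python sketch shows how N50 (the length of the scaffold at which half of the total assembly length is reached, counting from the longest scaffold) and the corresponding scaffold count (often called L50) can be computed. The function name n50_l50 and the toy lengths are illustrative assumptions and are not part of the original analysis.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of scaffold lengths."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first scaffold that brings the running sum to at least
        # half of the total length defines N50; its rank is L50.
        if running >= half_total:
            return length, count


# Toy example: total length 100, half is reached with the second scaffold.
print(n50_l50([40, 30, 15, 10, 5]))  # (30, 2)
```

In these terms, the final assembly has an N50 of 23.8 Mb and an L50 of 34 scaffolds.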
Genome completeness and annotation
The MAKER2 annotation resulted in the prediction of 22,154 coding genes (Table 3). In total, 21,750 CDS had a BLAST hit against the nucleotide database, which accounts for 98% of all CDS. Of these BLAST hits, 99% were to vertebrate sequences, and these were dominated (90%) by hits to Cetacea (of which 59% were to Tursiops truncatus and 27% to Orcinus orca). Further annotation with INTERPROSCAN revealed 250,126 features of these predicted proteins. These comprise hits in several protein domain databases, e.g. 23,319 PFAM protein domains, 37,046 PANTHER gene families, 24,538 SUPERFAMILY annotations and 31,114 GENE3D domains. Assignment of the BLAST results to Gene Ontology (GO) categories resulted in 55,143 hits across the GO categories (Figure 2).
Inference of Kattegat/Baltic population history
We inferred the population history of the harbour porpoise P. phocoena from a single individual (Li et al., 2011) using the PSMC algorithm, which combines all PE read data generated. Between eight and four million years ago, the inferred effective population size (Ne) was low, around 10,000 individuals (Figure 3). It began to increase slightly at 3 Myr, and rose more rapidly around 2 Myr, reaching an Ne of 45,000 during the following 1.5 Myr. The estimated population size peaked approximately 400 kya before it dropped to a quarter of its previous size around 100 kya, leading to a very low Ne, similar to that seen in present-day populations (Hammond et al., 2013).
Discussion
We present here a high-quality de novo genome assembly for the harbour porpoise Phocoena phocoena. With a GC-content of 41.4% and a total length of 2.7 Gb, this assembly is comparable to other high-quality genomes (Groenen et al., 2013; Zimin et al., 2009). BUSCO and CEGMA gene scans indicate near-completeness of core genes in the assembly, and support the conclusion that we have largely reconstructed the entire genome. Only 122 scaffolds are needed to cover the chromosomes of the B. taurus genome almost completely (Figure 1), including the 34 largest scaffolds representing 50% of the whole genome. Some of these largest scaffolds completely match single B. taurus chromosomes, e.g. chromosome 25. Other B. taurus chromosomes are represented by only 2-3 pieces in our scaffolds, e.g. chromosomes 12 and 24. Based on this comparison, we infer that our assembly represents a nearly complete genome of P. phocoena, and that our largest scaffolds are nearly complete chromosomes. The CIRCOS plot also illustrates chromosomal rearrangements between domestic cattle and the harbour porpoise, two species that diverged approximately 60 Myr ago within the Cetartiodactyla (Gatesy et al., 2013). Such chromosomal rearrangements have been observed several times among distinct lineages of Cetartiodactyla (Avila et al., 2015; Kulemzina et al., 2009, 2011; Pauciullo et al., 2014), e.g. in comparisons between camel, pig and domestic cattle (Balmus et al., 2007).
The number of annotated genes (22,154) is comparable to that of other published cetacean genomes: 21,459 in the bottlenose dolphin (Lindblad-Toh et al., 2011), 20,605 in the minke whale (Yim et al., 2013), and 22,711 in the grey whale (DeWoody et al., 2017). They appear to broadly span key functional gene categories, e.g. biological processes, cellular components and molecular function, both across the annotated GO terms and in the INTERPROSCAN analysis. With this information, we can directly search for known or key genes for further investigations, e.g. of selection or adaptive traits.
The harbour porpoise is estimated to have split from its closest relative ca. 5 Myr ago (Gatesy et al., 2013). Interestingly, our Bayesian skyline plot (Figure 3) coincides with this date, showing the start of a population expansion around that time point. Around 4.5 Myr ago an expansion occurred, during which time the North Atlantic is known to have cooled, leading to an extinction of 65% of the marine organisms (Stanley, 1995). The harbour porpoise is well known in subarctic regions, and some populations (e.g. Greenland) occur in areas which freeze to a large extent during winter (Tolley & Rosel, 2006). Therefore, an extinction of other marine species during a cold-water period does not preclude that the harbour porpoise could increase its population size and expand through the Atlantic. During the last interglacial period, the Eemian, the inferred Ne remained relatively high at around 50,000 individuals before dropping dramatically with the beginning of the last glacial period 100 kya. When comparing this pattern to the demographic history of other cetaceans, it is most similar to that of the bottlenose dolphin (Tursiops truncatus), a related species with a similar North Atlantic distribution (Brüniche-Olsen et al., n.d.; Foote et al., 2016; Yim et al., 2013; Zhou et al., 2013). The newly forming sea ice areas, around 400 kya, could have led to fragmentation of different populations, and therefore to a drop in regional total effective population size with regard to our sample. The low population size inferred for the present day would fit the history of the Baltic Sea and the population status of P. phocoena (Johannesson et al., 2011; Johannesson & André, 2006; Ukkonen et al., 2014).
Specifically, there is strong evidence for a Western Baltic/Kattegat (i.e., Beltsea) population separated from the North Sea/North Atlantic (Hammond et al., 2013; Lah et al., 2016), which currently counts approximately 40,000 animals (Benke et al., 2014; Scheidat et al., 2008; Viquerat et al., 2014). Our sequenced specimen was assigned with high likelihood to this Beltsea population by mtDNA analysis (exhibiting haplotype PHO 7; cf. Tiedemann et al., 1996; Wiemann et al., 2010).
In this study we present the first whole genome assembly and annotation of the harbour porpoise, at this point the most complete assembly for the family Phocoenidae. This genome adds to the cetacean genome collection by supplying important resources for further investigation within the Odontoceti as well as outside the Cetacea. It will provide an invaluable resource for further genetic studies within the harbour porpoise itself, both for whole-genome investigations into population structure and for identifying key genes associated with local adaptation. This genome also represents a crucial genetic resource for further investigation of the population genetics and phylogeny of other species of the Phocoenidae, including the currently rarest marine mammal, the almost extinct vaquita (Phocoena sinus) (Taylor et al., 2017), and is hence especially important for conservation efforts.
Data Accessibility
The genome assembly, final genome sequence and the draft annotation are deposited at NCBI under BioProject-ID: PRJNA417595 and BioSample-ID: SAMN08000480.
Author Contributions
R.T. and L.L. designed the study; A.R. provided the sample and associated biological information. L.L. performed molecular lab work; S.H. performed the initial de novo assembly; M.A. executed all genome annotations and analyses; M.A., S.H., and A.B.D. analyzed and interpreted the results; M.A. wrote the manuscript. All authors edited and approved the final manuscript.
Acknowledgments
Financial support came from the Bundesamt für Naturschutz (FKZ # 3514824600), as part of a larger study of population genomics. We thank Prof. Dr. Michael Hofreiter for providing access to the Illumina NextSeq platform. Additional support came from the University of Potsdam. Large-scale computational effort was made possible by computing resources provided by the department of Genetics at University of Potsdam and the High Performance Computing Cluster Orson2, managed by ZIM (Zentrum für Informationstechnologie und Medienmanagement) at the University of Potsdam.