Abstract
Adopting a new diet is a significant evolutionary change and can profoundly affect an animal’s physiology, biochemistry, ecology, and genome. To study this evolutionary transition, we investigated the physiology and genomics of digestion of a derived herbivorous fish, the monkeyface prickleback (Cebidichthys violaceus). We sequenced and assembled its genome and digestive transcriptome and revealed the molecular changes related to important dietary enzymes, finding abundant evidence for adaptation at the molecular level. In this species, two gene families experienced expansion in copy number and adaptive amino acid substitutions. These families, amylase and bile salt activated lipase, are involved in the digestion of carbohydrates and lipids, respectively. Both show elevated levels of gene expression and increased enzyme activity. Because carbohydrates are abundant in the prickleback’s diet and lipids are rare, these findings suggest that such dietary specialization involves both exploiting abundant resources and scavenging rare ones, especially essential nutrients such as essential fatty acids.
Main
Populations exposed to new environments often experience strong natural selection (e.g., Herrel et al. 2008). Comparing closely related species has been an effective perspective in pinpointing changes that drive adaptation (Dasmahapatra et al. 2012; Lamichhaney et al. 2015). In its most powerful incarnation, the comparative method links variation at the genetic level to molecular phenotypes that in turn change how whole organisms interact with their environments, revealing for example adaptation to new abiotic factors (Protas et al. 2006; Chakraborty and Fry 2015; Peichel and Marques 2017; Tong et al. 2017; Chen et al. 2018), changes in diet (Harris and Munshi-South 2017; Hsieh et al. 2017; Zepeda-Mendoza et al. 2018), and exposure to pollution (Vega-Retter et al. 2018). In animals, digestion is an ideal model phenotype because it is central to fitness, is understood in many species at genetic, molecular, biochemical, and physiological levels, and is a trait that shows abundant variation throughout animal evolution (Karasov and Martinez del Rio 2007). While studies of animal digestion have yielded insights into the physiology and biochemistry of adaptation, untangling its genetic basis is contingent on the availability and quality of genomic resources, which have traditionally been lacking in nonmodel species. Advances in genome technology have improved the quality and decreased the cost of obtaining sequences, stimulating the production of genome assemblies and catalyzing genetic discoveries for a diverse array of non-model organisms.
Here we describe the genetic changes accompanying acquisition of an herbivorous diet in the marine intertidal fish, Cebidichthys violaceus. To do this we generated a physiological genomics dataset for this non-model species, including a highly contiguous and complete genome, the transcriptomes of digestive and hepatic tissues, and digestive enzyme activity levels. We generated a reference quality genome for C. violaceus with an N50 of 6.7 Mb, placing it among the most contiguous teleost assemblies (Fig. 1). The ecological and evolutionary positions of C. violaceus make these resources particularly suited to unraveling the acquisition of herbivory. Members of the family Stichaeidae, including C. violaceus, have independently invaded the intertidal zone multiple times, and in two cases, began consuming significant amounts of algae (Fig. 2; Kim et al. 2014), a diet that is low in protein and lipids and rich in fibrous cell walls (Horn et al. 1986; German et al. 2015). In addition to this convergent acquisition of herbivory, these intertidal invaders experience extremes of environment, including osmotic variability, wide temperature fluctuations, and aerial exposure.
a, Illustration of the monkeyface prickleback (Cebidichthys violaceus). b, A maximum likelihood (ML) tree was constructed with 1,000 bootstrap replicates in PhyML v3.1 based on the lowest average gaps present in our ortholog cluster alignments of concatenated 30 protein coding genes from all 13 taxa. C. violaceus (highlighted in a gray box) along with 12 fish taxa with sequenced genomes from Ensembl (Release 91) were used for our phylogenetic analyses. The Ensembl taxa included: Poecilia formosa, Xiphophorus maculatus, Oryzias latipes, Oreochromis niloticus, Gasterosteus aculeatus, Tetraodon nigroviridis, Takifugu rubripes, Gadus morhua, Astyanax mexicanus, Danio rerio, Lepisosteus oculatus, and Latimeria chalumnae. Genome size and contig N50, and BUSCO v3 complete genes identified out of 2586 BUSCO groups are represented for all taxa.
Bayesian posterior probabilities are indicated on nodes. Cebidichthys violaceus is bolded, and photos of C. violaceus and other studied taxa are shown with their digestive systems beneath their bodies. Note the differences in gut size. H=herbivory, O=omnivory, C=carnivory. Evolution of herbivory (— — — —) and omnivory (…………) are shown. Numbers in parentheses show number of taxa evaluated at that branch. Boxes highlight alleged families or subfamilies within the polyphyletic family Stichaeidae, with Cebidichthyidae (top), Xiphisterinae (middle), and Alectriinae (bottom) all highlighted. Hindgut short chain fatty acid (SCFA) concentrations are mean ± standard deviation, and were compared with ANOVA (F3,33 = 127.92; P < 0.001). SCFA data from German et al. (2015).
Herbivory is poorly represented among high quality teleost genomes (Fig. 1; Supplementary Table S1). Because teleosts are so speciose, they represent a large number of independent acquisitions of herbivory, even though only 5% of teleosts are considered nominally herbivorous (cf. 25% for mammals; Choat and Clements 1998). Even the term “herbivorous” is controversial amongst ichthyologists (Clements et al. 2017). Among nominally herbivorous fishes, most do not specialize on algal thalli as C. violaceus does. Moreover, just within the Stichaeidae, it is clear that C. violaceus digests red and green algae with the aid of microbial symbionts in its hindgut, as evidenced by elevated levels of short chain fatty acids, or SCFAs, in this gut region. This microbial symbiosis is analogous to that of other highly specialized vertebrate herbivores like lagomorphs or rodents, whereas the other stichaeid herbivore, X. mucosus, is less reliant on such microbes (Fig. 2; German et al. 2015). Thus C. violaceus offers a unique opportunity to study extremes of dietary specialization. Although herbivorous and omnivorous animals tend to have elevated amylase activities in their guts (Perry et al. 2007; German et al. 2010; Kohl et al. 2011; Axelsson et al. 2013; Boehlke et al. 2015; German et al. 2015), and achieve these activities via gene duplications or elevated expression of fewer amylase genes (German et al. 2016), the appreciation of bile-salt activated lipase in the digestive process of animals eating a low-lipid, high-fiber food is more recent (German et al. 2004; German et al. 2015; Leigh et al. 2018). Here we describe extensive structural variation at the gene level and adaptive variation at the amino acid level in both of these gene families, suggesting multiple mutational mechanisms for the acquisition of a novel derived dietary physiology in C. violaceus.
Results and Discussion
Genome Assembly, Quality, and Size
We generated ~30 Gb of long reads (~37X based on a genome size of 792 Mb for C. violaceus; Hinegardner and Rosen, 1972) using Pacific Biosciences (PacBio) Single Molecule Real Time sequencing, and 84.5 Gb (~107X) of paired-end Illumina reads. A draft assembly using the Illumina data alone was highly fragmented (N50 = 2760 bp), consistent with most published fish genome assemblies [Zerbino et al. 2017; www.ensembl.org]. Using the Illumina contigs in concert with long reads yielded a markedly more contiguous hybrid assembly (N50 = 2.21 Mb). An assembly of the long reads alone yielded a similarly contiguous genome (N50 = 2.45 Mb). Finally, merging the hybrid assembly with the long-read-only assembly (Chakraborty et al. 2016; see Methods) yielded a highly contiguous assembly (N50 = 6.69 Mb), ranking it among the most contiguous teleost genomes, and, to our knowledge, the most contiguous among herbivorous fishes. Assessment of the benchmarking universal single-copy orthologs (BUSCOs) (Waterhouse et al. 2017) shows that the completeness of our assembly (97%) is comparable to or even better than that of the model reference fish genomes (86.6-98.3%) (Fig. 1B). Using JELLYFISH, we estimated the genome size, based on an average of four k-mer size counts (25, 27, 29, 31), at 656,598,967 base pairs with a standard deviation of 4,138,853 base pairs (Supplementary Figure S5; Supplementary Table S4), which is close to the original genome size estimate of 792 Mb based on the c-value (Hinegardner and Rosen, 1972), and to the sizes of other fish genomes (Fig. 1B; Zerbino et al. 2017).
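The fold-coverage values quoted above follow from dividing the total sequenced bases by the assumed genome size. The short Python sketch below reproduces that arithmetic using the yields and the 792 Mb genome size given in the text; the function and variable names are illustrative only and are not part of the original analysis.

```python
def fold_coverage(total_bases, genome_size_bp):
    """Return sequencing depth as total bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size_bp


# Values taken from the text: c-value based genome size and read yields.
GENOME_SIZE_BP = 792e6
pacbio_depth = fold_coverage(30e9, GENOME_SIZE_BP)      # PacBio long reads
illumina_depth = fold_coverage(84.5e9, GENOME_SIZE_BP)  # Illumina paired-end reads

print(f"PacBio: {pacbio_depth:.1f}X, Illumina: {illumina_depth:.1f}X")
```

Because the long-read yield is itself reported as an approximate figure (~30 Gb), the computed PacBio depth is only expected to agree with the quoted ~37X to within rounding.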
Consistent with phylogenetic considerations, the genome of C. violaceus shares the most similarity with that of stickleback (Supplementary Figure S16). When we compare our assembled genome to those of Lepisosteus oculatus, Danio rerio, and O. latipes, large sections of syntenic regions are conserved (Supplementary Figure S16-19). Nevertheless, these comparisons show the relative completeness of our draft genome in comparison to these model systems.
Physiological genomics of digestive enzymes
The activity levels of digestive enzymes reveal what substrates are readily digested in an animal’s digestive tract, highlighting the enzyme genes that are potential targets of selection for efficient digestion (Karasov and Martínez del Rio 2007). Within the field of Nutritional Physiology, the Adaptive Modulation Hypothesis predicts a match between the amount of an ingested substrate (e.g., starch) and digestive enzyme expression and activity levels to digest such substrates (e.g., amylase) based on economic principles (Karasov and Martinez del Rio 2007). Indeed, pancreatic amylase activity tends to be elevated in the guts of herbivores and omnivores in comparison to carnivores (especially in prickleback fishes; Chan et al. 2004; German et al. 2016), matching the higher intake, and the importance, of soluble carbohydrates to these animals (Horn et al. 2006; Kohl et al. 2011; German et al. 2010; Axelsson et al. 2013; Boehlke et al. 2015; German et al. 2015; German et al. 2016). Our assembly reveals three tandem pancreatic amylase genes in the C. violaceus genome: two copies of amy2a and one copy of amy2b (Fig. 4). This arrangement of three tandem amy genes differs from that of other pricklebacks (and indeed, most other fishes for which genetic data are available), which tend to have one or two identical copies of amy2 (Fig. 4; German et al. 2016). The two amy2a copies are supported by three spanning reads, emphasizing the correct assembly of the amy2a tandem duplicates (Supplementary Figure S12). Each amylase gene is preceded by a 4.3 kb DNA element encoding a transposase (Fig. 4, Supplementary Figure S13), hinting at a role of this transposable element (TE) in gene duplications in this region (Feschotte and Pritham 2007; McVean 2010; Pantzartzi et al. 2018). Additionally, the amy2b gene has a 2,025 bp LINE element inside its second intron. All three amylase gene copies possess a ~470 bp fragment of a long terminal repeat retrotransposon (Supplementary Figure S13).
Insertion of the TEs proximal to the transcription start site and within first introns could modulate the expression of the amylase gene copies because both of these regions are typically enriched with cis-regulatory elements [Feschotte, 2008]. When testing all 11 branches for the seven prickleback taxa, only one branch was under episodic diversifying selection (C. violaceus, amy2b) with a significant uncorrected p-value of 0.0044 (Figure 4B). We do not see this pattern of positive selection in any of the other branches with significant p-values. We observed three sites with episodic positive selection with a p-value under 0.05 (sites: 41, 256, and 279; Figure 4D). How the AMY2A and AMY2B proteins differ in function requires further investigation, but they have different isoelectric points (7.86 vs. 8.62; German et al. 2016), which hints that they may be more active in different parts of the gut, and indeed, the transcriptomic data show that amy2a is expressed at a fairly constant level throughout the proximal GI tract (including pancreatic tissue in the pyloric ceca), whereas amy2b is expressed mostly in the mid intestine (Fig. 3).
We used brain, gill, gonads (testes), heart, liver, pyloric caeca (PC), proximal intestine (PI), middle intestine (MI), and spleen tissues from C. violaceus to represent the transcriptome. Only gene expression profiles of liver, PC, PI, MI, and spleen are represented for Glycolytic, Lipid metabolism/Gluconeogenesis, Ketone Degradation, Glucosidases, Proteases, and Lipases. Low to high expression is shown on a gradient scale from violet to yellow, respectively. Expression is measured in Fragments Per Kilobase of transcript per Million mapped reads (FPKM). a, Three heatmaps were generated for transcripts that belong to glycolytic pathways (blue box), transcripts for enzymes associated with gastrointestinal fermentation based on the study of marine teleost fishes by Willmott et al. (2005) (green box), and ketone degradation pathway transcripts (red/pink box). b, Three heatmaps of digestive enzymes were generated, which include carbohydrases (alpha-glucosidases and beta-glucosidases, blue boxes), proteases (green box), and lipases (red/pink box).
a, Total gut standardized enzymatic activity for Cebidichthys violaceus (Cv) and other intertidal stichaeid species: Phytichthys chirus (Pc), Xiphister mucosus (Xm), Xiphister atropurpureus (Xa), and Anoplarchus purpurescens (Ap). H = herbivory, O = omnivory, and C = carnivory. Values are mean ± standard error with n = 6 for Cv, Xm, Xa, and Ap; and n = 9 for Pc (German et al. 2015). Interspecific comparisons were made for amylase with ANOVA, where circles that share a letter are not significantly different. b, Synteny map for amylase genes from Danio rerio, Oryzias latipes, Gasterosteus aculeatus, and C. violaceus. * denotes that D. rerio also has amylase loci present on chromosome 17. c, Phylogenetic tree from an adaptive branch-site Random Effects Likelihood (aBSREL) test for episodic diversification, constructed for amylase genes from C. violaceus (amy2a and amy2b) and other intertidal stichaeid species. ω is the ratio of nonsynonymous to synonymous substitutions. The color gradient represents the magnitude of the corresponding ω. Branches thicker than the other branches have a P < 0.05 (corrected for multiple testing) to reject the null hypothesis of all ω on that branch (neutral or negative selection only). A thick branch is considered to have experienced diversifying positive selection. d, The output of the Mixed Effects Model of Evolution (MEME) to detect episodic positive/diversifying selection at sites. β+ is the nonsynonymous substitution rate at a site for the positive/neutral evolution throughout the sequence of the gene. ** indicates that the positive/diversifying site is statistically significant with a p-value < 0.01 and * indicates a p-value < 0.05.
Interestingly, amylolytic activities in the guts of C. violaceus are similar to those in the two species of Xiphister (Figs. 2 and 4a), yet the Xiphister taxa only have two copies of amy2a (German et al. 2016), and C. violaceus and X. mucosus digest algal starch with similar efficiencies (Horn et al. 1986). Thus, the phenotype of elevated amylase activity can be achieved via different mechanisms (increased gene copy number leading to increased expression of the genes vs increased expression of fewer genes) with similar performance outcomes at the whole animal level (Horn et al. 1986; German et al. 2016).
Herbivores have the challenge of consuming a food that is simultaneously low in lipid (Neighbors and Horn 1991), and high in fiber (Painter 1983), and fiber binds to lipid, impeding its digestion (German et al. 1996). Thus, Bile Salt Activated Lipase (BSAL; Murray et al. 2003) represents another important digestive enzyme for herbivorous fishes because lipids (especially essential fatty acids) are essential to survival. The importance of BSAL in herbivores has been validated in prickleback fishes (German et al. 2004; German et al. 2015), as well as in Danio rerio fed a high-fiber, low-lipid diet analogous to an herbivorous diet in the laboratory, which elicited elevated lipase activities in this species (Leigh et al. 2018). Thus, it appears that C. violaceus, the algal-consuming Xiphister taxa, and other herbivorous species (German et al. 2004; Leigh et al. 2018) invest in lipase expression to ensure lipid digestion from their algal diet, consistent with what is known as the Nutrient Balancing Hypothesis, or that animals can invest in the synthesis of digestive enzymes to acquire limiting nutrients (Clissold et al. 2010), in this case, lipids.
In the C. violaceus genome, we identified four tandem copies of Bile Salt Activated Lipase (BSAL) on contig 445 (Figure 5B). Two of these (BSAL-1b and BSAL-1c) are more similar to each other in intron/exon arrangement than to BSAL-1a and BSAL-2, suggesting that the BSAL-1b and BSAL-1c copies originated from more recent gene duplications (Supplementary Figure S14). Interestingly, BSAL-1a and BSAL-2 possess two different LINE elements in their first and last introns, respectively, contributing to the structural diversification of the two gene copies (Supplementary Figure S14). In addition, there is evidence of recombination in BSAL, with one breakpoint (p-value of 0.01) showing significant topological incongruence at site 534. To examine whether BSAL-2 diverged under selection, we estimated selection with 11 taxa and found three branches out of 19 to be under episodic diversifying selection: C. violaceus BSAL-2 (uncorrected p-value 0.0008), Node 9 (uncorrected p-value 0.00001; which leads to the clade containing BSAL genes of Phytichthys chirus and the two species of Xiphister), and X. mucosus BSAL-2 (uncorrected p-value 0.00001) (Fig. 5C). Interestingly, these all lead to taxa that consume algae, and hence, a lower lipid diet. In the lipase genes, there are 14 sites under episodic positive selection (p-value < 0.05): sites 92, 94, 115, 139, 144, 151, 156, 160, 356, 429, 436, 443, 460, and 475 (Figure 5D). Consistent with their functional divergence in protein sequence, BSAL-1 and BSAL-2 show different tissue-specific expression patterns: BSAL-1a is expressed ubiquitously, whereas BSAL-2 is primarily expressed in pyloric caeca, spleen, and intestine (Fig. 3B). Furthermore, even though the protein sequences of the BSAL-1 copies are similar (pairwise distance of 0.014-0.054; Poisson corrected), BSAL-1b is expressed largely in the gills and the heart, providing evidence for subfunctionalization in the BSAL-1 gene copies. Whereas C. violaceus has four copies of BSAL in its genome, other fishes for which the genome has been sequenced, and other stichaeids (based on transcriptomes of their relevant tissues), appear to have two BSAL genes (Fig. 5). These extra lipase genes may have real consequences for in vivo lipid digestibility, as C. violaceus digests algal lipid with consistent efficiency across a range of lipid concentrations, whereas X. mucosus shows decreasing lipid digestibility with decreasing lipid content in its diet (Horn et al. 1986). Humans who evolutionarily inhabited high latitudes in the Northern Hemisphere are known to consume lipid-rich diets, and genomic scans of some of these populations (Nganasans and Yakuts) show that they have experienced selection on lipases and proteins involved in lipid metabolism (Hsieh et al. 2017). Much like amylase activities being elevated in herbivores and omnivores, selection on lipid digestion and metabolism in animals (in this case, humans) consuming high-lipid diets concurs with the Adaptive Modulation Hypothesis (Karasov and Martinez del Rio 2007), which is the opposite of the Nutrient Balancing Hypothesis (Clissold et al. 2010) that we propose for selection on lipase in C. violaceus. Thus, specific pathways can be selected upon for different reasons: some to ensure adequate digestion and metabolism of an abundant resource (Adaptive Modulation Hypothesis), others for the acquisition of a limiting one (Nutrient Balancing Hypothesis).
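The Poisson-corrected pairwise distances reported above for the BSAL-1 copies follow the standard correction d = -ln(1 - p), where p is the proportion of aligned amino acid sites that differ. The Python sketch below illustrates that formula on two short made-up peptide strings; the sequences, the gap handling, and the function name are assumptions for illustration and are not the BSAL alignment or the software used in this study.

```python
import math


def poisson_distance(seq_a, seq_b, gap="-"):
    """Poisson-corrected amino acid distance, d = -ln(1 - p).

    p is the fraction of ungapped aligned positions at which the two
    sequences differ. Positions with a gap in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    p = differing / compared
    if p >= 1.0:
        raise ValueError("distance is undefined when every site differs")
    return -math.log(1.0 - p)


# Hypothetical aligned peptides differing at one of ten sites (p = 0.1).
toy_a = "MKLLVLSLCL"
toy_b = "MKLLVLSLCV"
toy_distance = poisson_distance(toy_a, toy_b)
print(f"Poisson-corrected distance: {toy_distance:.4f}")
```

For small p the corrected distance is close to p itself, which is why the reported range of 0.014-0.054 can be read as roughly 1-5% amino acid divergence among the BSAL-1 copies.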
a, Total gut standardized enzymatic activity for Cebidichthys violaceus (Cv) and other intertidal stichaeid species: Phytichthys chirus (Pc), Xiphister mucosus (Xm), Xiphister atropurpureus (Xa), and Anoplarchus purpurescens (Ap). H = herbivory, O = omnivory, and C = carnivory. Values are mean ± standard error with n = 6 for Cv, Xm, Xa, and Ap; and n = 9 for Pc (German et al. 2015). Interspecific comparisons were made for BSAL with ANOVA, where circles that share a letter are not significantly different. b, Synteny map for BSAL genes from Danio rerio, Oryzias latipes, Gasterosteus aculeatus, and C. violaceus. c, Phylogenetic tree from an adaptive branch-site Random Effects Likelihood (aBSREL) test for episodic diversification, constructed for BSAL genes from C. violaceus and other intertidal stichaeid species. ω is the ratio of nonsynonymous to synonymous substitutions. The color gradient represents the magnitude of the corresponding ω. Branches thicker than the other branches have a P < 0.05 (corrected for multiple testing) to reject the null hypothesis of all ω on that branch (neutral or negative selection only). A thick branch is considered to have experienced diversifying positive selection. d, The output of the Mixed Effects Model of Evolution (MEME) to detect episodic positive/diversifying selection at sites. β+ is the nonsynonymous substitution rate at a site for the positive/neutral evolution throughout the sequence of the gene. ** indicates that the positive/diversifying site is statistically significant with a p-value < 0.01 and * indicates a p-value < 0.05.
Transcriptomics
We constructed a transcriptomics dataset for nine tissues (Supplemental Figures S8-S11) in C. violaceus, focusing here on the liver, pyloric ceca (which includes pancreatic tissue; Kim et al. 2014; German et al. 2016), mid intestine, and spleen (Fig. 3). In these four tissues, we identified genes associated with metabolism (glycolysis, ketone metabolism, and fermentation) (Fig. 3A), and digestion (Fig. 3B). Like herbivorous mammals, C. violaceus has an active microbial community in their hindgut that ferments dietary substrates to short chain fatty acids (SCFAs)(Fig. 2; German et al. 2015). These SCFA are then absorbed by the host and used to generate ATP in various tissues (Bergman 1990; Karasov and Martinez del Rio 2007). Because SCFAs are ketones, animals reliant on microbial fermentation in their hindguts also have active ketotic pathways in their tissues (Karasov and Martinez del Rio 2007; Willmott et al. 2005). Indeed, genes coding for proteins that are part of ketone synthesis and degradation are upregulated in most tissues in C. violaceus, but especially in the liver (Fig. 3A), consistent with the digestive strategy utilizing microbial symbionts in this herbivorous fish (German et al. 2015). In addition to amylase and lipase (discussed above), there are clear expression patterns for carbohydrases, proteases, and lipases, confirming the suite of enzymes necessary to digest a range of nutrients (Fig. 3B). Interestingly, C. violaceus appears to express three separate chymotrypsin genes and only one trypsin gene. This may be consistent with their consumption of algal material, as carnivores (e.g., salmon) may invest more in trypsin expression (Rungruangsak-Torrissen et al. 2006), whereas herbivores (e.g., grass carp) may express more chymotrypsin (Gioda et al. 2017).
Interestingly, we found that the pyloric cecal tissue of C. violaceus, which is frequently recognized as “pancreatic” (e.g., German et al. 2016) because it is sheathed in acinar cells (Kim et al. 2014) and shows elevated activity levels of pancreatic enzymes (German et al. 2015), only had two differentially expressed genes in comparison to the mid intestine (Supplementary Figures S9-S11), which is known to be the highly absorptive region of the fish gut (Buddington et al. 1987; Tengjaroenkul et al. 2000). The pyloric ceca have been documented to be absorptive (Buddington and Diamond 1987), meaning they can be functionally similar to the mid intestine, but the mid intestine is rarely recognized as also having pancreatic function. The pancreatic tissue of fish does not generally form a distinct organ, as it does in mammals. In fishes, it tends to be embedded in the liver (forming a hepatopancreas) or diffuse along the intestine (Clements and Raubenheimer 2006). Our transcriptomic data show not only a shared absorptive function of the pyloric caeca, but also that the acinar cells are distributed along the intestine and not restricted to the pyloric cecal region (Supplementary Figures S10-S11). Indeed, the mid intestine shows expression of numerous pancreatic enzymes, the activities of which are detectable in mid intestine tissue of prickleback fishes (German et al. 2015). Thus, our transcriptomic and biochemical data strongly suggest that the pyloric ceca are absorptive (as suggested by Buddington and Diamond 1987) and that pancreatic tissue is diffuse along the intestine in prickleback fishes.
Conclusions
Our genomic efforts produced a highly contiguous fish genome, and because of the rich history of investigation into the nutritional physiology of C. violaceus and other stichaeid fishes, we were able to analyze our genomic data in the context of digestive physiology. Our results are unique in showing that both the Adaptive Modulation Hypothesis (Karasov and Martinez del Rio 2007) and the Nutrient Balancing Hypothesis (Clissold et al. 2010) can have genetic underpinnings within the same organism. This powerful physiological genomics approach will allow us to tease apart more aspects of dietary adaptation in C. violaceus moving forward and to provide a model for nutritional physiological research. Given that C. violaceus is an important member of Marine Protected Areas on the west coast of the United States, and is targeted for aquaculture in northern California where it is a delicacy, our data will also have applications for conservation and better culturing techniques.
METHODS
Collection and Preparation
One individual of Cebidichthys violaceus with a standard length of 156 mm was collected in May 2015 by hand during low tide from San Simeon (35.6525N, 121.2417W), California. The individual was euthanized in MS-222 (1 g/L), dissected to remove internal organs, decapitated, and preserved in liquid nitrogen. We used 1.21 g of skin and muscle tissue to extract genomic DNA using the Genomic DNA and RNA purification kit provided by Macherey-Nagel (Düren, Germany) with three NucleoBond® AXG columns. After extraction, the DNA samples were sheared with 1.5 blunt end needles (Jensen Global, Santa Barbara, CA, USA) and run through pulsed-field gel electrophoresis for 16 hours in order to separate large DNA molecules. Genomic DNA was quantified with a spectrophotometer (Synergy H1 Hybrid Reader, BioTek Instruments Inc., Winooski, VT) and a Qubit® fluorometer. We used two high-throughput sequencing platforms, Pacific Biosciences (PacBio) and Illumina. For PacBio sequencing, genomic DNA was size selected with BluePippin™ with a 15 kb size cutoff, and 40 SMRT cells were sequenced with the PacBio RS II. In addition, from the same gDNA extraction, a multiplex gDNA-Seq Illumina sequencing library was prepared from size-selected fragments ranging from 500 to 700 bp and sequenced on two lanes of an Illumina HiSeq 2500, yielding short reads (100 bp paired-end). All genomic sequencing was completed at the University of California, Irvine (UCI) Genomics High-Throughput Facility (GHTF).
Illumina and Pacbio Hybrid Assembly
We used multiple bioinformatic assembly programs, in the following order, to generate our final assembly of the C. violaceus genome (Supplementary Figure S1; Supplementary Table S2). All computational and bioinformatics analyses were conducted on the High Performance Computing (HPC) Cluster located at the University of California, Irvine. Sequence data generated from two lanes of Illumina HiSeq 2500 were concatenated and raw sequence reads were assembled with platanus v1.2.1 (Kajitani et al. 2014), which accounts for heterozygous diploid sequence data. The parameters used for platanus were -m 256 (memory) and -t 48 (threads) for this initial assembly. Afterwards, contigs assembled by platanus and reads from 40 SMRT cells of PacBio sequencing were assembled with the hybrid assembler dbg2olc v1.0 (Ye et al. 2014). We used the following parameters in dbg2olc: k 17 KmerCovTh 2 MinOverlap 20 AdaptiveTh 0.01 LD1 0 and RemoveChimera 1. Without Illumina sequence reads, we also conducted a PacBio-reads-only assembly with Falcon (https://github.com/PacificBiosciences/FALCON) in order to assemble PacBio reads into contiguous sequences. The parameters we used were input type = raw, length_cutoff = 4000, length_cutoff_pr = 8000, with cluster settings of 32, 16, 32, 8, 64, and 32 cores and a concurrency setting of 32 jobs; the remaining parameters were left at their defaults. After our falcon assembly, we used the outputs from falcon and dbg2olc as input for quickmerge v1.0 (Chakraborty et al. 2016), a meta-assembler and assembly gap filler developed for long molecule-based assemblies. Several different parameter combinations in quickmerge v1.0 were tested, as suggested by the authors, until an optimal assembly was obtained with HCO 10 C 3.0 -l 2400000 -ml 5000. Afterwards, we polished our assembly with two rounds of QUIVER (Chin et al. 2013) and finally with PILON v1.16 (Walker et al. 2014) to make improvements to the final genome assembly.
N25, N50, and N75 were estimated from our final genome assembly of C. violaceus with a perl script (Joseph Fass - http://bioinformatics.ucdavis.edu). We then processed the final genome assembly through repeatmasker v.4.0.6 (Smit et al. 2015) to mask repetitive elements with the parameter -species teleostei.
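For readers unfamiliar with these contiguity statistics, the Python sketch below shows one common way to compute Nx values from a list of contig lengths: contigs are sorted from longest to shortest, and Nx is the length of the contig at which the running total first reaches x% of the assembly size. This is a minimal illustration on made-up lengths, not the perl script used in this study.

```python
def nx(contig_lengths, fraction):
    """Return the Nx statistic (e.g. fraction=0.5 for N50).

    Nx is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least `fraction` of the
    total assembly length.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(contig_lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= threshold:
            return length
    raise ValueError("no contigs supplied")


# Hypothetical contig lengths (arbitrary units) summing to 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
n25 = nx(toy_contigs, 0.25)
n50 = nx(toy_contigs, 0.50)
n75 = nx(toy_contigs, 0.75)
print(n25, n50, n75)
```

In practice the contig lengths would be read from the assembly FASTA file; that step is omitted here to keep the sketch self-contained.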
Genome Size Estimation and Quantitative Measure of Genome Assembly
The c-value has been estimated for Cebidichthys violaceus (Hinegardner and Rosen, 1972) at 0.81. Based on this c-value, the estimate of the genome size is ~792 Mb. In addition, we estimated the genome size using only Illumina sequences with JELLYFISH v2.2.0 (Marçais and Kingsford, 2011). We selected multiple k-mer sizes (25, 27, 29, 31) for counting and generating a histogram of the k-mer frequencies. We used a perl script (written by Joseph Ryan) to estimate genome size based on k-mer sizes and peak values determined from histograms generated in jellyfish.
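Both genome size estimates described above reduce to simple arithmetic, sketched below in Python. The first function assumes the c-value is expressed in picograms and uses the standard conversion of roughly 978 Mb per pg of DNA. The second uses a widely used simplification for k-mer histograms (total k-mers divided by the depth of the main peak, ignoring low-depth k-mers that are likely sequencing errors); it is an assumption-laden illustration on a toy histogram and is not the perl script used in this study.

```python
PG_TO_MB = 978.0  # approximate megabases of DNA per picogram


def cvalue_to_mb(c_value_pg):
    """Convert a c-value in picograms to genome size in megabases."""
    return c_value_pg * PG_TO_MB


def kmer_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    `histogram` maps k-mer depth to the number of distinct k-mers seen
    at that depth. Depths below `min_depth` are treated as error k-mers
    and excluded (an illustrative cutoff, not a value from the study).
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)  # depth with the most distinct k-mers
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth


# c-value reported in the text.
cvalue_estimate_mb = cvalue_to_mb(0.81)

# Hypothetical histogram: an error peak at depths 1-2 and a main peak at depth 20.
toy_histogram = {1: 1000, 2: 100, 18: 50, 19: 200, 20: 400, 21: 200, 22: 50}
toy_genome_size = kmer_genome_size(toy_histogram)
print(round(cvalue_estimate_mb, 2), toy_genome_size)
```

With a c-value of 0.81, the conversion gives approximately 792 Mb, matching the estimate quoted in the text.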
We used tandem repeats finder (trf v4.07b; Benson, 1999) to identify tandem repeats throughout the unmasked genome. We used the following parameters in trf 1 1 2 80 5 200 2000 -d -h to identify repeats. Once the largest repeats were identified, we used period size of the repeats multiplied by the number of copies of the repeat to generate the largest fragments. This method was used to identify repetitive regions which can possibly represent centromere or telomere regions of the C. violaceus genome.
We used busco v3 (Simão et al. 2015) with the vertebrata and Actinopterygii gene set [consisting of 2,586 Benchmarking Universal Single-Copy Orthologs (BUSCOs)] to estimate the completeness of our C. violaceus genome assembly.
RNA-Seq Tissue Extraction and Sequencing
Multiple individuals identified as C. violaceus were collected during the fall of 2015 for our transcriptomic analyses and annotation of the C. violaceus genome. The following IDs were assigned to the five individuals used for the transcriptomic analyses: CV96, CV97, CV98, CV99, and CV100. We selected nine tissues: brain, gill, gonads (testes), heart, liver, mid intestine, proximal intestine, pyloric caeca, and spleen. We extracted tissues from multiple C. violaceus individuals and preserved the tissue in RNAlater® (Ambion, Austin, TX, USA). Total RNA was extracted using a Trizol protocol, and an Agilent Bioanalyzer 2100 (RNA nano chip; Agilent Technologies) was used to assess the quality of the samples. In order to obtain a sufficient amount of total RNA from our extractions, we used a single or multiple C. violaceus individuals (up to four) to represent a specific tissue. We used the Illumina TruSeq Sample Preparation v2 (Illumina) kit along with AMPure XP beads (Beckman Coulter Inc.) and SuperScript™ III Reverse Transcriptase (Invitrogen) to prepare our tissue samples for Illumina sequencing. The following adaptor indexes were used for our analysis: brain - 6, gill - 2, gonadal tissue - 13, heart - 7, liver - 5, pyloric caeca - 12, proximal intestine - 4, mid intestine - 14, and spleen - 15 (Supplementary Table S3). We size selected fragments that were on average 331 bp. In addition, we used a High Sensitivity Agilent Assay on an Agilent Bioanalyzer 2100, and KAPA qPCR for quantitative analyses before samples were sequenced. Samples were multiplexed to 10 nM in 10 µl and sequenced on two lanes of an Illumina HiSeq 2500 (100 bp paired-end) at UC Irvine’s GHTF.
Transcript Assembly for all Tissues and Annotation
The following pipeline was used to assemble and measure expression of all transcripts from all nine tissue types (Supplementary Figure S2). Prior to assembly, all raw reads were trimmed with trimmomatic v0.35 (Bolger et al. 2014). Afterwards, trimmed reads were normalized using a perl script provided by trinity v r2013-02-16 (Grabherr et al. 2011). Prior to aligning transcriptomic reads to the genome, a bowtie index of the final masked genome assembly was built with bowtie2-build v2.2.7 (Langmead and Salzberg, 2012), and all normalized reads from each tissue type were then mapped to the masked assembly using tophat v2.1.0 (Kim et al. 2013) with the parameters -I 1000 -i 20 -p 4. Afterwards, the aligned reads from each tissue (BAM files) were indexed with samtools v1.3 (Li et al. 2009). Transcripts were then assembled using cufflinks v2.2.1 (Trapnell et al. 2012) with an overlap radius of 1. All assemblies were merged using cuffmerge, and differential expression was then estimated with cuffdiff; both programs are part of the cufflinks package. All differential expression analyses and plots were produced in R (https://www.r-project.org/) using the cummerbund package available from the bioconductor website (https://www.bioconductor.org/). Once all transcripts were assembled, we ran repeatmasker v.4.0.6 with the parameter -species teleostei to mask repetitive elements within our transcriptomes.
All masked transcripts were annotated with the trinotate annotation pipeline (https://trinotate.github.io/), which uses Swiss-Prot (Boeckmann et al. 2003), Pfam (Finn et al. 2013), eggNOG (Powell et al. 2014), Gene Ontology (Ashburner et al. 2000), SignalP (Petersen et al. 2011), and RNAmmer (Lagesen et al. 2007). We also searched our transcripts with blastx against the UniProt database (downloaded on September 26th, 2017) with the following parameters: num_threads 8, evalue 1e-20, and max_target_seqs 1. The blastx output was processed with the trinity analyze_blastPlus_topHit_coverage.pl script to count the number of full-length or nearly full-length transcripts. To provide a robust number of full-length transcripts, we also predicted genes throughout the assembled genome with AUGUSTUS v3.2.1 (Stanke et al. 2006), which uses a generalized hidden Markov model, run without hints and with default parameters. The predicted transcripts were also masked for repetitive elements with repeatmasker.
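The alignment and assembly steps above can be laid out as a sequence of command lines. The sketch below only builds and prints the commands and does not run them. It uses the parameters stated in the text (-I 1000 -i 20 -p 4 for tophat and an overlap radius of 1 for cufflinks). All file and directory names are hypothetical, the output-directory options are an assumption, and the trimming, normalization, cuffmerge, and cuffdiff steps are omitted because their parameters are not given here.

```python
import shlex


def alignment_and_assembly_commands(genome_fasta, index_prefix, reads_1, reads_2, out_dir):
    """Build (but do not run) the per-tissue mapping and assembly commands."""
    # tophat writes its alignments to accepted_hits.bam inside the output directory.
    bam = f"{out_dir}/accepted_hits.bam"
    return [
        # Index the masked genome for bowtie2/tophat.
        ["bowtie2-build", genome_fasta, index_prefix],
        # Map normalized reads with the intron-length and thread settings from the text.
        ["tophat", "-I", "1000", "-i", "20", "-p", "4",
         "-o", out_dir, index_prefix, reads_1, reads_2],
        # Index the resulting BAM file.
        ["samtools", "index", bam],
        # Assemble transcripts with an overlap radius of 1.
        ["cufflinks", "--overlap-radius", "1", "-o", f"{out_dir}/cufflinks", bam],
    ]


# Hypothetical file names for a single tissue (liver).
cmds = alignment_and_assembly_commands(
    "cvio_masked.fasta", "cvio_masked", "liver_R1.fastq", "liver_R2.fastq", "liver_tophat"
)
for cmd in cmds:
    print(shlex.join(cmd))
```

Keeping each command as an argument list makes it straightforward to pass to subprocess.run once the paths are real, and repeating the call per tissue mirrors the per-tissue mapping described above.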
Heatmap of all Tissues and Genes Associated with Diet Specialization
Differentially expressed genes for all tissue types were viewed with a heatmap that was generated with the cummerbund library (Supplementary Figure S3). Candidate genes pertaining to glycolysis, lipid metabolism/gluconeogenesis, ketone degradation, glucosidases (both α and β), proteases, and lipases were identified in the C. violaceus transcriptome by scanning the annotation of cufflinks-assembled transcripts and were used to generate our heatmap.
Identification of Candidate Genes and Copy Number
Amylase and bile salt activated lipase are candidate genes of interest because they break down starch (α-glucans) and dietary lipids, respectively, and we were interested in identifying the copy number of these two genes. Previously published variants of amy2 (amy2a and amy2b; German et al. 2016) from C. violaceus, deposited on NCBI (KT920438 and KT920439), were used as queries and searched against our assembled genome using both mummer v3.23 (Kurtz et al. 2004) and blast (Altschul et al. 1997) to identify gene copies. Afterwards, we used a perl script (Lawrence et al. 2015) to trim the contig to the region containing the amylase gene copies and neighboring loci, which was then viewed with the online version of AUGUSTUS v3.2.3 (Stanke and Morgenstern, 2005). Next, we used amylase sequences from multiple Stichaeid intertidal species that represent a broad spectrum of diet specializations (German et al. 2016; Fig. 1). We selected Anoplarchus purpurescens (carnivore), Dictyosoma burgeri (carnivore), Phytichthys chirus (omnivore), Xiphister atropurpureus (omnivore), and X. mucosus (herbivore; German et al. 2016). Our predicted amylase sequences from C. violaceus and orthologous sequences from the five other intertidal prickleback species were aligned in mega v7.0.26 (Kumar et al. 2016) with muscle (with codons). Selection was then estimated using adaptive Branch Site REL (aBSREL) and Mixed Effects Model of Evolution (MEME), and signatures of recombination were assessed using the Genetic Algorithm for Recombination Detection (GARD), all as part of the datamonkey 2.0 web application (Weaver et al. 2018).
To identify bile salt activated (BSA) lipase in the C. violaceus genome, we used the haddock (Melanogrammus aeglefinus; AY386248.1) BSA lipase coding sequence as a BLAST query against our assembled transcriptomes, where the highest bit score and percent identity (greater than 70%) were used to identify the orthologs. Once BSA lipase transcripts were identified, we used mummer and blast to identify gene copies of BSA lipase within the C. violaceus genome. Again, we used a perl script (Lawrence et al. 2015) to trim the contig to the region containing only the BSA lipase genes and neighboring loci, which was then viewed with AUGUSTUS. We identified orthologous sequences from assembled transcriptomes of Xiphister mucosus, Xiphister atropurpureus, Anoplarchus purpurescens, and Phytichthys chirus pyloric caeca samples, which were generated for another study (Herrera et al. unpublished). All sequences were aligned in mega v7.0.26 and then selection and recombination were estimated in datamonkey server v2.0 with aBSREL, MEME, and GARD.
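The ortholog screen described above (percent identity greater than 70%, ranked by bit score) can be illustrated with a short filter. This sketch assumes the BLAST results are in the standard 12-column tabular format (query, subject, percent identity, ..., e-value, bit score). The text does not state which output format was used, and the transcript names and scores below are invented for illustration.

```python
def candidate_orthologs(blast_tab, min_identity=70.0):
    """Keep hits above the identity threshold and rank them by bit score."""
    hits = []
    for line in blast_tab.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        subject = cols[1]
        pident = float(cols[2])     # column 3: percent identity
        bitscore = float(cols[11])  # column 12: bit score
        if pident > min_identity:
            hits.append((subject, pident, bitscore))
    # Highest bit score first; percent identity breaks ties.
    return sorted(hits, key=lambda h: (h[2], h[1]), reverse=True)


# Hypothetical hits for the haddock BSA lipase query against assembled transcripts.
rows = [
    ["AY386248.1", "tx_A", "82.5", "1500", "262", "0", "1", "1500", "10", "1509", "0.0", "1200"],
    ["AY386248.1", "tx_B", "68.0", "1600", "512", "0", "1", "1600", "5", "1604", "0.0", "1500"],
    ["AY386248.1", "tx_C", "75.0", "1100", "275", "0", "1", "1100", "1", "1100", "0.0", "900"],
]
example = "\n".join("\t".join(r) for r in rows)

ranked = candidate_orthologs(example)
for subject, pident, bitscore in ranked:
    print(subject, pident, bitscore)
```

In this toy input, tx_B has the highest bit score but is dropped because its identity falls below the 70% threshold, which shows why the two criteria are applied together.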
Identification of Orthologs across Teleost Fishes
All C. violaceus transcripts predicted from AUGUSTUS were used for identifying orthologs among ensembl protein datasets of bony (teleost) fishes and a lobe-finned fish (coelacanth). The following teleost and non-teleost genomes were taken from ensembl release 89 for our comparative analysis of orthologs and phylogeny of fishes: Poecilia formosa (Amazon molly), Astyanax mexicanus (blind cave fish), Gadus morhua (Atlantic Cod), Takifugu rubripes (Fugu), Oryzias latipes (Japanese medaka), Xiphophorus maculatus (platyfish), Lepisosteus oculatus (spotted gar), Gasterosteus aculeatus (stickleback), Tetraodon nigroviridis (green spotted puffer), Oreochromis niloticus (Nile tilapia), Danio rerio (zebrafish), Latimeria chalumnae (coelacanth). All C. violaceus transcripts were translated into protein sequences using ORFPREDICTOR (Min et al. 2005). From all 13 fish species, only sequences of 60 amino acids or longer were used for our analyses. We conducted a pairwise identification of orthologs using INPARANOID v4.0 (O’Brien et al. 2005), running all 78 possible pairwise comparisons [N(N-1)/2, where N is the number of taxa]. From the outputs of inparanoid, we used quickparanoid (http://pl.postech.ac.kr/QuickParanoid/) to identify orthologous clusters from all 13 species. Once single-copy orthologs were identified from all datasets, we used custom python scripts to align sequences using MUSCLE (Edgar, 2004), and a consensus phylogenetic tree was then generated using PhyML v3.1 (Guindon et al. 2010) with 1,000 bootstrap replicates, based on a concatenated alignment of the 30 protein-coding genes from all 13 taxa with the lowest average number of gaps among our ortholog cluster alignments.
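The count of 78 comparisons corresponds to the number of unordered pairs among the 13 taxa, N(N-1)/2 with N = 13. The sketch below enumerates those pairs and applies the 60-amino-acid length filter. The function name and the toy protein sequences are hypothetical.

```python
from itertools import combinations

# The 12 ensembl species listed above plus C. violaceus.
taxa = [
    "Cebidichthys violaceus", "Poecilia formosa", "Astyanax mexicanus",
    "Gadus morhua", "Takifugu rubripes", "Oryzias latipes",
    "Xiphophorus maculatus", "Lepisosteus oculatus", "Gasterosteus aculeatus",
    "Tetraodon nigroviridis", "Oreochromis niloticus", "Danio rerio",
    "Latimeria chalumnae",
]

# Every unordered pair of taxa is one pairwise ortholog comparison.
pairs = list(combinations(taxa, 2))
n = len(taxa)
assert len(pairs) == n * (n - 1) // 2


def filter_min_length(proteins, min_len=60):
    """Keep only protein sequences of at least min_len amino acids."""
    return {name: seq for name, seq in proteins.items() if len(seq) >= min_len}


# Toy sequences: one below, one at, and one above the threshold.
toy = {"short": "M" * 59, "exact": "M" * 60, "long": "M" * 300}
kept = filter_min_length(toy)

print(len(pairs), sorted(kept))
```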
Comparative Analysis for Syntenic Regions across Teleost Fishes
We selected the following fish genomes to identify syntenic regions with our C. violaceus genome assembly: zebrafish (Danio rerio), stickleback (Gasterosteus aculeatus), spotted gar (Lepisosteus oculatus), and the Japanese medaka (Oryzias latipes). All genomes were masked using repeatmasker v.4.0.6 with the parameter -species teleostei. To identify syntenic regions, we used the contigs from our C. violaceus assembly that were 1 Mb or larger and concatenated the remaining contigs. We used satsuma v3.1.0 (Grabherr et al. 2010) with the parameters -n 4 -m 8 to identify syntenic regions between the C. violaceus genome and each of zebrafish, stickleback, spotted gar, and Japanese medaka. Afterwards, we generated circos plots to view syntenic regions shared between species using CIRCOS v0.63-4 (Krzywinski, 2009).
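The contig preparation step (keep contigs of 1 Mb or larger, concatenate the rest) can be sketched as below. This is an illustration under assumptions rather than the script used in the study: contigs are held in a name-to-sequence dictionary, the remainder is joined without any spacer, and the example uses a tiny threshold so that it runs on toy sequences.

```python
def split_contigs(contigs, min_len=1_000_000):
    """Separate large contigs from small ones and concatenate the small ones."""
    large = {name: seq for name, seq in contigs.items() if len(seq) >= min_len}
    # Assumption: the remaining contigs are joined end to end in input order.
    remainder = "".join(seq for seq in contigs.values() if len(seq) < min_len)
    return large, remainder


# Toy contigs with a threshold of 10 bp standing in for 1 Mb.
toy_contigs = {
    "contig_1": "ACGTACGTACGT",  # 12 bp, kept as is
    "contig_2": "ACGT",          # 4 bp, concatenated
    "contig_3": "GGCCA",         # 5 bp, concatenated
    "contig_4": "TTTTTTTTTT",    # 10 bp, kept as is
}

large, remainder = split_contigs(toy_contigs, min_len=10)
print(sorted(large), remainder)
```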
ACKNOWLEDGEMENTS
We would like to thank H. Yip and Q.B. Nguyen-Phuc for assistance in collecting samples for genomic and transcriptomic analyses, and D. Canestro for providing assistance and facilities at the Kenneth S. Norris Rancho Marino Reserve in Cambria, CA while we collected samples for this project. We would like to thank M. Oakes, V. Ciobanu, S.A. Chung, D. Yu, and Y. Kanomata at the UC Irvine Genomics High-Throughput Facility. We thank A. Long, J. Baldwin-Brown, and K. Thorton for RNA-Seq assembly suggestions, and the comparative physiology group at UC Irvine for providing feedback on the content of this manuscript. We would also like to thank N. Nirale for assistance in uploading all genomic and transcriptomic sequences onto NCBI’s GenBank. All sequence data were deposited in NCBI’s GenBank under the bioproject ID: PRJNA384078. Included under this bioproject are the Illumina genomic and tissue transcriptomic objects under SAMN06857690-SAMN0687699, 40 SMRT cells of PacBio sequencing under SAMN06857690, and the final genome assembly (NJBE00000000). This project was funded through NSF-IOS 1355224. The authors declare no conflicts of interest.