Large scale gene duplication affected the European eel (Anguilla anguilla) after the 3R teleost duplication

Genomic scale duplication of genes generates raw genetic material, which may facilitate new adaptations for the organism. Previous studies on eels have reported specific gene duplications, however a species-specific large-scale gene duplication has never before been proposed. In this study, we have assembled a de novo European eel transcriptome and the data show more than a thousand gene duplications that happened, according to a 4dTv analysis, after the teleost specific 3R whole genome duplication (WGD). The European eel has a complex and peculiar life cycle, which involves extensive migration, drastic habitat changes and metamorphoses, all of which could have been facilitated by the genes derived from this large-scale gene duplication. Of the paralogs created, those with a lower genetic distance are mostly found in tandem repeats, indicating that they are young segmental duplications. The older eel paralogs showed a different pattern, with more extensive synteny suggesting that a Whole Genome Duplication (WGD) event may have happened in the eel lineage. Furthermore, an enrichment analysis of eel specific paralogs further revealed GO-terms typically enriched after a WGD. Thus, this study, to the best of our knowledge, is the first to present evidence indicating an Anguillidae family specific large-scale gene duplication, which may include a 4R WGD.


Introduction
6 2 by using a transitive clustering approach to create sets of very similar transcripts. The number of 1 3 9 unigenes (henceforth referred to as transcripts) assembled ranged from 68489 to 78610 (table  1  4  0 1) and the number of transcript clusters from 49154 to 55667 (henceforth referred to as genes; 1 4 1  The genomes of zebrafish (Danio rerio), northern pike, spotted gar (Lepisosteus oculatus), fugu 1 4 6 (Takifugu rubripes), and platyfish (Xiphophorus maculatus) were obtained from the ENSEMBL 1 4 7 9 1 0 Species cladogram generated by PHYLDOG for the species included in this study: European eel (Anguilla anguilla), zebrafish (Danio rerio), northern pike (Esox lucius), spotted gar (Lepisosteus oculatus), fugu (Takifugu rubripes), platyfish (Xiphophorus maculatus), Atlantic salmon (Salmo salar), elephantnose fish (Gnathonemus petersii) and silver arowana (Osteoglossum bicirrhosum). PHYLDOG also determined the duplication events and for each of these events the 4dTv and the synteny type found around the gene was determined. Only the 4dTv distributions for the branches with most duplications are represented over the corresponding cladogram branch, for the distributions for all branches refer to supplementary figure 2. The synteny types are the following: close, the copies originated by the duplication are close in the genome; some synteny, some genes close to the one duplicated are also found to be duplicated close by; no synteny, there are no paralogs for others genes found close to the paralog copies created by the duplication; no information, the duplicated genes are located in small scaffolds with not enough genes close by; conflicting syntenies, different synteny classification found in the genomes of the different species affected by the duplication) 1 8 6

Figure 3
Barplot of the number of duplications PHYLDOG assigned to each branch of the species tree. Bars are numbered according to the cladogram in the upper righthand corner. 3R indicates the branch where the 3R teleost-specific whole genome duplication is hypothesized to have happened. 3R-B indicates the basal branch of the remaining teleosts after the split of the elopomorphas and osteoglossomorphas. Each bar is subdivided into the synteny types described in figure 2. In order to investigate the timing of the main eel duplication event in greater depth, we 2 1 9 compared the 4dTv distribution found for eel paralogs with the 4dTv distribution built for the eel 2 2 0 orthologs with elephantnose fish, and arowana. The results showed a 4dTv maximum of 0.4 for 2 2 1 the main eel paralog peak, and 0.5 for the peaks corresponding to the speciation event that 2 2 2 separated elephantnose fish and arowana from eel (Fig. 4). 2 2 3 Figure 4 4dTv distribution of European eel (Anguilla anguilla) paralogs (blue), European eel and elephantnose fish (Gnathonemus petersii) orthologs (green), and European eel and silver arowana (Osteoglossum bicirrhosum) orthologs (orange). This analysis was carried out in the gene families without taking into account the PHYLDOG result.

4
Investigation of functional category enrichment 2 2 5 To investigate if some functional categories were overrepresented in the eel duplications, an 2 2 6 enrichment test was carried out. GO terms were assigned to 9570 gene families by comparing The resulting enriched GO-terms, are presented in the suppl. table 2. In most cases these terms 2 3 1 are involved either in signal transduction (GTPase, MAPK, sphingosine-1-phosphate), 2 3 2 morphological alterations (convergent extension involved in axis elongation and gastrulation, 2 3 3 heart development, and pronephric glomerulus development, pigmentation), or forebrain 2 3 4 development.

3 5
Additionally, KEGG terms were assigned to 16466 eel genes using BlastKOALA (Kanehisa et 2 3 6 al. 2016) and the terms related to the genes involved in the eel duplication were mapped onto 2 3 7 the KEGG pathways using the KEGG Mapper tool. A fisher test, corrected for multiple testing 2 3 8 using False Discovery Rate, was used to look for enriched KEGG pathways in the eel branch 2 3 9 (Suppl. Table 3). Most of the KEGG pathways found to be enriched were related to immune 2 4 0 system, nervous system, oocyte, apoptosis, cell adhesion, amino acid metabolism, glycan 2 4 1 biosynthesis, and signal transduction. Several key genes related to the immune system were 2 4 2 also duplicated, including: Cytohesin-associated scaffolding protein (casp) c-jun N-terminal 2 4 3 knase (jnk) and vav proto-oncogene (vav), involved in the lymphocyte differentiation and 2 4 4 activation, the interleukine and T-cell receptors (tcr's), major histone compatibility complex 2 4 5 (mch) I, mch II, and several cytokines including tgf-beta (transforming growth factor beta) and B-2 4 6 cell activating factor (baff). There were also numerous duplications related to apoptosis, 2 4 7 including activator protein 1 (ap-1), c-fos proto-oncogene (c-fos), casp, serine protease (omi), 2 4 8 mitochondrial Rho GTPase (miro), Death executioner (bcl-2), Caspase Dredd, X-linked inhibitor 2 4 9 of apoptosis protein (xiap), casp8, and Baculoviral inhibitor-of-apoptosis repeat (bir). In the 2 5 0 nervous system tyrosine hydroxylase (th) and monoamine oxidase, as well as some receptors 2 5 1 (gaba-a, gaba-b, ampa, mglur, crfr1, and girk) were duplicated, as well as several genes 2 5 2 involved in synaptic exocytosis (rab3a, munc-18, syntaxin, vgat), and Golf and Transductin, 2 5 3 which are involved in olfaction and phototransduction. In the oocyte more affected genes were 1 4 phosphatase (glc7), cullin F-box containing complex (scf-skp), insulin-like growth factor 1 (igf1), 2 5 6 and cytoplasmic polyadenylation element binding protein (cpeb). 2 5 7 2 5 8 Discussion 2 5 9 The observed data has shown, for the first time to our knowledge, that in the eel lineage more 2 6 0 than a thousand gene families have genes that are the result of a large-scale gene duplication 2 6 1 that happened after the teleost specific 3R WGD. Only Atlantic salmon, due to its salmonid-2 6 2 specific 4R WGD, showed a higher amount of conserved duplicates, whereas the number of 2 6 3 duplicated gene families found in eel and zebrafish were similar. Furthermore, the paralog 4dTv 2 6 4 distribution shows a unique pattern in the case of the eel branch when compared to the other 2 6 5 teleosts.

6 6
The duplications assigned by the phylogenetic analysis to the eel specific branch, after the split 2 6 7 of arowana and elephantnose fish, showed a distribution of 4dTv distances younger than those 2 6 8 corresponding to the 3R duplication and older than the salmonid duplications. This result was 2 6 9 replicated in an independent analysis, not based on phylogenetic tree topologies. In this case 2 7 0 the distributions of the 4dTv distances ( Fig. 4) were calculated between the eel genes found in 2 7 1 the gene families (eel paralogs) and between the eel sequences and the arowana and 2 7 2 elephantnose fish sequences (orthologs). Both the phylogenetic topologies and the 4dTv 2 7 3 distances corroborate the hypothesis that the main eel duplication happened after the teleost To be more precise about the timing of the main 2 7 9 eel duplication event it would be advisable to study other elopomorpha transcriptomes or 2 8 0 genomes to analyze if they share the same duplication.

8 1
Not all the PHYLDOG duplication assignments to the basal branches of the species tree are as 2 8 2 trustworthy as the lineage specific ones. According to PHYLDOG the duplications usually 2 8 3 associated with the 3R WGD would be split into two events, one would correspond to the 3R 2 8 4 branch and the other with the one that gave rise to zebrafish, salmon and fugu (indicated as 3R-2 8 5 B; Fig. 2). If PHYLDOG is right and there were really two genome duplications the 4dTv In eel, the 4dTv distribution pattern found is quite distinct as it reflects two modes; one of 3 6 1 younger and one of older duplications. Of these duplications, the older ones are clearly older 3 6 2 than those found, for example, in zebrafish or salmon. Furthermore, the genomic surroundings 3 6 3 of the paralogs of the younger duplications are quite different from those of the older 3 6 4 duplications. The younger duplications, which have a lower 4dTv, tend to be located close 3 6 5 together in the genome and are likely to have been generated recently by tandem SDs, whereas 3 6 6 the older duplications are not usually found in tandem. A WGD should have left behind blocks of 3 6 7 syntenic regions similar to those detected in our analysis for the 3R teleost and the salmon 3 6 8 WGD. In the case of the eel, an increase in these syntenic blocks was also detected in the older 3 6 9 duplications. However most genomic regions are very fragmented in the genome assembly and 3 7 0 thus, we lack physical genomic location information for many genes. However, from the 3 7 1 evidence available we can hypothesize that most of the older duplications are likely to be the 3 7 2 result of a WGD which occurred in the eel lineage, and that an analysis with the latest eel 3 7 3 genome assembly published (Jansen et al. 2017), but not available without restrictions, would 3 7 4 detect more syntenic regions. In other words, the numerous duplications found in eel are likely 3 7 5 to have been generated by a WGD followed by many SDs, a pattern which has also been 3 7 6 observed in primates (Gu et al. 2002), and common carp (David et al. 2003).

7 7
In this study, several gene function analyses were carried out to study overrepresented 3 7 8 functions among the eel specific duplications. These overrepresentations could be linked to 3 7 9 several adaptations that have taken place throughout eel evolution, e.g. the inclusion of a Specifically, 54 GO-terms were found to be enriched among the eel specific duplications (suppl. 3 8 9 table 2). Interestingly, several of the GO-terms found to be enriched among the eel specific 3 9 0 duplications form part of some of the aforementioned processes, including; development, ion 3 9 1 transport, signaling, neuronal function, and metabolism. The high number of enriched GO-3 9 2 terms, which are part of processes that are often conserved after WGDs, suggests that the 3 9 3 duplication event here described is a WGD. It is likely that they have been conserved due to the 3 9 4 mechanisms regulating gene conservation after WGD rather than due to specific necessities of 3 9 5 the Anguilla species. Other GO terms that were found to be duplicated in eel are not usually 3 9 6 found in other WGDs, for example "pigmentation". As the eel has incorporated several 3 9 7 pigmentation alterations into its lifecycle, it is likely that the genes associated with this GO-term 3 9 8 are conserved due to new functions acquired in the Anguilla species. Most of the pigmentation 3 9 9 changes undergone by the eel are linked to the transition between the marine and freshwater 4 0 0 environment, therefore the duplication of these genes might have generated the necessary raw 4 0 1 genetic material for adaptation to the catadromous lifecycle. 4 0 2 Furthermore, 54 KEGG pathways were also found to be enriched among the eel specific where DA has been found to have an inhibitory role on the gonadotropic activity of the pituitaries 4 2 4 (Dufour et al. 2005;Peter et al. 1986). In the case of the eel in particular, DA inhibition 4 2 5 completely arrests puberty before their oceanic migration (Vidal et al. 2004), indicating that DA 4 2 6 has a much stronger inhibitor effect in eel compared to most other teleosts (Dufour 1988; Vidal 4 2 7 et al. 2004). This suggests that the duplicated genes involved in the dopaminergic synapse 4 2 8 pathway may have been conserved during the adaptation to block maturation until after the 4 2 9 extensive reproductive migration. Among the duplicated genes assigned to the dopaminergic 4 3 0 synapse pathway, we found tyrosine hydroxylase (TH). TH is the rate limiting enzyme of the DA 4 3 1 biosynthesis (Nagatsu et al. 1964), and is therefore often used as an indicator of DA tone in eel 4 3 2 (Davila et al. 2003;Weltzien et al. 2015). As genes are rarely conserved without a specific 4 3 3 function or necessity, the presence of two TH genes in the eel encourages suspicion of potential 4 3 4 differential expression or function between the two, which may prove important for the regulation 2 1 In conclusion, the data presented strongly suggest that a vast amount of genes have been 4 3 7 duplicated specifically in the eel lineage. Furthermore, the synteny, 4dTv, and enrichment 4 3 8 analyses suggest that these genes derive both from a WGD as well as continuously created 4 3 9 SDs, and that they are related to the eel specific physiology. To our knowledge this is thus the 4 4 0 first evidence published suggesting a possible eel lineage specific 4R WGD. individual recirculation systems, a temperature control system (with heaters and, coolers), and 4 4 8 aeration. The fish were gradually acclimatized to sea water (final salinity 37 ± 0.3‰), over the 4 4 9 course of two weeks. The temperature, oxygen level and pH of rearing were 20 ºC, 7-8 mg/L 4 5 0 and ~ 8.2, respectively. The tank was covered to maintain, as much as possible, a constant 4 5 1 dark photoperiod and the fish were starved throughout the holding period. After acclimation, the 4 5 2 fish were sacrificed in order to collect samples of forebrain, pituitary, and testis tissues. Transcriptome assemblies and genomes 4 7 7 The software FastQC (Andrews 2010) was used to assess the quality of the raw reads 4 7 8 generated by Macrogen. Thereafter, trimmomatic (Bolger et al. 2014) was used to trim the 4 7 9 reads, eliminating known adaptor sequences, and low quality regions. Finally, trimmed reads 4 8 0 shorter than 50 bp were filtered out. Eel reads were digitally normalized before assembly by The transcripts assembled were filtered according to their complexity (with a DUST score 4 8 6 threshold of 7 and a DUST window of 64), length (with a minimum length of 500 bp), and level of expression (with a TPM threshold of 1). After assembly, the CDSs and proteins were 4 8 8 annotated using the Trinotate functional annotation pipeline (Haas et al. 2013).

8 9
Transcripts that share k-mers are clustered by Trinity, however, these transcripts might 4 9 0 correspond to different transcript forms of the same gene or to closely related genes from a 4 9 1 gene family. We split these transcripts into genes by running a transitive clustering based on a 4 9 2 blast search. In this clustering transcripts which shared at least 100 bp with a minimum identity 4 9 3 of 97% were considered to be isoforms of the same gene. Thus, some Trinity clusters were split 4 9 4 into several genes. For each gene, the most expressed transcript, according to Salmon (Patro 4 9 5 et al. 2017), was chosen as its representative. 4 9 6 The available eel genome was downloaded from the ZF-Genomics web site (Henkel et al. Genome and transcriptome quality assessment 5 0 6 In order to check the quality of the transcriptomes and genomes we looked for the BUSCO 5 0 7 conserved gene set in them (Simão et al. 2015). BUSCOs are conserved proteins, and are 5 0 8 expected to be found in any complete genome or transcriptome. Therefore, the number of 5 0 9 present, missing, or fragmented BUSCOs can be used as a quality control of a genome or 5 1 0 transcriptome assembly. For this assessment the Actinopterygii (odb9) gene set, which consists 5 1 1 of 4584 single-copy genes that are present in at least 90% of Actinopterygii species was used. 5 1 2 As an additional comparison between the transcriptome and genomes of pike and eel, the RNA-5 1 3 43566 aminoacids. The model used was CAT-GTR and three independent MCMC chains were 5 4 0 run for 39872, 56328, and 39285 iterations.

4 1
Finally, a neighbor joining tree based on the fourfold synonymous third-codon transversion 5 4 2 distances (4dTv) was also calculated (Tang et al. 2008). Between any pair of sequences the 5 4 3 number of transversions found in the third base of the codon was divided by the number of four-5 4 4 fold degenerated codons. A correction to the 4dTv was applied: ln(1 -2 * distance) / -2. For 5 4 5 each pair of species a 4dTv distance was calculated. 4dTvs were calculated between the 5 4 6 sequences of those species found in each gene family codon alignment. The distribution of 5 4 7 those 4dTvs was fitted with a log normal mixture model using the scikit-learn Gaussian Mixture 5 4 8 class. The number of components required was one for all the species pairs, except for those 5 4 9