ABSTRACT
Black flounder (Paralichthys orbignyanus) is an economically important ma-rine fish with aquaculture potential in Argentina due to its market value. In this study, we sequenced the whole genome using an Illumina sequencing technology. We started with two independent libraries (from one female and one pool of females; each with 150 bp paired-end reads, a mean insert length of 350 bp, and >35 X-fold coverage). Each library was assembled separately using SOAPdenovo2 and the resulting contigs were scaffolded with SSPACE3 before gaps were filled with GapCloser. In vertebrates, including teleosts, the number of transposable elements (TEs) is related to genome size, but it remains unclear whether the size of introns and exons also plays a role. Therefore, the main objective of the present study was to test whether the small genome size of Pleuronectiformes is related to the size of their introns and exons. The assemblies re-sulted in a genome size of ∼538 Mbp (41.35% GC content, 0.11% undetermined bases). Analysis of the assemblies at the core genes level (subset of the 458 universally ex-pressed KOG families) revealed that more than 98% of core genes are present, with more than 78% of them having more than 50% coverage. This indicates a fairly complete and accurate genome at the coding sequence level. Prediction of genes based on statistical predictors (geneid) and sequence-based predictors (Exonerate, using a closely related species, Paralichthys olivaceus, as a reference) was performed. This revealed 25,231 protein-coding genes, 445 tRNAs, 3 rRNAs, and more than 1,500 non-coding RNAs of other types (including a complete set of spliceosomes and several types of snoRNA and miRNA). As a result, this study concluded that the reduced genome size of flounders is related to a reduction in transcript size, mainly through a reduction in exon number, but also through a reduction in large introns. Thus, both components seem to be involved in the strategy of genome reduction in Pleuronectiformes.
INTRODUCTION
Pleuronectiforms are an interesting order of fish due to their adaptations to de-mersal life and the presence of a drastic metamorphosis from bilateral pelagic larval symmetry to flatfish symmetry in the adult stage (Robledo et al., 2017; Lü et al., 2021). This order consists of a total of 772 species distributed among 129 genera and 14 families (Nelson et al., 2016). Some of the species are of great economic importance due to their importance to fisheries and aquaculture (Seitz et al., 2008). Within this order, the genus Paralichthys is one of the most relevant species for aquaculture. In Asia, aqua-culture of the Japanese flounder (Paralichthys olivaceus) is well established (Seikai, 2002). In Latin America, other species belonging to this genus have also been identified with potential for aquaculture. Worth mentioning because of their importance to fish-eries and aquaculture are Chilean flounder (Paralichthys adspersus), endemic to Peru and Chile, and black flounder (Paralichthys orbignyanus) in Brazil, Argentina, and Uruguay (Díaz de Astarloa, 2002).
Black flounder inhabit shallow estuaries and coastal waters from Rio de Janeiro (Díaz de Astarloa, 2002) to the Gulf of San Matías in northern Patagonia (Díaz de Astarloa and Munroe, 1998). As this species is a promising candidate for aquaculture, numerous zootechnical studies have been conducted on it in the last two decades (Bam-bill et al., 2006; Boccanfuso et al., 2019; López et al., 2009; Magnone et al., 2015; Sampaio and Bianchini, 2002; Sampaio et al., 2007; Radonic et al., 2007; Radonic and Macchi, 2009). However, there is an important knowledge gap in the genetic resources for this species. For example, only 12 protein sequences are deposited in the GenBank® sequence database (http://www.ncbi.nlm.nih.gov/). This is even more remarkable considering that genetics and genomics are considered one of the most important pieces of information for the development of the aquaculture and fisheries industry (Cerdà et al., 2010). This is particularly important for the study of sex determination and differentiation, sex ratio, nutrition, reproduction, genetic population structure, and other important biological traits. In this context, a reference genome is an essential requirement for the study of all these processes and mechanisms.
In the last decade, next-generation sequencing (NGS) technology has gained acceptance and played a critical role in obtaining whole genome sequences from different of non-model fish species. According to the Ensembl Genome Browser (https://www.ensembl.org), hundreds of fish species with whole genome sequences (WGS) are available to date, and the number of species whose genomes have been sequenced continuous increasing. Among Pleuronectiformes, for example, thirteen species have their genome sequenced: tongue sole (Cynoglossus semilaevis) (Chen et al., 2014), Turbot (Scophthalmus maximus) (Maroso et al., 2018), Senegalese sole (Solea senegalensis) (Guerrero-Cózar, et al., 2021), Japanese flounder (Paralichthys olivaceus) (Shao et al., 2017), Spotted halibut (Verasper variegatus) (Zhao et al., 2021), Hogchoker (Trinectes maculatus), Pelican flounder (Chascanopsetta lugubris), Oriental sole (Brachirus orientalis), Bloch’s tonguesole (Paraplagusia blochii), New Zealand turbot (Colistium nudipinnis), Ocellated flounder (Pseudorhombus dupliocellatus), Starry flounder (Platichthys stellatus), and Indian halibut Psettodes erumei) (Lü et al., 2021). In addition, recent studies have compiled the genomes at the chromosome level of turbot (Martínez et al., 2021) and spotted halibut (Zhao et al., 2021). A recent study also analyzed the origin of the specialized body structure of flatfishes, the genomes of 11 flatfish species representing 9 of the 14 pleuronectiform families of this order were sequenced (Lü et al., 2021). In this frame, sequencing and annotation of the black flounder genome would expand the knowledge of this species so that alternative genomic methods could be used to improve fish breeding and to conduct comparative and evolutionary studies with other flatfish species.
Comparative genomics has revealed striking differences in genome size even in closely related species. In teleosts, this variation is exceptional, with enormous variance not only in genome size but also in chromosome number (Mank and Avise, 2006; Sarropoulou and Fernandes, 2011; Yuan et al., 2018). Previous studies have clearly shown that the Pleuronectiformes, along with the pufferfishes, seahorses, pipefishes, and Anabantiformes have less DNA compared to other teleost groups (Hinegardner, 1968; Hinegardner and Rosen, 1972), suggesting that they have smaller genomes. One of the hypotheses linked a small genome is the reduction in size of exons, and especially introns (Zhang and Edwards, 2012). Alternatively, the frequency of repetitive elements was positively correlated with genome size (Yuan et al., 2018). Pleuronectiform flatfish have only less than 9.0% of repetitive elements (Aparicio et al., 2002), compared to up to 60% in the salmon genome (Lien et al., 2016). However, the causes and consequences of this putative reduction in the genomes of these fish are still far from clear. In general, the variations in genome size are explained by the balance between genomelevel mechanisms that lead to genome enlargement (duplication, transposable elements, and polyploidy) and genetic mechanisms that lead to genome reduction (deletions and DNA repair mechanisms) (Brainerd et al., 2001).
Thus, the main goal of the present study was to perform the first sequencing and characterization of the whole genome of black flounder. And therefore, to proportionate an important genomic tool for further research on this species. In addition, comparative genome analysis was used to determine whether Pleuronectiformes, present a small genome size compared to other teleost species. The analysis compares the exon and intron size of genes in different Pleuronectiformes in relation with homologous genes in the other teleost genomes and also considers the number of repetitive elements.
MATERIAL AND METHODS
Fish sampling and DNA extraction
Adult black flounder, Paralichthys orbignyanus were obtained from the Estación Experimental de Maricultura (INIDEP, Argentina). Fin tissue samples were collected and preserved in 96% ethanol. Genomic DNA was extracted from one adult female and a pool of females (n=10). Samples were lysed in 300 μl SSTNE extraction buffer (Blanquer, 1990) with SDS (0.1%) and 5 μl proteinase K (20 mg/ml) for 3 h at 55 °C. After 20 min at 70 °C, samples were treated for RNA digestion with 7.5 μl RNAse (10 mg/ml) for 1 h at 37 °C. Total DNA was purified with cold absolute ethanol (1 ml) after protein precipitation with 5 M NaCl. DNA quality (high molecular weight > 20 kb) was first assessed on agarose gels and DNA quantity was measured using NanoDrop® ND-1000 spectrophotometer (NanoDrop® Technologies Inc). Finally, DNA concentration was accurately measured using a Qubit fluorometer (Life Technologies). Fish were handled in accordance with internal institutional regulations (CICUAE-INIDEP).
Genome sequencing and genome assemblies
All samples were adjusted to 300 ng/uL and fragmented by ultrasonication on Covaris, aiming for a fragment size of 300 bp. Barcoded libraries were prepared using Wafergen’s PrepX ILM DNA Library Kit on Apollo and beads size selected before being pooled and sequenced at 150 bp at the paired end on HiSeq-4000. All procedures followed the manufacturer’s instructions. We performed multiple QC sets during library preparation using 300 ng of human gDNA samples as internal control and all QC passed for this project.
Two Illumina paired-end libraries were constructed (from one female and one pool of females) and sequenced with a read length of 150 bp and an average fragment size of ∼350 bp. Reads were first filtered using Trimmomatic (Bolger et al., 2014) and PrinSeq (Bolger et al., 2014; Schmieder and Edwards, 2011) to remove adapters and low-quality reads, and then sequencing errors were corrected using Quake (Kelley et al., 2010).
The libraries were assembled individually using SOAPdenovo 2.04 (Luo et al., 2012) with a kmer size of 29, gap filled with GapCloser (Boetzer and Pirovano, 2012) and finally scaffolded with SSPACE3 (Boetzer et al., 2011) with default parameters.
Genome annotation
First, genomes were masked with Repeatmasker (Hancock, 2004) using the Repbase (Bao et al., 2009) and DFam (Hubley et al., 2016) databases. Using the masked scaffolds, we searched for t-RNAs, r-RNAs and nc-RNAs using tRNAScan-SE (Zou et al., 2015), RNAmmer (Lagesen et al., 2007), and Infernal (Nawrocki and Eddy, 2013) + RFAM (Griffiths-Jones 2003), respectively. Then, we performed gene prediction using GeneId (Blanco and Abril 2009) and Exonerate (Slater and Birney, 2005) trained with Uniprot/Swissprot (Pundir et al., 2016) and the Japanese flounder (Paralichthys olivaceus) genome (Shao et al., 2017). All proteins were annotated using InterProScan (Hancock and Bishop, 2004).
Gene family clustering and validation
To find orthologous proteins in other fish species, we downloaded the genome data of D. rerio (Howe et al., 2013), C. semilaevis (Chen et al., 2014), S. maximus (Xu et al., 2020), P. olivaceus (Shao et al., 2017 from the Zebrafish Genome Project and Ensembl. All protein sequences were aligned using BLASTP (Altschul et al., 1990) with an E value cutoff less than 1e−5. BLASTP results were parsed and imported into a MySQL database. Tables within were created by OrthoMCL (Li et al., 2003) to identify homologues with thresholds of percentMatchCutoff = 50 and evalueExpo-nentCutoff = 1e−6. To evaluate our annotation, we used Benchmarking Universal Single-Copy Orthologs (BUSCO, v3.0.2)37 to assess the assembled genome sequences. We used BUSCO with single copy orthologues from actinopterygii_odb10 to assess the completeness of the genome assembly and annotation.
Identification and comparison of Transposable elements
Transposable elements (TEs) were quantified for each genome applying a homology-based approach using RepeatMasker (Tarailo-Graovac et al., 2009) with Actinopterygii species repeats and default parameters. Relationship between genome size and TEs percentage in the genome were plotted using ggplot2 R package, and correlations were calculated by Spearman method.
Analysis of C-value in teleosts
All fish C-values (in pg) available were retrieved from the Animal Genome Size Database (http://www.genomesize.com) (Gregory, 2020). As of March 2023, the website contained data for more than 1800 fish species from more than 70 orders (48 entries correspond to Pleuronectiformes species). When necessary, the average C-value per species was calculated. The C-value for P. orbignyanus was estimated assuming that 1 pg = 978 Mb (Doležel et al., 2003). Boxplots were generated using Plotly (Plotly Technologies Inc. Collaborative data science. Montréal, QC). Because most data sets were not normally distributed (according to the Shapiro-Wilk test), C-value data were analyzed with the Mann-Whitney test, comparing Pleuronectiformes with each order using JASP (version 0.16, JASP Team, 2021).
Gene features size analysis at whole genome scale in teleost orders
Complete genome annotations were retrieved from ENSEMBL in gff3 format. We selected species in orders phylogenetically close to Pleuronectiformes. A total of twenty seven species representatives from ten fish orders were used: Anabantiformes (Anabas testudineus, Betta splendens, Mastacembelus armatus), Carangiformes (Echeneis naucrates, Seriola dumerili, Seriola lalandi dorsalis), Cichliformes (Astatotilapia calliptera, Oreochromis niloticus, Pundamilia nyererei), Cypriniformes (Cyprinus carpio, Danio rerio, Sinocyclocheilus rhinocerous), Cyprinodontiformes (Fundulus heteroclitus, Kryptolebias marmoratus, Poecilia formosa), Perciformes (Cyclopterus lumpus, Gasterosteus aculeatus, Sander lucioperca), Pleuronectiformes (Cynoglossus semilaevis, Paralichthys orbignyanus female, Scophthalmus maximus), Salmoniformes (Hucho hucho, Oncorhynchus mykiss, Salmo trutta), Labriformes (Labrus bergylta) and Tetraodontiformes (Takifugu rubripes, Tetraodon nigroviridis). See Supplementary Table Ss1 for more details. Using a tailored python script, we retrieved the size of each gene (genomic locus) by calculating the difference between the final and initial base coordinates. For each transcript model at each locus, we calculated the number of exons and the size of each exon. As well as the transcript size by calculating the sum of the exon sizes. Finally, intron sizes were inferred by calculating the difference between the first base coordinate of an exon and the last base coordinate of the preceding exon.
The size distribution of exons and introns at the genome level were plotted using the kernel density estimation of the Seaborn package (parameters: gaussian kernel, band width 0.1 for exons and 0.01 for introns, and grid size of 1000 for exons and 5000 for introns). Boxplots for the size distributions were plotted using Plotly.
Results
Genome annotation
After filtering, the remaining reads accounted for a coverage of more than 35 X-fold on each genome and were retained for assembly (Table 1).
Genome coverage for each library. Genome sequencing (post-filtering)
The final assemblies were around 524-538 Mb, with the pool being the less fragmented (Table 2).
Black flounder assembly statistics
To verify annotation completeness, we checked for BUSCO (Waterhouse et al., 2018) orthologues in Actinopterygii DB, where 3459 (3252 complete and 198 fragmented, 94,7% of the selected database) were identified in the combined assemblies (for more information see Table 3).
Gene product annotations results, GeneId, Exonerate, Infernal (with RFAM, DB, tRNAScan-SE and RNAmmer
Transposable element identification and quantification
RepeatMasker showed large differences in the frequency of repetitive elements (REs) between fish species. Fig. 1 compares TEs percentage and proportion among different fish genomes. In general, lineages with larger genome sizes had a higher frequency of REs (Fig. 1). In the same line, small genomes such as T. nigroviridis and Pleuronectiformes (especially P. orbignyanus) presented a very low frequency of REs. Among the Pleuronectiformes, the genome of P. orbignyanus has a lower proportion of DNA transposons and a higher proportion of long (LINEs) and short (SINEs) interspersed nuclear elements. We also studied the relationship between genome size with different genome elements, showing strong correlation with TEs percentage (r=0.847; p<0.001), total intron size (r=0.939; p<0.001) and total exon size (r=0.807; p<0.001) (Fig. 2).
Contents of repetitive elements in fish genomes. Repetitive elements (REs) such as DNA transposons, long and short interspersed nuclear elements (LINEs and SINEs, respectively) and long terminal repeats (LTRs) are shown both as total percentage in genome (A) or percentage relative to repetitive elements content (B). A taxonomy tree for the species analyzed is shown below. Species: Anabantiformes (Anates: Anabas testudineus, Betspl: Betta splendens, Masarm: Mastacembelus armatus), Carangiformes (Echneu: Echeneis naucrates, Serdum: Seriola dumerili, Serlal: Seriola lalandi dorsalis), Cichliformes (Astcal: Astatotilapia calliptera, Orenil: Oreochromis niloticus, Punnye: Pundamilia nyererei), Cypriniformes (Cypcar: Cyprinus carpio, Danrer: Danio rerio, Sinrhi: Sinocyclocheilus rhinocerous), Cyprinodontiformes (Funhet: Fundulus heteroclitus, Krymar: Kryptolebias marmoratus, Poefor: Poecilia formosa), Perciformes (Cyclum: Cyclopterus lumpus, Gasacu: Gasterosteus aculeatus, Sanluc: Sander lucioperca), Pleuronectiformes (Cynsem: Cynoglossus semilaevis, Parorb: Paralichthys orbygnanus, Scomax: Scophthalmus maximus), Salmoniformes (Huchuc: Hucho hucho, Oncmyk: Oncorhynchus mykiss, Saltru: Salmo trutta), Labriformes (Labber: Labrus bergylta) and Tetraodontiformes (Takrub: Takifugu rubripes, Tetnig: Tetraodon nigroviridis).
Correlation between genome size (Gb) and TE content (%) of fish.
Gene family clustering
Gene families were clustered according to their similar structures into three different flatfish families; Scophthalmidae (S. maximus), Paralichthyidae (P. orbignyanus and P. olivaceous), Cynoglossidae (C. semilaevis), and D. rerio using OrthoMCL. A total of 11,870 putative specific genes among the five species and 1677 putative specific gene families were identified among the four flatfish species included in the analysis (Fig. 3).
Venn diagram showing orthology in the three flatfish species available in Ensembl and zebrafish. Protein Orthologs were calculated using OrthoMCL (Li et al., 2003) with default parameters.
C-value comparison between Pleuronectiformes and other fish orders
Using publicly available C-value data, we performed a comparative analysis of haploid genome sizes in fish (Fig. 4, Supplementary Fig. S1). We used data from 1504 species belonging to 73 orders (according to the NCBI taxonomy database). The estimated C-value of P. orbignyanus is 0.56 pg, which corresponds to a small fish genome size (5th percentile = 0.61 pg). The average C-value for Pleuronectiformes is 0.734 pg, and the only order with a significantly lower average C-value is Tetraodontiformes (0.618 pg, P < 0.001). Gerreiformes, Ophidiiformes, Uranoscopiformes, Osmeriformes, and Chaetodontiformes also have lower average C-value than Pleuronectiformes, but statistical analysis did not reveal a significant difference. Eight of 49 species (16.3%) of Pleuronectiformes had C-values in the 5th percentile range. In addition, 51.5% of Tetraodontiformes species (34/66) had C-values in the 5th percentile, consistent with the fact that this order had the lowest mean C-value. Other orders represented in this 5th percentile range are: Gerreiformes 2/4 (50%), Uranoscopiformes 2/4 (50%), Syngnathiformes 14/32 (43.75%), Ophidiiformes 2/5 (40%), Anabantiformes 5/28 (17.86%), Chaetodontiformes 2/15 (13.3%), Spariformes 5/39 (12.82%), Istiophoriformes 1/8 (12.5%), Carangiformes 3/26 (11.54%), Blenniiformes 5/44 (11.36%), Perciformes 8/95 (8.42%), Gobiiformes 3/40 (7.5%), Gadiformes 1/20 (5%), Acanthuriformes 1/23 (4.35%), Siluriformes 3/111 (2.70%), Lutjaniformes 1/37 (2.70%), Cypriniformes 8/394 (2.03%) and Centrarchiformes 1/51 (1.96%).
Haploid genome size across fish based on C-value. Boxplots representing C-values of 34 fish orders (top), organized by their taxonomic relationships (bottom). Y-axis represents C-values in log scale. For boxplots: horizontal bar, median; dashed lines, mean and standard deviation; filled circles, outliers; open circles, suspected outliers. C-value for P. orbignyanus, red colored circle. C-value means represented by boxes colored from yellow (highest) to purple (lowest). Analysis of unpaired t-test of P. orbignyanus versus each order are shown (****, P < 0.0001; ***, P < 0.001; **, P < 0.01; ns, not significant).
We additionally analyzed the C-values of all Pleuronectiformes species (Fig. 5). The C-value of P. orbignyanus is one of the smallest in the order (Pleuronectiformes 10th percentile value = 0.55 pg), with three members of the family Paralichthyidae having lower C-values (Pseudorhombus jenynsii 0.54 pg, Paralichthys dentatus 0.53 pg, and Pseudorhombus arsius 0.49 pg). Finally, two other species from different families had smaller C-values: Rhombosolea tapirina 0.55 pg (Rhombosoleidae family), and Pleuronectes platessa 0.39 pg (Pleuronectidae family).
Pleuronectiformes species C-values. The median and the 10th and 90th percentile values are shown. C-value for P. orbignyanus shown in red. Other two species (C. semilaevis and S. maximus) with sequenced genomes are also highlighted. Taxonomy from NCBI’s Common Tree for Pleuronectiformes shown on the left.
Taken together, these results indicate that low genome size is a trait of Pleuronectiformes, and particularly in species from Paralichthyidae (including P. orbignyanus).
Whole genome scale analysis of gene features size in teleost fish
To find a possible explanation for the relatively small size of the P. orbignyanus genome, we investigated whether this may be associated with major genetic feature sizes, e.g., smaller intron sizes. To this end, we determined the coordinates of all exons and introns for each gene model in the genome of P. orbignyanus and other twenty fish species. Using these data, we determined the sizes of all exons and introns, and inferred the size of the genes (exons sizes + introns sizes) and the size of the transcripts (exons sizes). Since it is possible to identify more than one gene model per locus, we also determined the number of exons (and introns) per gene model (Fig. 6). Remarkable, there is a reduction of the overall gene and transcript size in P. orbignyanus, compared to every other species analyzed here (Fig. 6A and 6B), including species from Perciformes (G. aculeatus) and Tetraodontiformes (T. nigroviridis), which also show in general low C-values. Both gene and transcript size distributions are shifted to lower sizes in P. orbignyanus, observed in quantile analysis (Supplementary Fig. S2 and Supplementary Table Ss2-3). Interestingly, the shift towards smaller sizes in P. orbignyanus compared to T. nigroviridis is more evident for transcript sizes. To discard an effect of differential untranslated regions sizes, we confirmed a shift towards smaller protein sizes in P. orbignyanus (Supplementary Fig. S3). In addition, we found that the number of exons and introns per gene model (Fig. 6C) in flounder is the lowest in the genomes analyzed (5.88 exons per model). Note that the total number of genes in P. orbignyanus (21531) is comparable to other species in Pleuronectiformes (21811 bp in C. semilaevis and 21447 in S. maximus). However, the transcript number is remarkably lower in P. orbignyanus compared to C. semilaevis and S. maximus (25262, 34435 and 35543, respectively) (Fig. 6D and Supplementary Table Ss2-3).
Boxplots representing whole genome distribution of genomic loci (A) and transcript models (B) sizes in fish. Average represented by diamonds, and standard deviation in dotted lines. (C) count of exons and introns in whole genomes. Average and SEM is shown. Six letter code for the species analyzed as described in Fig. 1.
Next, we studied the introns and exons size distribution at whole genome scale in fish genomes. Generally, we found that differences between average and median are smaller for exon sizes when compared to intron sizes, suggesting a more compact distribution of the former. Note also that tails towards higher feature size are larger for introns, suggesting that variability in intron size is more permissive towards larger introns rather than larger exons (Fig. 7).
Size distribution of whole genome set of introns (left) and exons (right) for 27 species in 10 fish orders. In x-axis the size in bp is shown (log10 scale). KDE plots represent probability distribution of sizes (y-axis). Below each KDE plot, the boxplots represent the median (line) and quartiles 25 and 75%, whereas whiskers represent the upper and lower bounds. Mean (◆) ± s.d (dashed lines) are shown. Outliers set is shown with dots. Six letter code for the species analyzed as described in Fig. 1.
In general, we did not find major differences in the distribution of exon sizes between species. Median distribution values ranged from 119 bp (T. nigroviridis) to 132 bp (F. heteroclitus), while interquantile values 75th-25th difference (IQR) span from 84 bp to 105 bp (S. rhinocerous and P. formosa, respectively). Among Pleuronectiformes, P. orbignyanus (median size value = 127 bp, IQR = 95 bp) has a smaller number of large exons, with the shift occurring more evidently at the 80th quantile (Fig. 7, Supplementary Fig. S4 and Supplementary Table Ss4-5). This shift in large exon size is also observed in species with small genome size, such as T. nigroviridis and G. aculeatus.
Greater variation among species is observed in intron sizes (Fig. 7, Supplementary Fig. S5 and Supplementary Table Ss4-5). Median size values range from 121 bp (T. nigroviridis) to 1139 bp (D. rerio), and IQR values are highly variable, ranging from 286 in T. nigroviridis to 2602 in D. rerio. The largest difference in Pleuronectiformes is observed for large introns, with very large introns (>10000 bp) being extremely rare in P. orbignyanus (Fig. 7, see insert). The 99th quantile for this species is 7311 bp, whereas in other Pleuronectiformes it is 16966 bp (C. semilaevis) and 19470 bp (S. maximus). The 99th quantile in other orders is highly variable: in T. nigroviridis, the value of the 99th quantile is 6002 bp (which is the lowest value) and increasing to 40237 bp in D. rerio.
DISCUSSION
Fishes are the largest group of vertebrates, with more than 35,000 species (Fricke et al., 2020). About 20 years ago, the first complete genome of the Japanese pufferfish, Takifugu rubripes, was sequenced (Aparicio et al., 2002). Since then, the number of fish species sequences has greatly increased, largely due to advances in sequencing technologies and assembly algorithms (Ravi & Venkatesh, 2018). According to NCBI, nearly 3% (915) of all fish genomes have been sequenced as of August 2022. However, these numbers will soon be outdated due to large-scale sequencing of fish species genomes. For example, the 10,000 Fish Genomes Project (Fish10K) (Fan et al., 2020) aims to sequence and obtain reference genomes from representative fish species and has recently begun using long-read sequencing and Hi-C technology for at least one representative species from all families to obtain better quality reference genomes. All the information obtained from Fish10K, together with genome sequences from other laboratories (such as the one presented in this study), will help improve breeding programmes and promote sustainable aquaculture in the future, as well as perform genome editing and genomic selection (Lu & Luo, 2020).
Using next-generation sequencing (NGS) technologies, we were able to sequence, assemble, and annotate the black flounder genome. The black flounder genome was sequenced with high coverage and 25,231 protein-coding genes were identified. This is slightly more than the 21,787 protein-coding genes discovered in Japanese flounder (Shao et al., 2017). The estimated genome size of black flounder is ∼538 Mbp, similar to the genome size of other flatfish species: Japanese flounder (534 Mb) (Shao et al., 2017), Turbot (568 Mb) (Figueras et al., 2016), Spotted halibut (556 Mb) (Zhao et al., 2021), ∼15% bigger than the tongue sole genome (477 Mb) (Chen et al., 2014), and ∼11% smaller compared to the Senegalese sole genome (∼612 Mb) (Manchado et al., 2016). It is recommended to use multiple complementary methods to assess the genome quality (Gurevich et al., 2013). In this study, we assessed the N50 sizes of contigs and scaffolds and the completeness of the genome using KOG and BUSCO (Benchmarking Universal Single Copy Orthologs). The results of quantitative measurements to determine assembly completeness using BUSCO showed a high percentage of conserved orthologs in turbot (Xu et al., 2020). High BUSCO values (94.7 %) were observed in Actinopterygii using the combined assemblies, to compensate for the lack of long reads. Future transcriptome and long reads sequencing studies are therefore required to provide a comparative, more comprehensive and quantitative overview of the level of completeness achieved.
Figure 3 shows an overlap of a total of 1677 putative orthologous groups exclusive for the three different flatfish families (Scophthalmidae, Cynoglossidae, and Paralichthyidae). It is important to note that a recent comparative genome analysis has examined the origins of flatfish body structure from an evolutionary perspective using the genomes of eight new species from fourteen families of Pleuronectiformes (Lü et al., 2021). This study will help to perform new comparative analyses in the future, including South American black flounder, once the annotated genomes become publicly available.
In this study, we performed a comparative analysis of 27 flatfish species in 10 orders to understand the small size of the black flounder genome. Genome size varies from very small genomes as in Tetraodon nigroviridis (∼350 Mb) (Neafsey and Palumbi, 2003) to large genomes as in Salmo salar (2967 Mb) (Yuan et al., 2018). As expected, the comparative analysis revealed that black flounder has one of the smallest genomes among the compared species, which is consistent with the results of previous studies in which flatfish genome sizes were among the smallest of all teleosts (Figueras et al., 2016; Xu et al., 2020). This observation is also consistent with the analysis of fish C-values (Figures 4 and 5). Indeed, the genomes of B. splendens (Anabantiformes), G. aculeatus (Perciformes, suborder Gasterosteiformes), C. semilaevis (Pleuronectiformes), T. rubripes, and T. nigroviridis (Tetraodontiformes) in the selected data set belong to groups with typically small genome size according to the average C-value (Figure 4). In addition, several C-values of Pleuronectioformes are within the 10th percentile of the distribution, especially in the families Paralichthyidae (including black flounder), Rhombosoleidae and Pleuronectidae. The large genome size in salmonids could be explained by a specific round (called 4R) whole-gene duplication event in this lineage (Lien et al., 2016). Although genome size diversity in teleosts could be the cause of the tremendous diversity of morphology, ecology, and behavior in this group (Volff, 2005), the origin of this dispersion is still unknown.
Genome size is not only affected by the teleost-specific rounds of whole-genome duplications (3R and 4R) (Meyer and Van de Peer, 2005). There are also other important events that seem to be involved, such as changes in the proportion of REs in genomes (Yun et al., 2018). Most of these REs come from specific sequences that replicate and move through the genome, called transposable elements (TEs) (Bourque, 2009). In animals, there is a correlation between genome size and the proportion of TEs (Canapa et al., 2015). Variation in genome size has been shown to be related to the amount of repetitive DNA in eukaryotic species (Kidwell, 2002), and this relationship has been clearly demonstrated in teleost fishes (Chalopin et al., 2015). The abundance of TEs varies with genome size and position in the fish tree of life (Shao et al., 2019). In this work, we compare the proportion of repetitive elements (REs): DNA transposons, long and short interspersed nuclear elements (LINEs and SINEs, respectively), and long terminal repeats (LTRs) in 27 fish species. In black flounder and the order of Pleuronectiformes in general, the proportion of all REs was low, which is probably one of the reasons for the small size of the black flounder genome. The low proportion of all REs contrasts with the high levels observed in zebrafish, as also noted in other studies (Lü et al., 2021). Thus, black flounder is consistent with the compact genomes observed in other flatfish genomes, with ∼5% TEs, slightly more than the 3% in pufferfish (Aparicio et al., 2002).
The number of exons and introns could also explain the genome size in teleost fish. To this end, the coordinates of all exons and introns in the genomes of twenty-one fish species, including P. orbignyanus, were determined to determine whether the small size of the P. orbignyanus genome was due to this. Results showed that P. orbignyanus has smaller gene and transcript sizes compared to all fish species studied, including groups with small genomes such as the Tetraodontiformes (Binerd et al., 2001) and some Perciformes (e.g., G. aculeatus) (Reid et al., 2021). Flow cytometric analyses revealed that in four pufferfish species of the family Tetraodontidae, genome size varies between 0.38 and 0.82 pg, whereas the sister family Diodontidae is larger (0.8-1 pg), likely due to DNA loss during divergence of the two families during evolution (Noleto et al., 2009). This is consistent with our study, which showed that all pleuronectiformes generally have a small genome size (0.3-1.1 pg).
In eukaryotes, intronic DNA is the major component of genes and genomes and plays a key role in gene regulation, and intron size is important from an evolutionary perspective (Zhang and Edwards, 2012). In teleosts, genome size and intron size are closely related, with intron size reflecting genome size (Jakt et al., 2022). To further investigate the origin of the small genome size of black flounder, we examined the distribution of total genome size of introns and exons for 27 species in 10 fish orders. Our analyses show that the mean and median exon sizes are smaller compared to intron sizes, suggesting a more compact distribution, as observed in other Pleuronectiformes species (Robledo et al., 2017). As observed in previous studies, no differences in exon sizes were detected between species (Li et al., 2017). Knowledge of exon loci is important for phylogenomic analyses (Hughes et al., 2021).
Based on the analyzed features, we concluded that the most important components that could be responsible for the reduced flounder genome are (i) the low frequency of repetitive elements, (ii) the reduced transcript size, which is most likely due to a lower number of exons in the whole genome and probably has a lower impact, (iii) the lower number of very large introns. The last two components (ii and iii) have a lower value than in other species with similar C-values, suggesting that this may be a novel genome reduction strategy.
In summary, in this study we generated a genome assembly of black flounder (Paralichthys orbignyanus). The results show a reduced genome size of flounder, and we have demonstrated that this is related to reduced transcript size due to a reduction in exon and intron sizes. Finally, our study also serves to investigate a strategy of genome reduction in teleost fishes.
FUNDING INFORMATION
This study was funded by projects grants by the Agencia Nacional de Promoción Científica y Tecnológica (ANPCYT, Argentina) to ASM (PICT-2017-2839) and partially funded by JV (MPCUdG2016/130).
AVAILABILITY OF SUPPORTING DATA
This sequencing data has been deposited at the National Center for Biotechnology Information (NCBI) where the Bioproject Accession ID is PRJEB36690.
DISCLOSURE STATEMENT
The authors have nothing to disclose.
Supplementary Fig S1
Correlation plots for C-values (as per Animal Genome size database or estimated from sequenced genome, when available, in x-axis) vs feature count (top, log scale), feature median size (center, bp) and feature 99th percentile size (bottom, bp). Features analyzed are Genes, Transcripts, Intron and Exons (in columns from left to right). Data corresponding to P. orbignyanus highlighted in red.
Supplementary Fig S2 Quantile_gene-transcript
For each species, gene (left) and transcript (right) size distribution nth percentiles (n 1, 5, 10, 20, 25, 30, 40, 50, 60, 75, 80, 90, 95 and 99) were computed (shown in y-axis size, bp in log scale). P. orbignyanus data (red) is shown in all plots for comparison. Six letter code for the species analyzed as described in Fig. 1.
Supplementary Fig S3 boxplot_protein
A. Complete fish proteome size distribution. Horizontal bar shows median, whereas dotted lines represent standard deviation, and diamonds represent mean value. Outliers are shown as filled circles. Six letter code for the species analyzed as described in Fig. 1. B. Kernel density estimation plot of protein size distribution in Pleuronectiformes (Cynsem, Scomax and Parorb), plus representatives from Tetraodontiformes (Tetnig) and Cypriniformes (Danrer).
Supplementary Fig S4 Quantile_exon_intron
For each species, exon (left) and intron (right) size distribution nth percentiles (n 1, 5, 10, 20, 25, 30, 40, 50, 60, 75, 80, 90, 95 and 99) were computed (shown in y-axis size, bp in log scale). P. orbignyanus data (red) is shown in all plots for comparison. Six letter code for the species analyzed as described in Fig. 1.
ACKNOWLEDGEMENTS
We thank the Laboratorio de Maricultura del Instituto Nacional de Investigación y Desarrollo Pesquero (INIDEP), Argentina, and E. Aristizabal for kindly providing tissue samples. The authors thank J.J. Boccanfuso and E. Ricci for their help in collecting fish samples or transporting them to Europe; M.B. Gomes Pardo for DNA extraction; and Prof. P. Martinez Portela for helpful comments on this manuscript. We would like to acknowledge the support of the Centro de Supercomputación de Galicia (CESGA) in the completion of this work. Finally, this article is dedicated to our colleague and friend Nicolás Stocchi, who recently passed away.