Chromosome-level genome assembly of a butterflyfish, Chelmon rostratus

Chelmon rostratus (Teleostei, Perciformes, Chaetodontidae) is a copperband butterflyfish that lives in tropical areas. Like other coral reef fish, it can be cultivated as an ornamental fish. Because no genome was previously available for Chaetodontidae, genome information for this species may help in understanding the genome evolution of the family and the adaptation and evolution of coral reef fish. In this study, we sequenced and assembled a draft genome of Chelmon rostratus. Using stLFR linked-read data, we assembled a genome of 638.88 Mb with contig and scaffold N50 sizes of 294.41 kb and 2.62 Mb, respectively. Repetitive sequences made up 21.47% of the genome, and 21,375 protein-coding genes were annotated. Of these annotated protein-coding genes, 20,163 (94.33%) were assigned possible functions. As the first genome for the family Chaetodontidae, these data provide a resource for further understanding and exploring symbiosis with coral in the marine ecological environment, as well as the genomic innovations and molecular mechanisms that contribute to the species' unique morphology and physiological features.
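For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is commonly computed: sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are illustrative assumptions and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)  # longest sequence first
    half_total = sum(ordered) / 2            # half of the total assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # prints 70
```

The same function applies to contigs or scaffolds; only the input list of lengths changes.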

prediction approaches, Augustus [25] and GlimmerHMM [26] were used, with Danio rerio as the species for the HMM model, to predict gene models. For homology-based annotation, four homologous species (Pundamilia nyererei, Maylandia zebra, Astatotilapia calliptera and Perca flavescens) were aligned against the genome assembly using BLAT (version 0.36) [27] and GeneWise (version 2.4.1) [28]. Combining the different lines of evidence with Glean (version 1.0) [29] yielded 21,375 protein-coding genes. In the final gene models, the average gene length was 16,183.81 bp, with an average of 10 exons per gene. The average length of coding sequences, exons and introns were 179.10 bp and 1,599.48 bp, respectively, similar to that of the other released fish genomes, such as Astatotilapia calliptera, Maylandia zebra, Perca flavescens and Pundamilia nyererei [30][31][32] (Supplementary Table 4, Supplementary Figure 2). Gene annotation of the mitochondrial genome was performed using MitoZ [14], and 13 protein-coding genes as well as 22 tRNA genes were annotated.
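The gene-structure averages reported above can be cross-checked with simple arithmetic. The sketch below assumes that the two reported values, 179.10 bp and 1,599.48 bp, are the mean exon and mean intron lengths (the text does not label them individually), and that a gene with n exons contains n - 1 introns. Multiplying averages only approximates the true mean gene length, so this is a rough consistency check rather than a reconstruction of the annotation pipeline.

```python
# Values taken from the text; their assignment to exons and introns is an assumption.
mean_exons_per_gene = 10
mean_exon_bp = 179.10
mean_intron_bp = 1599.48
reported_gene_bp = 16183.81


def expected_gene_length(n_exons, exon_bp, intron_bp):
    """Approximate gene length: n exons separated by n - 1 introns."""
    return n_exons * exon_bp + (n_exons - 1) * intron_bp


estimate = expected_gene_length(mean_exons_per_gene, mean_exon_bp, mean_intron_bp)
relative_gap = abs(estimate - reported_gene_bp) / reported_gene_bp

print(f"estimated gene length: {estimate:.2f} bp")
print(f"relative difference from reported value: {relative_gap:.4%}")
```

Under these assumptions the estimate falls within a fraction of a percent of the reported average gene length.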
Functions of the annotated protein-coding genes were inferred by searching for homologs in the KEGG, COG, NR, SwissProt and InterPro databases [33][34][35][36][37]. In this way, 18,005 (84.23%), 7,343 (34.35%), 20,141 (94.23%), 19,114 (89.42%) and 19,313 (90.35%) of protein-coding genes had their homologous alignment in InterPro and SwissProt databases, respectively. The remaining 1,212 (5.67%) protein-coding genes with unknown function may represent features specific to the Chelmon rostratus genome (Supplementary Table 5 ... and Homo sapiens as outgroup) [24,38-47]. The protein-coding genes of the twelve species were clustered into 18,502 gene families using TreeFam [48-50]. Among these gene families, 2,301 were single-copy gene families (one copy in each of these species) (Figure 4B). The 21,375 protein-coding genes of Chelmon rostratus were classified into 13,797 gene families, giving an average of 1.41 genes per gene family. This number of gene families is similar to those of Lepisosteus oculatus (13,967) and Gasterosteus aculeatus (13,331), but differs more from those of Larimichthys crocea (14,724), Homo sapiens (14,578) and Takifugu rubripes (12,554) (Supplementary Table 6). Among the clustered gene families from Chelmon rostratus, 8,190 gene families were common to at least one of the other species, and the remaining 52 gene families were unique. The numbers of shared and unique genes among four species (Chelmon rostratus, Danio rerio, Takifugu rubripes and Larimichthys crocea) are shown in Figure 4C. To understand the function of these gene families, we further performed GO enrichment analysis of the gene families from Chelmon rostratus, compared with the other 11 species. The results showed that the unique gene families of Chelmon rostratus were enriched in muscle contraction functions (Supplementary Table 7).
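The single-copy and species-specific categories described above can be expressed as simple filters over a table of gene counts per family and species. The sketch below is illustrative only: the function names, family identifiers and species labels are hypothetical, and it does not reproduce the TreeFam clustering step itself.

```python
def single_copy_families(family_counts, species):
    """Families with exactly one gene in every listed species."""
    return sorted(
        fam for fam, counts in family_counts.items()
        if all(counts.get(sp, 0) == 1 for sp in species)
    )


def species_specific_families(family_counts, target):
    """Families with genes in the target species and in no other species."""
    return sorted(
        fam for fam, counts in family_counts.items()
        if counts.get(target, 0) > 0
        and all(n == 0 for sp, n in counts.items() if sp != target)
    )


# Hypothetical toy table: family -> {species: number of genes}.
species = ["C_rostratus", "D_rerio", "T_rubripes"]
toy_counts = {
    "fam_A": {"C_rostratus": 1, "D_rerio": 1, "T_rubripes": 1},
    "fam_B": {"C_rostratus": 3, "D_rerio": 1, "T_rubripes": 0},
    "fam_C": {"C_rostratus": 2},
    "fam_D": {"D_rerio": 1, "T_rubripes": 1},
}

print(single_copy_families(toy_counts, species))
print(species_specific_families(toy_counts, "C_rostratus"))
```

In this toy table only fam_A is single-copy across all species, and only fam_C is specific to the target species.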
Phylogenetic analysis was performed using the concatenated sequence alignment of the 2,301 single-copy genes shared by the twelve species. PhyML (version 3.0) [51], which is based on the maximum likelihood method, was used to construct the phylogenetic tree. The split between Chelmon rostratus and Larimichthys crocea was estimated at ~92 million years ago (Figure 4A). Based on the similarity of the protein sequences, 483 syntenic blocks were identified using MCScanX (version 0.8) [52] (Supplementary Table 8). The timing of duplication and divergence events in these species was calculated from the distribution of the synonymous mutation rate for gene pairs in the paralogous syntenic blocks; no whole-genome duplication (WGD) event was detected in the Chelmon rostratus genome (Figure 3).
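The concatenation step that precedes tree building can be sketched as follows. The example assumes that each single-copy gene has already been aligned across all species, so that all sequences within one gene alignment have the same length. The function name, species labels and toy sequences are hypothetical, and the tree inference performed by PhyML is not shown.

```python
def concatenate_alignments(gene_alignments, species):
    """Join per-gene alignments into one supermatrix row per species.

    gene_alignments: list of dicts mapping species -> aligned sequence.
    Every species must be present in every gene (single-copy genes only).
    """
    parts = {sp: [] for sp in species}
    for index, alignment in enumerate(gene_alignments):
        lengths = {len(alignment[sp]) for sp in species}
        if len(lengths) != 1:
            raise ValueError(f"gene {index}: aligned sequences differ in length")
        for sp in species:
            parts[sp].append(alignment[sp])
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Hypothetical toy alignments for two genes and three species.
species = ["sp1", "sp2", "sp3"]
toy_genes = [
    {"sp1": "MKT-", "sp2": "MKTA", "sp3": "MRTA"},
    {"sp1": "LLV", "sp2": "LIV", "sp3": "L-V"},
]

for name, row in concatenate_alignments(toy_genes, species).items():
    print(name, row)
```

Every row of the resulting supermatrix has the same length, which is what maximum likelihood tree builders expect as input.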

Analysis of gene family expansion and contraction may reveal the evolutionary dynamics of gene families and thus provide clues for understanding the diversity of different species. Expansion and contraction are often inferred from the number of genes in each gene family and the phylogenetic tree. In our study, we used CAFE (version 2.1) [53] to analyze the expansion and contraction of the clustered gene families (Figure 4D). In total, 18,498 gene families from the most recent common ancestor (MRCA) were identified. Compared with the most recent common ancestor of Chelmon rostratus and Larimichthys crocea, 793 gene families were expanded, and the majority of the expanded gene families were found to be involved in synapse organization (Supplementary Table 9). In contrast, 2,962 gene families were contracted; these were involved in immune system process (Supplementary Table 10).
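The idea of comparing family sizes between an ancestor and an extant species can be illustrated with a simple counting sketch. This is not the method implemented in CAFE, which fits a stochastic birth and death model along the phylogenetic tree and infers ancestral family sizes; here the ancestral counts are simply assumed to be known. All names and numbers below are hypothetical.

```python
def classify_families(ancestral_counts, extant_counts):
    """Label each family by comparing extant size with an assumed ancestral size."""
    groups = {"expanded": [], "contracted": [], "unchanged": []}
    for fam in sorted(ancestral_counts):
        ancestral = ancestral_counts[fam]
        extant = extant_counts.get(fam, 0)  # a family absent today counts as size 0
        if extant > ancestral:
            groups["expanded"].append(fam)
        elif extant < ancestral:
            groups["contracted"].append(fam)
        else:
            groups["unchanged"].append(fam)
    return groups


# Hypothetical family sizes at the ancestral node and in the extant species.
toy_ancestral = {"fam_A": 2, "fam_B": 4, "fam_C": 1}
toy_extant = {"fam_A": 5, "fam_B": 1, "fam_C": 1}

print(classify_families(toy_ancestral, toy_extant))
```

A likelihood-based tool additionally tests whether a size change is larger than expected by chance, which this counting sketch does not attempt.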

Data Records
Raw reads from BGISEQ-500 sequencing are deposited in the CNGB Nucleotide

Technical Validation
To evaluate the genome assembly, we aligned the previously filtered sequencing data using SOAPaligner (version 2.2) [10] and found that 90.76% could be mapped back to the assembled genome. We also calculated the GC depth to rule out possible biases during sequencing or possible contamination. The average GC content of this genome was ~41.0%, and we found a continuous GC ... was estimated with Benchmarking Universal Single-copy Orthologs (BUSCO, version 3.0.1) [54]. BUSCO analysis showed that 98.1% (2,579) of core genes were found in our assembly, with 2,512 (97.1%) single-copy and 27 (1.0%) duplicated (Supplementary Table 11), indicating good coverage of the genome. To validate the quality of predicted gene sets, we also assessed the completeness using