Haploid-resolved and chromosome-scale genome assembly in Citrus unshiu and its parental species, C. nobilis and C. kinokuni

Citrus, a member of the Rutaceae family, is a widely cultivated crop with numerous cultivars. In Japan, citrus fruits account for a significant portion of agricultural production. Although several new citrus varieties have been developed through conventional breeding programs, satsuma mandarin remains the dominant cultivar. In this study, chromosome-scale and haploid-resolved reference genome sequences of satsuma mandarin (Citrus unshiu Marc) and its parental varieties, kishu mandarin (C. kinokuni hort. ex Tanaka) and kunenbo mandarin (C. nobilis Lour. var. kunip Tanaka), were generated using long-read sequencing and Hi-C technologies. The comparison of haploid and unphased genomes revealed structural differences between them, indicating distinct regions in each haploid. In addition, genetic linkage maps were constructed, and genetic and physical distances were compared. The results showed variations in polymorphism density across different regions of the chromosomes. Together, the obtained results provide valuable insights into the genomic characteristics and structural variations of satsuma mandarin and related citrus varieties. These insights will lead to the further elucidation and improvement of citrus cultivars through genome breeding strategies.


Introduction
Citrus is a member of the family Rutaceae, which contains 1,900 species across 160 genera 1 . Citrus fruits are among the major cultivated crops worldwide. In Japan, the value of citrus fruit production was approximately 201 billion yen, ranking third among Japanese nonlivestock agricultural products 2 . Satsuma mandarin offers many superior characteristics that drive its cultivation and consumption: it is seedless, peels easily, matures early, and resists disease, while its productivity level is high and stable. Due to the long history of satsuma mandarin cultivation in Japan, numerous bud mutation cultivars and nucellus mutation cultivars have been developed. More than 150 mutant cultivars of satsuma mandarin, which feature early maturation, late maturation, high sugar content, and so on, are registered in the cultivar registration database of the Ministry of Agriculture, Forestry and Fisheries in Japan. Because these mutant cultivars possess key traits for improving citrus cultivars, it will be important to determine the genomic regions responsible for these mutations in order to promote genome breeding.
Satsuma mandarin is considered to have originated in Japan prior to 1600 AD 3 .
Several attempts to explore the origins of satsuma mandarin have been made based on the genotyping of molecular DNA markers. Fujii et al. (2016) reported that satsuma mandarin is a hybrid between kishu mandarin (C. kinokuni hort. ex Tanaka) as the seed parent and kunenbo mandarin (C. nobilis Lour. var. kunip Tanaka) as the pollen parent, based on trio analysis with nuclear single nucleotide polymorphism (SNP) markers and chloroplast DNA markers 4 . Using simple sequence repeat (SSR) marker analysis, Shimizu et al. (2016) 5 subsequently reported that kunenbo mandarin is an offspring of kishu mandarin crossed with an unidentified seed parent, indicating that satsuma mandarin is likely a back-crossed progeny of kishu mandarin.
To elucidate the molecular mechanisms underlying agriculturally important traits in satsuma mandarin and to transfer the superior traits to a new breeding cultivar, Kawahara et al. (2020) reported the genome sequence of satsuma mandarin using a hybrid de novo assembly of Illumina and PacBio sequence data and developed the Mikan genome database (MiGD) 6 .
The assembled genome of satsuma mandarin is 346 Mb and is predicted to possess 41,489 protein-coding genes in the draft genome sequences, with 9,642 specific genes not found in the genome of clementine mandarin (C. clementina hort. ex Tanaka). This sequence information and MiGD facilitate structural comparisons between satsuma mandarin and other citrus varieties, enabling the development of molecular DNA markers for Marker-Assisted Selection (MAS) in the Japanese breeding program. Additionally, they support cultivar identification technology to prove infringements on breeding rights for new cultivars [7][8][9] .
However, the existing unphased genome sequence likely lacks important structures unique to satsuma mandarin, because it was produced by a hybrid de novo assembly process that used both short (Illumina) and long (PacBio/Nanopore) reads and referred to the clementine genome sequence 10 . Here, we applied long-read sequencing (PacBio HiFi) and Hi-C technologies to generate chromosome-scale reference genome sequences of satsuma mandarin. Kishu mandarin and kunenbo mandarin were also sequenced to obtain precise haplotype information through parent-offspring trio phasing analysis. The phasing accuracy of the newly assembled satsuma mandarin genome sequence was measured by comparison with the genetic linkage maps generated from the F1 population obtained from the cross between kishu mandarin and kunenbo mandarin. Chromosome-scale reference genome sequences of satsuma mandarin can serve as a valuable resource for the Japanese breeding population, enabling accurate detection of structural variants among different mutant lines.

Materials and Methods
Plant materials
The genetic sources for genome sequencing were "Miyagawa wase", one  (Table S1).
For HiFi reads, DNA was extracted using the Genomic-tips Kit (Qiagen, Hilden, Germany)  20 . An Omni-C library was prepared using an Omni-C™ kit according to the Omni-C™ Proximity Ligation Assay nonmammalian Samples Protocol version 1.3 (Dovetail Genomics, Scotts Valley, CA, USA). Library sequencing was performed using the DNBSEQ-G400RS platform with a read length of 150 nt (Table S1).

Genome sequence assembly
The CKI and CKU genome sizes were estimated based on k-mer frequency analysis with short reads using Jellyfish ver. For the genome assembly of CKI and CKU, a Hi-C integrated assembly approach was employed using Hifiasm v0.16.1. This assembly method incorporates both the HiFi reads and the Omni-C reads.
To create chromosome-scale scaffolds, the CUN unphased contigs assembled by Hifiasm v0.13 were aligned to the C. clementina genome (Cclementina_182_v1) 10 . Contigs assembled by other methods were aligned to the unphased chromosome-level scaffolds of CUN. The quality of the assembled sequences was assessed by benchmarking universal single-copy ortholog (BUSCO) sequences using BUSCO v3.0 24 .

Linkage map construction
A total of 96 F1 progeny of the KKN mapping population were used to construct a linkage map to compare the physical and genetic distances on the CUN genome. Genomic DNA of these materials was isolated from the fully expanded fruit and peel tissues using the

Iso-Seq analysis
Total RNA was extracted from eight tissues of CUN using the RNeasy Mini kit (Qiagen) for Iso-Seq analysis. Libraries were constructed using Iso-Seq™ Express Template Preparation for Sequel® and Sequel II Systems (Pacific Biosciences) and sequenced using the Sequel II system. The obtained reads were clustered using the Iso-Seq 3 pipeline implemented in SMRT Link, mapped to the unphased sequence of the 'Miyagawa Wase' genome with Minimap2 28 , and collapsed to obtain nonredundant isoform sequences using a module in Cupcake ToFU (https://github.com/Magdoll/cDNA_Cupcake).

Genome size estimation
WGS reads were obtained for CKI, CKU, and CUN (Table S1). Of these, 82.0 Gb of CKI reads and 65.8 Gb of CKU reads were used in this study. The distribution of distinct k-mers (k=17) shows two clear peaks in both CKI and CKU, indicating high heterozygosity in both accessions (Fig. 1). In the case of CKU, the peak with the higher multiplicity value (158) had a lower peak height than the peak with the lower value (79). CKI exhibited the opposite pattern: the peak with the higher multiplicity value (206) had a higher peak height than the peak with the lower value (103). CKI and CKU was slightly larger than in those previous reports.
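As a rough illustration of how a 1C genome size can be derived from such a k-mer distribution, the sketch below divides the total number of k-mer occurrences by the multiplicity of the homozygous (higher-multiplicity) peak. This is a common general approach and not necessarily the exact procedure used in this study. The function names, the error cutoff `min_multiplicity`, and the toy histogram counts are assumptions made for illustration; only the peak positions (79 and 158 for CKU, 103 and 206 for CKI) are taken from the text.

```python
def estimate_genome_size(histogram, homozygous_peak, min_multiplicity=5):
    """Estimate the 1C genome size (approximately in bases) from a k-mer histogram.

    histogram: dict mapping k-mer multiplicity -> number of distinct k-mers
    homozygous_peak: multiplicity of the homozygous (higher) peak
    min_multiplicity: k-mers seen fewer times are treated as sequencing errors
                      (hypothetical cutoff, chosen only for this sketch)
    """
    if homozygous_peak <= 0:
        raise ValueError("homozygous_peak must be positive")
    # Total k-mer occurrences, ignoring the low-multiplicity error tail.
    total_kmers = sum(
        multiplicity * count
        for multiplicity, count in histogram.items()
        if multiplicity >= min_multiplicity
    )
    return total_kmers / homozygous_peak


def peak_ratio(heterozygous_peak, homozygous_peak):
    """Ratio of the two peak positions; about 2 is expected for a diploid."""
    return homozygous_peak / heterozygous_peak


# Toy histogram with invented counts (not data from this study).
toy_histogram = {1: 50_000, 79: 1_000, 158: 2_000}
toy_size = estimate_genome_size(toy_histogram, homozygous_peak=158)

# Peak positions reported in the text for CKU and CKI.
cku_ratio = peak_ratio(79, 158)
cki_ratio = peak_ratio(103, 206)
print(toy_size, cku_ratio, cki_ratio)
```

In both accessions the higher peak sits at exactly twice the multiplicity of the lower one, which is the pattern expected when the lower peak corresponds to heterozygous k-mers and the higher peak to homozygous k-mers.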

C. unshiu CUN genome assembly
A total length of 15.4 Gb of PacBio HiFi reads was generated from one SMRT cell.

CKI and CKU genome assemblies
Total lengths of 11.1 Gb and 9.5 Gb of PacBio HiFi reads were obtained from the CKI and CKU genomes, respectively (Table S1). The estimated read coverages against the 1C genome size were 31.3x in CKI and 25.4x in CKU. One unphased and two haploid genomes were created for CKI and CKU by Hi-C integrated assembly using Hifiasm v0. 16 These lengths were also less than the estimated genome size of CKU, which was 372.2 Mb.
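The coverage figures above follow from dividing the total read length by the estimated 1C genome size. The short sketch below reproduces that arithmetic for CKU using the rounded values quoted in the text (9.5 Gb of reads, 372.2 Mb genome size). Because the inputs are rounded, the result is expected to differ slightly from the reported 25.4x. The function name is an assumption for illustration.

```python
def read_coverage(total_read_bases, genome_size_bases):
    """Average read depth: total sequenced bases divided by 1C genome size."""
    if genome_size_bases <= 0:
        raise ValueError("genome_size_bases must be positive")
    return total_read_bases / genome_size_bases


# Rounded values quoted in the text for CKU.
cku_read_bases = 9.5e9      # 9.5 Gb of HiFi reads
cku_genome_size = 372.2e6   # estimated 1C genome size, 372.2 Mb

cku_coverage = read_coverage(cku_read_bases, cku_genome_size)
print(f"CKU coverage: {cku_coverage:.1f}x")
```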
The completeness of the six assembled sequences varied between 97.7% and 98.3% based on the presence of complete BUSCOs.  compared to the other sequences. Chr4 exhibited the least variation in sequence length among the assemblies. It is currently unknown whether these differences in sequence length reflect the chromosome structure or assembly errors. A comparison between chromosomes revealed that Chr3 had the longest sequence length, followed by Chr5.  (Table S4).

Chromosome-level scaffold lengths of the assembled CUN, CKI, and CKU genomes
Both the CKI and CKU linkage maps showed a consistent and minimal discrepancy between the physical positions and linkage positions in Chr1 (Fig. 3). Furthermore, variants were mapped throughout the entire chromosome, suggesting the presence of

Structural comparison between CUN and CKI or CKU genomes
The genome structures of the two haploid genomes created in CUN were compared with that of the unphased genome (Fig. 4). In Chr1 and Chr4, the CKI and CKU haploids exhibited structures similar to that of the unphased genome. In light of the smaller number of inconsistencies between the physical and genetic positions in the linkage maps (Fig. 3), it was considered that the genome structures of the CKI and CKU haploid genomes on these chromosomes are similar to each other. On the other hand, in Chr2, Chr5, and Chr9, the unphased genome showed higher structural similarity to the CKU haploid than to the CKI haploid, with deletions and duplications observed in the CKI haploid. In Chr3, Chr7, and Chr8, the unphased genome differed in structure from both haploids. It was considered that these chromosomes have distinct structures between the two haploids, and that assembling sequences derived from both haploids into a chimeric unphased sequence likely produced regions that diverge from each haploid. The CKI and CKU genomes were then compared with the CUN CKI or CKU haploid (Fig. 5). Interestingly, the CKU haploid in CUN and the CKU genomes exhibited larger sequence variations compared to the CKI haploid and CKI genomes. This likely reflects the higher heterozygosity of the CKU genome estimated from the Jellyfish analysis (Fig. 1). For example, in Chr3 and Chr7, CKU hap1 showed a high degree of similarity to the CUN CKU haploid, whereas in Chr5, hap2 exhibited a higher degree of similarity than hap1. This suggests that Chr3 and Chr7 in the CUN CKU haploid were inherited from hap1, while Chr5 was inherited from hap2. Similarly, the two haploids in the CKI genome exhibited varying degrees of similarity to the CKI haploid of CUN, with the level of similarity varying across chromosomes. For example, hap1 showed higher similarity than hap2 in Chr1, 2, 3, and 8, while hap2 exhibited higher similarity than hap1 in Chr4, 7, and 9.
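The reasoning above, in which the parental haplotype most similar to the CUN haploid on a given chromosome is taken as the likely source, can be sketched as a simple per-chromosome comparison. The function name and the similarity scores below are hypothetical placeholders chosen only so that their ordering matches the CKU pattern described in the text (hap1 for Chr3 and Chr7, hap2 for Chr5); they are not measured values, and the study's comparison was based on the structural alignments in Fig. 5 rather than on a single score.

```python
def assign_inherited_haplotype(similarity):
    """Pick the parental haplotype with the higher similarity per chromosome.

    similarity: dict mapping chromosome name -> (hap1_score, hap2_score),
                where a higher score means closer to the CUN haploid.
    Returns a dict mapping chromosome name -> "hap1", "hap2", or "undetermined".
    """
    calls = {}
    for chrom, (hap1_score, hap2_score) in similarity.items():
        if hap1_score > hap2_score:
            calls[chrom] = "hap1"
        elif hap2_score > hap1_score:
            calls[chrom] = "hap2"
        else:
            calls[chrom] = "undetermined"  # tie: no call is made
    return calls


# Placeholder scores (invented); only their ordering follows the text.
cku_similarity = {
    "Chr3": (0.99, 0.95),
    "Chr5": (0.94, 0.99),
    "Chr7": (0.98, 0.93),
}
cku_calls = assign_inherited_haplotype(cku_similarity)
print(cku_calls)
```

A call that switches between hap1 and hap2 from one chromosome to the next, as described here, is what would be expected if the hap1 and hap2 labels of the parental assembly are assigned independently for each chromosome.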

Iso-Seq analysis
A total of 7,556,520 Iso-Seq sequences were obtained from eight different organs of 'Miyagawa Wase' (Table S5). The number of sequences obtained varied per organ, ranging from 801,394 to 1,043,903. Analysis using Iso-Seq 3 resulted in 453,646 high-quality (HQ) reads and 161 low-quality (LQ) sequences. These sequences can be utilized for gene prediction and other analyses in the future assembly of genome sequences.

Conclusion
In this study, we performed de novo whole-genome assembly in C. unshiu CUN and its parental accessions, CKI and CKU, resulting in chromosome-scale haploid-resolved genomes. A genome structure comparison revealed that the CKU haploid and genome exhibited greater sequence variation compared to the CKI haploid and genome, reflecting higher heterozygosity. Iso-Seq analysis provided a substantial number of high-quality reads for future genome assembly and gene prediction. The obtained results provide valuable insights into the genomic characteristics and genetic relationships among citrus varieties.

Data Availability
The sequence reads are available from the DNA Data Bank of Japan (DDBJ) Sequence Read