The structure of the tetraploid sour cherry ‘Schattenmorelle’ (Prunus cerasus L.) genome reveals insights into its segmental allopolyploid nature

Sour cherry (Prunus cerasus L.) is an economically important allotetraploid cherry species believed to have evolved in the Caspian Sea and Black Sea regions. How, when and where exactly the evolution of this species took place is unclear. It resulted from a hybridization of the tetraploid ground cherry (Prunus fruticosa Pall.) and an unreduced (2n) pollen of the diploid ancestor sweet cherry (P. avium L.). Some indications implement that the genome of sour cherry is segmental allopolyploid, but how it is structured and to what extent is unknown. To get an insight, the genome of the sour cherry cultivar ‘Schattenmorelle’ was sequenced at ~400x using Illumina NovaSeqTM short-read and Oxford Nanopore long-read technologies (ONT R9.4.1 PromethION). Additionally, the transcriptome of ‘Schattenmorelle’ was sequenced using PacBio Sequel II SMRT cell sequencing at ~300x. The final assembly resulted in a ~629 Mbp long pseudomolecule reference genome, which could be separated into two subgenomes each split into eight chromosomes. Subgenome PceS_a which originates from P. avium has a length of 269 Mbp, whereas subgenome PceS_f which originates from P. fruticosa has a length of 299.5 Mbp. The length of unassembled contigs was 60 Mbp. The genome of the sour cherry shows a size-reduction compared to the genomes of its ancestral species. It also shows traces of homoeologous sequence exchanges throughout the genome. Comparative positional sequence and protein analyses provided evidence that the genome of sour cherry is segmental allotetraploid and that it has evolved in a very recent event in the past.

comparison of genetic position and physical position of up to 1,856 markers of the 184 five genetic sour cherry maps (Table S5a)  whereas the mitochondrial sequence contained 188 genes, 3 rRNAs and 152 tRNAs 189 ( Figure S2). An ab-initio and homology-based gene prediction with 14 reference 190 species was performed (IAA). Based on the homology prediction, thirty-four percent 191 of the proteins showed the highest IAA towards Prunus fruticosa and 17.9% towards 192 P. avium. Only 5.2% of the proteins showed no IAA to any of the used reference 193 datasets used, which was due to ab-initio prediction. The data is summarized in 194  (Table 2). 198 A total number of 26,532 shared transcripts were found between the two subgenomes 199 PceS_a and PceS_f and the genomes of PfeH and PaT. Thirty-eight percent of the P. 200 cerasus proteins had a greater IAA to PfeH, whereas 54% showed a greater IAA to PaT. 201 Eight percent showed an identical IAA to both ancestral species. A larger number of 202 transcripts of both sour cherry subgenomes (Table 2) were assigned to the annotation 203 data set of PfeH. A total of 13,425 transcripts from the PceS_a subgenome and 13,107 204 from the PceS_f subgenome were found in the genome sequences of PfeH and PaT. 205 Seventy-five percent of the pool from the PceS_a subgenome showed a higher IAA to 206 PaT and 17% to PfeH, while 59% from the pool originating from the PceS_f subgenome 207 showed a higher IAA to PfeH and 32% to PaT. 208 209

Identification of syntenic regions and inversions 210 211
The sequences of the two subgenomes PceS_a and PceS_f and the genotypes PaT 212 and PfeH of the two ancestral species P. avium and P. fruticosa were screened for 213 duplicated regions using DAGchainer as previously published for peach (Verde et al. 214 2012). The seven major triplicated regions were found nearly one to one in P. avium but not in P. fruticosa, which lacked regions 4 and 7 corresponding to Verde et al. 216 (2012). P. avium and P. fruticosa seem to derive from the same paleohexaploid event 217 like peach, but with a loss of the fourth and seventh paleoset of paralogs in P. 218 fruticosa. The graphical analysis is summarized in figure S7). 219 220 Thirteen inversions were detected through positional co-linearity comparison between 221 the two subgenomes using the molecular markers from the 9+6k SNP array ( Figure  222 S8). Five inversions were found between subgenome PceS_a and the genome 223 sequence of PaT. Eleven inversions were found between sugenome PceS_f and PfeH 224 (Table S6). By comparing the position of amino acid sequences of orthologous 225 proteins (synteny), we found 21 inversions when comparing PceS_f with PfeH. Only 226 seven were found between PceS_a and PaT and 16 between both subgenomes PceS_a 227 and PceS_f ( Figure S9). 228 229

Detection of de novo homoeologous exchanges 230
For the detection of de novo homoeologous exchanges we used three approaches by 231 comparing inter-and intraspecific %-covered bases (genomic and transcriptomic) 232 and %-IAA between proteins of PceS to PaT and PfeH ( Figure 4, Figure S10, S11). PceS 233 short-reads were mapped against PaT and PfeH and only species specific reads (PaT 234 and PfeH) were filtered into read-subsets. The obtained read-subsets were re-mapped 235 against PceS_a and PceS_f and base coverage was calculated. A total of 1024 regions 236 (100k window) were intraspecific %-covered bases from mapped reads (PceS_a to 237 PaT, PceS_f to PfeH) was < than interspecific %-covered bases from mapped reads 238 (PceS_a to PfeH, PceS_f to PaT) were discovered. In a second approach, translocations 239 between the two subgenomes were localized by a short-read mapping analyses. species groups P. persica and P. mume diverged from the P. 281 yedonensis/P.avium/P.fruticosa group 11.6 Mya. Based on this model, the divergence 282 of the two subgenomes of P. cerasus compared to the genome sequences of PaT and 283 PfeH was estimated with 2.93 Mya and 5.5 Mya respectively ( Figure 5B). 284

Discussion 285
The genome of the economically most important sour cherry 'Schattenmorelle' in 286 Europe was sequenced using a combination of Oxford Nanopore R9.4.1 PromethION 287 long read technology and Illumina NovaSeq TM short read technology. After 288 assignment of the long-reads to the two subgenomes and Hi-C analysis, the final 289 assembly was 629 Mbp. 290 This sequence was used to study structural changes present in the allotetraploid sour 291 cherry genome after its emergence. Therefore, the sour cherry genome sequence was 292  Additional mapping of transcriptomes from six other Prunus species indicates that no 358 major introgression from one of these species occurred. Using 14 reference species, 359 60,123 gene models were annotated. Almost the same number was assigned to the 360 two P. cerasus subgenomes (Table 2). No evidence was found for large introgressions 361 from any of the reference species (Note S2, S3). By comparing the amino acid identity 362 of the proteins of PaT and PfeH with the respective sour cherry subgenome, the 363 identified translocations via read mapping could be confirmed. The majority of the 364 transcripts (51%) could be assigned to the genotypes PaT and PfeH of the two ancestral 365 species P. avium and P. fruticosa ( Figure S6). 5.2% of the transcripts could not be 366 assigned to any of the reference species. Only < 1% of the transcripts could not be 367 assigned to one of the ancestral species. They showed equivalent matches to both 368 species and are probably a product of ab-initio prediction. A total of 49,698 proteins 369 in subgenome PceS_a and 48,576 proteins in PceS_f shared only 13,435 and 13,107 370 proteins with PaT and PfeH, respectively. A total of 75% of the proteins of subgenome 371 PceS_a matched better to PaT compared to PfeH, whereas only 59% of PceS_f mapped better to PfeH than to PaT (Table 2). Subgenome PceS_a seems to be closer to PaT than 373 PceS_f to PfeH. This was confirmed by an evolutionary approach that calculated the 374 separation of the subgenome PceS_f from P. fruticosa 5.5 Mya, while subgenome 375 PceS_a separated from P. avium 2.93 Mya ( Figure 5B). 376 To validate these results, the insertion events of long terminal repeats between P. Assuming that a generation change is to be expected after 10 to 60 years (Besford et 390 al. 1996), this would correspond to a time period of ~ 1 to 6 Mya. The youngest co-391 occurring LTR could be estimated at 1.9 Mya. This suggests that P. fruticosa and P. to genome alignments, and with gene structure information derived from Cupcake 506 transcripts using GeneMarkS-T. This generated a gene set that consists of ab initio 507 and evidence supported predictions. A separate gene set was generated with 508 BRAKER2, which uses protein to generate a gene set. We used the OrthoDB version predictions of each reference species were combined with TSEBRA predictions using 520 the GeMoMa module GAF and subsequently, UTRs were predicted in a two-step 521 process based on mapped Iso-seq and RNA-seq data using the GeMoMa module 522 AnnotationFinalizer (supplements 1.4.5). First, UTRs were predicted based on Iso-seq 523 data. Second, UTRs were predicted based on RNA-seq data for gene predictions 524 without UTR prediction from the first step. An assembly hub for visualization of the 525 Prunus cerasus genome with structural annotation was generated using MakeHub 526 (Hoff, 2019; supplements 2). The functional annotation was performed with the Galaxy 527 subhirtella (SRX14816145). Reads were adapter-and quality trimmed using the 568 software Trim Galore (version 0.6.3, parameters --quality 30 -length 50). Trimmed 569 reads were mapped against the P. cerasus subgenomes PceS_a and PceS_f using 570 STAR (version 2.7.8a, parameter --twopassMode Basic). The subsequent analysis 571 was performed in accordance to Keilwagen et al. (2022). The PceS genome was 572 divided into 250k windows. The percentage of covered bases using RNAseq data of 573 P. cerasus (SRX14816146, SRX14816142, SRX14816138) was estimated at a depth 574 of 1 for each window. The same was done with all other RNAseq data sets. The 575 percentage of covered bases from P. avium (SRX14816143) was subtracted from the 576 percentage of covered bases from P. cerasus (SRX14816146, SRX14816142, 577 SRX14816138). The same was done using the reads of P. fruticosa (SRX14816141). 578 For subgenome PceS_a it is expected that the intraspecific difference for transcripts 579 of data set P. avium (SRX14816143) is lower (close to 0) than the interspecific 580 difference for transcripts of data set P. fruticosa (SRX14816141) and vice versa. 581 Opposite cases indicate potential homoeologous exchanges between the two 582 subgenomes PceS_a and PceS_f and were plotted into a circos plot ( Figure S11). 583 The nucleotide short reads from PceS were mapped against the genomes of the two 584 ancestral species PaT and PfeH. Subsequently, the mapped reads were filtered using 585 samtools for mapped reads in proper pair (-f 3) and primary alignments and not 586 supplementary alignment (-F 2304). Those reads were divided into four groups 587 according to the following criteria: 1. unique match to PaT: 2. unique match to PfeH; 3. 588 match to PaT and PfeH; 4. no match to PaT and PfeH (unique to PceS). The first two 589 separated read sets were then re-mapped against the subgenomes PceS_a and 590 PceS_f. The percentage of covered bases was calculated for a 100 k window. For the 591 subgenomes of Pces, the percentage of intraspecific covered bases (PceS_a to PaT,592 PceS_f to PfeH) should be higher compared to the percentage of interspecific covered 593 bases (PceS_a to PfeH, PceS_f to PaT). The opposite case indicates possible of the Schattenmorelle genome assembly were determined that are uniquely covered 596 by PaT and PfeH filtered read sets. 597

LTR insertion estimation 599
The difference (identity) of left and right LTR was calculated using the script 600 EDTA_raw.pl from the software EDTA version 1.9 (https://github.com/oushujun/EDTA, We would acknowledge the Galaxy Europe, Galaxy USA server administration for 651 support and provision of resources. Special thanks to Eric Lyon for support with 652 Synmap2. 653