Pangenome of cultivated beet and crop wild relatives reveals parental relationships of a tetraploid wild beet

Most crop plants, including sugar beet (Beta vulgaris subsp. vulgaris), suffer from domestication bottlenecks and low genetic diversity caused by extensive selection for few traits. However, crop wild relatives (CWRs) harbour useful traits relevant for crop improvement, including enhanced adaptation to biotic and abiotic stresses. Especially polyploids are interesting from an evolutionary perspective as genes undergo reorganisation after the polyploidisation event. Through neo-and subfunctionalisation, novel functions emerge, which enable plants to cope with changing environments and extreme/harsh conditions. Particularly in the face of climate change, specific stress and pathogen resistances or tolerances gain importance. To introduce such traits into breeding material, CWRs have already been identified as an important source for sustainable breeding. The identification of genes underlying traits of interest is crucial for crop improvement. For beets, the section Corollinae contains the tetraploid species Beta corolliflora (2n=4x=36) that harbours salt and frost tolerances as well as a wealth of pathogen resistances. The number of beneficial traits of B. corolliflora is increased compared to those of the known diploids in this section (all 2n=2x=18). Nevertheless, neither the parental relationships of B. corolliflora have been resolved, nor are genomic resources available to steer sustainable, genomics-informed breeding. To benefit from the resources offered by polyploid beet wild relatives, we generated a comprehensive pangenome dataset including B. corolliflora, Beta lomatogona, and Beta macrorhiza, as well as a more distant wild beet Patellifolia procumbens (2n=2x=18). Joined analyses with publicly available genome sequences of two additional wild beets allowed the identification of genomic regions absent from cultivated beet, providing a sequence database harbouring traits relevant for future breeding endeavours. In addition, we present strong evidence for the parental relationship of the B. corolliflora wild beet as an autotetraploid emerging from B. macrorhiza.


172
To explore the power of the wild beet genome data and assemblies for deducing and verifying the 173 tetraploid parentage of B. corolliflora, we deployed five different computational approaches. Some of 174 these approaches are based directly on Illumina reads as input data (read-based approaches) and 175 are therefore not dependent on any assembly quality parameters. 176 The similarity of the genome sequences of two species reflects the distance of their relationship. In 177 turn, the similarity of two sequences is reflected by the similarity of their k-mer sets. Essentially, the 178 set of k-mers of a sequence equals a compact representation of that sequence. Additionally, 179 comparing k-mer sets is assumed to be more robust than directly comparing sequences considering 180 assembly errors, e.g. at repetitive regions. The horizontal bar plot ( Figure 3A). This means that B. macrorhiza genes are substantially more often found in a common 266 phylogenetic unit together with the respective B. corolliflora gene, whereas B. lomatogona genes are 267 often found on a separate branch in the phylogenetic tree ( Figure 3B, 3C). 268 To gain more insight, phylogenetic distances in trees in which the B. macrorhiza gene is not clustered 269 with the closest B. corolliflora gene were further investigated. Branch lengths in such trees are 270 particularly small and B. corolliflora, B. lomatogona, and B. macrorhiza sequences of the respective 271 genes are hardly distinguishable with phylogenetic distances of e.g., < 0.0086 (average distance 272 between two sequences in this tree: 0.104197). 273 We combined multi-colour cytogenetics with five computational approaches to elucidate the type of 304 tetraploidy in B. corolliflora and its ancestry. Using all six approaches, we can now confidently exclude related B. macrorhiza genotypes. 309 To define the diploid ancestry of a polyploid is a question that is and has been commonly addressed 310 using cytogenetics (Schmidt et  To convincingly resolve the question of B. corolliflora's tetraploidy, we leveraged five data-driven 315 genomics approaches using our wild beet pangenome dataset. Three approaches are based on k-316 mers, one is based on mapping of synthetic reads and a fifth approach is based on sequences of 317 conserved and orthologous genes. The advantage of the k-mer approaches using reads as input to 318 generate the species-specific sets is that these approaches do not rely on a reference genome 319 sequence and are not dependent on e.g. the identification of homology through computationally 320 expensive (whole genome) alignment approaches (Ondov et al. 2016; VanWallendael and Alvarez 321

2022). 322
For all k-mer based approaches, it is important to take the genome size of the potential parents into 323 account. The k-mer set size is dependent on the genome size and also on the size of the assembly, 324 since the probability of a k-mer occurring just by chance grows with increasing genome sequence 325 size. B. corolliflora has an assembly size of 1,963 Mb and a 21-mer set size of 6.7e8, whereas B. 326 lomatogona and B. macrorhiza have an assembly size of 1,032 Mb and 736 Mb and a corresponding 327 21-mer set size of 4.9e8 and 4.3e8, respectively (see Table 1). For polyploids, the haploid genome 328 size might be more relevant. An additional copy of a genome, e.g. diploid vs. autotetraploid, does not 329 increase the k-mer set size linearly. However, rearrangements, TE expansions, and particularly small 330 mutations occurring after the polyploidisation/hybridisation increase the potential for additional k-331 mers. In addition to the biological genome size, the completeness and therefore the quality of the 332 assembly has similar effects on the k-mer set size. Even though the assembly quality for B. 333 macrorhiza is lower compared to the quality of the long read assemblies, there is a striking signal 334 towards B. macrorhiza for all approaches. 335 The composition of the B. corolliflora 21-mer set ( Figure 2A) shows a higher overlap with the B.
For trio binning, if both investigated species were the actual parental species of the child, it was 345 expected that approximately the same number of reads would be assigned to both supposed parents. 346 If only one of the candidate species was the parent, substantially more reads should be assigned to 347 the designated species. This number depends on the phylogenetic relationship of the second 348 candidate to the child species. Multiple factors, however, may lead to a divergence from these 349 normalised read datasets reflect the true k-mer sets well. Indeed, the difference in the size of the 376 exclusive k-mer sets when using reads is smaller (3.2e8 for B. lomatogona and 2.9e8 for B. 377 macrorhiza). 378 The idea of the k-mer fingerprinting approach is similar to the k-mer set operations method, however, 379 there are two major differences: i) the randomisation introduced in the fingerprinting method can 380 reduce the impact of errors when taking the average over a sufficiently large number of random sets 381 and ii) this approach allows to compare more than two parent candidates simultaneously. The average 382 fingerprint sizes are mainly related to genome size and sequence diversity (Table 3) Combining all our results, cytogenetics and the five computational approaches, a clear pattern 411 towards resolving B. corolliflora's ancestry emerges (Table 4). 412   is a great advantage for wild sea beets since they are almost exclusively found in coastal regions 524 (Romeiras et al. 2016). In such environments, salt stress represents the most significant abiotic stress. 525 The results of our analysis and the mentioned studies suggest that many of the identified salt stress-526 related genes have originated in the sea beet. 527

Conclusion 528
In this study, the pangenome dataset of sugar beet and CWRs was harnessed to get evidence for the 529 parental relationships of a polyploid species and to identify traits relevant for crop improvement. For B. macrorhiza, a short read Illumina assembly was generated, as only a low amount of plant 555 material was available, which was not sufficient for preparing DNA suitable for long read sequencing. 556 High molecular weight DNA was extracted using a previously described CTAB-based method (Siadjeu 557 et al. 2020). DNA extraction as well as Illumina sequencing was performed as previously described 558 (Sielemann et al. 2022). In total, 138 GB read data were generated (Supplemental_File_S5). The 559 reads were trimmed using Trimmomatic (v0.39) (Bolger et al. 2014) as described (Sielemann et al. 560 2022) and the quality was assessed using fastqc (v0.11.9) (Andrews 2020). All trimmed reads were 561 subjected to DiscovarDeNovo (v52488) (run with default parameters and 10 threads; 2 ) (Hopgood 1972). 636 The developed method can also be used to generate random sets of canonical k-mers. To achieve 637 optimal time and space complexity at sampling (without replacement) random k-mers, a specific 638 algorithm was used (sparse Fisher-Yates shuffle) (Ting 2021). 639

-K-mer set operations 640
Similarities and differences between the species-specific k-mer sets were assessed using various set 641 operations. As input for the k-mer set operations method, we first used normalised Illumina read 642 datasets of the potential parent species to achieve a comparable set size. Normalisation was 643 performed with bbnorm (Bushnell) and the parameters k = 21, a target depth of 20 and a minimum 644 threshold of 3 to discard likely erroneous reads. For the child species, the complete set of k-mers 645 based on the read datasets was used. In addition, the set operations were performed using sequence 646 assemblies as input. 647 For each investigated species dataset, the set of distinct canonical k-mers (k = 13, 21, 31) was species-specific k-mer sets, multiple subsets were generated using different set operations. This 650 includes the subset of k-mers present in all species (B. corolliflora, B. macrorhiza, and B. lomatogona), 651 shared among two species as well as the subset of k-mers shared by two species, but not present in 652 the third investigated species. These set operations were performed with KMC tools (v3. with KMC3 using either assemblies or high-quality, normalised (as described above) short reads as 659 input. Then, the set of exclusive k-mers for each of the potential parent species was calculated with 660 KMC tools (i.e. the set of k-mers present in one species, but not in the other). For the child species, 661 a long-read dataset was used. For each read in this dataset, the number of unique canonical k-mers 662 this read shares with the exclusive k-mer set of one of the parent candidate species was counted 663 using the 'k-mer operator' described above. This number is then divided by the number of unique 664 canonical k-mers of the read to get the 'k-mer share' for each potential parent. The average share of 665 exclusive k-mers assigned to the potential parents was calculated over all reads. Only reads, for which 666 at least half of the average k-mer share was assigned to the potential parents, are considered (the 667 other reads contain too few k-mers exclusive to either one of the potential parental species). The goal 668 of this filtering is to exclude reads where the overall signal is too weak to be interpreted reliably. All 669 remaining reads were then classified into four different classes (Supplemental_File_S7): if the number 670 of exclusive k-mers of a specific read is 3x higher for parent A than for parent B, the read was assigned 671 to parent A i) (beige) and ii) vice versa (orange-red) (Supplemental_File_S7). The read was classified 672 the impact of errors and therefore having a closer reflection of the real similarity when using the 684 average over multiplerandom canonical k-mer sets. In general, this approach is similar to k-mer 685 sketching. 686 To select a suitable k, the percentage of distinct canonical k-mers that are present in the dataset of 687 each species was computed for a range of k (14-20). In accordance with Fofanov et. al. (Fofanov et 688 al. 2004), a k was selected for which 5%-50% of all possible unique canonical k-mers were present 689 in all datasets. This ensured that the different species datasets can be distinguished and that 690 erroneous k-mers of low-quality reads do not impact the results. In general, the choice of k is a trade-691 off between a clear signal and computational intensity. In contrast to the previously described k-mer-692 based approaches, for which k=21 was well suitable, here, k=15 was selected based on the criterion 693 by Fofanov et. al. 2004, and KMC3 was used to generate k-mer sets. The 'k-mer operator' was used 694 to build indices for all investigated datasets and to generate 100,000 random sets of size 10,000 695 based on the whole set of all theoretically possible 15-mers. For each random set, the fingerprint, i.e. 696 overlap with the species-specific k-mer set, was calculated. Additionally, the fingerprint intersection, 697 i.e. the overlap of the fingerprint of the child species with the respective fingerprint of each investigated 698 potential parent species, was computed. the sequence was qualitatively assessed so that either a variant was present at a specific position 751 (1), or no variant was detected (0). 752 The average coverage per base as well as the average variance per base was calculated in a sliding 753 window approach. The approximate average gene length in sugar beet is 5 kb (based on the 754 KWS2320ONT v1.0 p1.0 annotation). To ensure high sensitivity and as consecutive conserved 755 regions are later merged into a single window, a window size of 2,500 bp was selected. An example 756 region is shown in Supplemental_File_S10. The shift size of 150 bp means that the first window spans 757 the region from 0 bp to 5,000 bp, whereas the second window spans the region from 150 bp to 5,150 758 bp (Supplemental_File_S9, B). All parameters can be selected by the user depending on the 759 application. 760 As stated above, regions not present in cultivated sugar beet KWS2320 should be associated with 761 low/no coverage in the crop wild relatives. To get these 'zero coverage regions', after the extraction 762 of primary mappings, a maximal coverage of 1% of the mean coverage per contig was set as filter 763 criterion. As short-read assemblies are highly fragmented and short contigs are present, the coverage 764 was normalised in these cases by the value calculated from the whole assembly (for B. vulgaris subsp.