Large genomic variants reveal unexplored intraspecific diversity in Brassica rapa genomes

Traditionally, reference genomes in crop species rely on the assembly of one accession, thus occulting most of intraspecific diversity. However, rearrangements, gene duplications and transposable element content may have a large impact on the genomic structure, which could generate new phenotypic traits. Using two Brassica rapa genomes recently sequenced and assembled using long-read technology and optical mapping, we investigated structural variants and repetitive content between the two accessions and genome size variation among a core collection. We explored the structural consequences of the presence of large repeated sequences in B. rapa ‘Z1’ genome versus the B. rapa ‘Chiifu’ genome, using comparative genomics and cytogenetic approaches. First, we showed that large genomic variants on chromosomes A05, A06, A09 and A10 are due to large insertions and inversions when comparing B. rapa ‘Z1’ and B. rapa ‘Chiifu’ at the origin of important length differences in some chromosomes. For instance, lengths of ‘Z1’ and ‘Chiifu’ A06 chromosomes were estimated in silico to be 55Mb and 29Mb, respectively. To validate these observations, we compared using fluorescent in-situ hybridization (FISH) the two A06 chromosomes present in a F1 hybrid produced by crossing these two varieties. We confirmed a length difference of 17.6% between the A06 chromosomes of ‘Z1’ compared to ‘Chiifu’. Alternatively, using a Copy Number Variation approach, we were able to quantify the presence of a higher number of rDNA and Gypsy elements in ‘Z1’ genome compared to ‘Chiifu’ on different chromosomes including A06. Using flow cytometry, the total genome size of 12 Brassica accessions corresponding to a B. rapa available core collection was estimated and revealed a genome size variation of up to 16% between these accessions as well as some shared inversions. This study revealed the contribution of long-read sequencing of new accessions belonging to different cultigroups of B. rapa and highlighted the potential impact of differential insertion of repeat elements and inversions of large genomic regions in genome size intraspecific variability.


INTRODUCTION
Genetic and phenotypic diversity are the drivers of plant evolution and adaptation.
While ecology and population genetics provide important knowledge on the molecular processes underlying the astonishing plant biodiversity, comparative genomic analyses are essential to understand the large-scale genomic variations that can be observed within species and how these structural variations may impact plant phenotypes. Re-sequencing of accessions and construction of pangenomes of plant species such as wheat (Montenegro et al., 2017), maize (Gage et al., 2019), rice (Schatz et al., 2014;Wang et al., 2018;Zhao et al., 2018) or soybean (Lam et al., 2010;Li et al., 2014) have revealed the involvement of structural variations in various agronomically relevant traits. In maize, structural variants are implicated in plant architecture, flowering and disease resistance (Walker et al., 1995;Chia et al., 2012;Lu et al., 2015). Similarly, using pangenomes, numerous structural variations have been revealed in Brassica napus varieties correlating with genes and phenotypic traits such as silique length and seed weight (Gazave et al., 2016), disease resistance (Dolatabadian et al., 2020) as well as flowering time and vernalization (Song et al., 2020). Structural variants identified as underlying flowering time were associated with the insertion of transposable elements in three FLC genes (Song et al., 2020). However, structural diversity in the progenitors of this allopolyploid species have so far been particularly overlooked.
Only recently, the development of Brassica pangenomes has started to reveal the extensive number of structural variants within one of the two diploid progenitors: Brassica oleracea (Golicz et al., 2016) and assessed in regards to the domestication process of the subsequent allotetraploids B. juncea (Paritosh et al., 2019) and B.
napus (Song et al., 2020). These studies highlight the importance of structural variants and gene copy number variation in adaptive phenotypes. While core genes represent 80% of the assembled genomes, each new accession brings around 20% of novel genes. In most of these analyses however, the occurrence and importance of the repeat compartment is undervalued.
Transposable elements form an important internal source of genetic diversity as a result of their ability to create mutations, alter gene expression, and promote chromosomal mispairing (Kidwell and Lisch, 2000). These ancient, ubiquitous, and dynamic components of eukaryotic genomes comprise up to 58% of the allopolyploid genome of B. napus (Song et al., 2020). Transposable elements comprise two classes that have contrasted mode of replication, each divided in several families: Retrotransposons (Class I) transpose via reverse transcription of their messenger RNA and DNA Transposons (Class II) encompassing all other types of TEs (Wicker et al., 2007). Overall, TEs have major impact on genome organization and function, and have been associated with up-regulated genes in response to various abiotic stresses, environmental gradients, reproductive mode (asexuality, selfing vs. outcrossing), interspecific hybrid formation and polyploidy (Wright and Schoen, 1999;Kalendar et al., 2000;Ungerer et al., 2006;Makarevitch et al., 2015;Vicient and Casacuberta, 2017). Several studies from different biological systems indicate that the patterns (position and number of copies) of a single Transposable Element (TE) family can vary within species (Quadrana et al., 2016). However, the study of TEs inserted in complex genomes of various populations is challenging and rely on the amplification of specific TE sequences (for instance using Sequence Specific Amplification Polymorphism transposon display, Parisod et al., 2009) or lowcoverage sequencing strategies (Ferreira de Carvalho et al., 2016). The utilization of long-read sequencing and optical mapping in several varieties of the same species is allowing fine comparative genomic analysis of the differential proliferation of TEs and other repeats. This overlooked component of plant genomes is getting accessible and provides important insights on the structural genomic diversity standing within species.
In the present study, we conducted a fine comparative genomic analysis of two varieties of Brassica rapa 'Chiifu' and 'Z1'. Brassica rapa (AA, 2n = 20) is one of the diploid progenitors of the important allotetraploid oilseed crops, B. juncea (AABB, 2n = 36) and B. napus (AACC, 2n = 38) (U, 1935). Therefore, B. rapa varieties can be used to broaden the genetic diversity of these domesticated species.
Brassica diploid species arose from a Whole Genome Triplication (WGT) that occurred about 15-20 million years ago (Murat et al., 2015). Diploid Brassica are thus considered as paleohexaploids. Thereafter, their duplicated genomes rapidly underwent extensive gene deletion through a fractionation process (Murat et al., 2015). The majority of the genes are tending towards a return to single-copy status (Wang et al., 2011), and overall gene content in Brassica diploid genomes has so far been reduced by about half. Yet, different cultigroups within Brassica species have selected different alleles from the same duplicated genes explaining part of the astonishing and extensive number of morphotypes existing within each species (Cheng et al., 2014(Cheng et al., , 2016bGolicz et al., 2016). Brassica rapa speciation started after the divergence with B. oleracea estimated 2.18 Mya (Li et al., 2017) but its diversification occurred mainly during the domestication process of B. rapa crops, 3500 years BC in the Mediterranean basin with a secondary center of diversification in Asia (Guo et al., 2014). This resulted in genetically (Aissiou et al., 2018) and phenotypically diverse morphotypes (leafy, root or fodder and oilseed types) corresponding to different subspecies including ssp. pekinensis (Chinese cabbage with 'Chiifu') and ssp. trilocularis (Yellow sarson with 'Z1'). Chinese cabbage has originated in central China by natural hybridization between Pak-choi (ssp. chinensis) and turnip rape (ssp. oleifera) 1,200-2,100 years ago (Song et al., 1990;Qi et al., 2017). Yellow sarson originated from India but its precise origin is still unknown. Still, the underlying structural genomic diversity has never been explored within this important crop.
Using established protocols (flow cytometry and cytogenetics) along with state-ofthe-art whole genome sequencing and optical mapping for assembly of complex Brassica genomes, we compared structural rearrangements between two varieties of B. rapa that have contrasted morphotypes, evolutionary history and reproductive mode. Specifically, we identified differential insertion of ribosomal DNA and LTR retrotransposons and reconstructed the evolution of these repeats during B. rapa domestication. Finally, we unraveled wide variation in DNA content among different accessions belonging to B. rapa core collection.

Biological material
Two B. rapa genotypes, B. rapa var. pekinensis cabbage 'Chiifu' (pure line) and B. rapa var. trilocularis 'Z1' (doubled haploid line) as well as their F1 hybrid ('Z1'x'Chiifu') were grown in semi-controlled conditions in a greenhouse. Seven to ten individuals of each B. rapa accession representing biological replicates were harvested for leaves and roots. This material will further be used in flow cytometry and fluorescent in-situ hybridization (FISH) experiments, respectively. In addition, ten accessions representing the different sub-species of B. rapa (Zhao et al., 2010;Cheng et al., 2016b) were grown in a similar environment for assessing genome size variability using flow cytometry. Fresh leaves from three to eight individuals, used as biological replicates, were harvested per accession: from group 1, one accession of ssp. pekinensis, from group 2, one accession of ssp. narinosa; from group 3, one accession of ssp. perviridis and one accession of ssp. nipposinica; from group 4, two different accessions of ssp. oleifera; from group 5, one accession of ssp. trilocularis; from group 6, three different accessions of ssp. rapa. Leaf tissue is then directly processed for flow cytometry analyses (Supplementary Table 1).

Annotations of Transposable Elements
The latest genome assemblies available for B. rapa 'Z1' (Belser et al., 2018) and 'Chiifu' V3 (Zhang et al., 2018) were retrieved from the Genoscope website (http://www.genoscope.cns.fr/plants) and the Brassica database (http://brassicadb.org/) respectively. These genomes were subjected to the same transposable element annotation package REPET V2.5 (Quesneville et al., 2005;Flutre et al., 2011;Hoede et al., 2014). Briefly, the REPET pipeline was used for the detection, classification (TEdenovo) and annotation of TEs (TEannot). Using each reference genome, the TEdenovo pipeline identifies TE copies and provides a consensus sequence when at least five copies of the same TE family are retrieved within the genome. Then, TEannot pipeline uses the consensus sequences to annotate the provided reference genomes. Two files containing the structural annotation of 'Z1' and 'Chiifu' were output in GFF3 format (Generic feature Format version 3). The resulting GFF3 files were hand curated using custom python scripts (Python version 2.7.12).

Comparative structural analyses
To evaluate large-scale genome structural rearrangements, the assembled genomes of B. rapa 'Z1' and 'Chiifu', as well as the A subgenome of Brassica napus Darmor Bzh V4.1 were retrieved from the Brassica database (http://brassicadb.org/) and compared. The package MUMmer V3.23 (Kurtz et al., 2004) was used on each of the 10 chromosomes using the following settings: -id 95, -l 4000. Inversions or deletions observed in 'Z1' when compared to 'Chiifu' were further verified using BioNano raw data from Belser et al., (2018). To better characterize these genomic regions, gene annotations were retrieved from GFF files. GO enrichment was performed on GO terms of the identified genes present in specifically inserted regions of 'Z1' using Agrigo V2 (Tian et al., 2017). Finally, we assessed the duplication status of these genes using PCK (Murat et al., 2015) and custom made python scripts. In addition, among all inversions three showing defined breakpoints in both 'Z1' and 'Chiifu' were chosen and validated using custom-made PCR primers designed in flanking regions of inversions and specific to either 'Z1' or 'Chiifu' accessions (Supplementary Table 2).

Phylogenetic analyses
To study the evolutionary relationships of 45S rDNA and Gypsy sequences of Brassica rapa 'Chiifu' and 'Z1', phylogenetic analyses were performed. First, 45S rDNA and Gypsy sequences (firstly translated to protein sequences and merged to the 96 Gypsy database sequences) of 'Chiifu' and 'Z1' were aligned separately using Mafft alignment (v.7.245; --auto option, (Katoh and Standley, 2013). Second, beginning and end of the matrices with less than 25% of the sequences were deleted as well as indels present in more than 25% of the sequences. Phylogenetic analyses were performed using Geneious tree builder with the Jukes-Cantor model and the Neighbour-joining method (45S rDNA clade support was determined by 100 bootstrap replicates). To avoid Long Branch Attraction error, sequences presenting long branches were deleted and new phylogenetic analyses were performed using the parameters described above.

Genome size estimation by flow cytometry
In order to explore the intraspecific variability of genome size among the different

Structural analysis of two highly contiguous Brassica rapa genomes
Structural comparison of the B. rapa 'Z1' genome against both B. rapa 'Chiifu' genome and B. napus 'Darmor Bzh' A subgenome revealed six large sequences specific to B. rapa 'Z1' on A01, A05 (two regions), A06, A08 and A09 (Table 1, see also Figure 1). These regions were qualified as 'specific' because no orthologous match could be found in the assembled Brassica genomes tested in this study.  Figure   1). In B. rapa 'Z1', several chromosomes (especially the A05 and A06 chromosomes) had a higher proportion of TEs compared to 'Chiifu' assembled chromosomes. As  (1,347 copies).

Repeat element content in B. rapa 'Z1' and 'Chiifu'
As the large specific sequences of 'Z1' were made predominantly of rDNA and LTR Gypsy, phylogenies of both repeated elements were performed to explore their evolutionary dynamics. Ribosomal DNA phylogeny of the 'Chiifu' and 'Z1' genomes allowed us to identify three clades, one common to the two B. rapa samples (clade 3) and two containing only 'Z1' sequences (clades 1 and 2). These two clades contained 67 and 96 'Z1' rDNA sequences, respectively, coming from chromosomes A01, A05, A06 and A09 (Figure 4, A). LTR Gypsy phylogeny allowed us to identify six major clades (Figure 4, B). Proportions of sequences within each LTR Gypsy clade was not significantly different between 'Chiifu' and 'Z1' (χ 2 test; = 0.05).  Table 4).

Genome size variation and presence of inversions among B. rapa accessions
DNA content was obtained for 'Z1', 'Chiifu' and 10 accessions representing B. rapa core collection. After checking normality of data (Shapiro's test, P-value = 0.487), the significative impact of accessions was modelised with an ANOVA including one factor (df = 11, P-value = 4.19e-13), followed by Tukey's pairwise comparisons and group assignment based on P-value < 0.05. Of particular interest, the two focal species 'Chiifu' and 'Z1' have significantly different 1C DNA content with 0.557 (median = 0.565) and 0.590 (median = 0.588), respectively (Tukey HSD, P-value = 0.019). All accessions can be divided in seven subspecies and six groups structured within the core-collection ( Figure 6). The variation in genome size does not seem to be associated with any of those phylogenetic structures.
Finally, we explored the presence of these inversions in the same core-collection of B. rapa accessions. Although, it appeared that some nucleotide divergence might have prevented amplification, some accessions seem to share preponderantly either one or the other genetic structure (inversion on A05 zone II: populations 1, 2, 4, 5 and 10 might seem alike 'Z1' whereas populations 3, 7, 8 and 9 seem alike Chiifu; inversion on A10: populations 3, 4, 5, 6, 7, 9 and 10 seem more alike 'Z1' than ).

DISCUSSION
The large phenotypic diversity within B. rapa species has always triggered intensive investigations from the community. Ancient whole genome duplication followed by subgenome parallel selection has been shown to play a role in morphotype diversification in B. rapa (Cheng et al., 2016b). In addition, contrasted levels of cytosine methylations correlating with various TE densities have been proposed to shape subgenome dominance and fractionation levels within the genome of B. rapa (Cheng et al., 2016a). However, the extent of TE diversity within the species as well as the structural genomic variants that might additionally explain within species diversity have merely been touched upon.

Detection of structural variations between two accessions of B. rapa and impact on genome size
A recent study on whole-genome resequencing of eight B. napus accessions revealed millions of genomic variants resulting in 6.8% to 13.2% (for a total of 77.2 Mb to 149.6 Mb) of the genome presenting such variations (Song et al., 2020).
Similarly, chromosome-level assemblies of seven A. thaliana genomes unravel overall 13-17 Mb of rearranged genomic variants and up to 6.5 Mb (or 5.5% of the reference genome) of genome-specific sequence based on all possible pairwise comparisons (Jiao and Schneeberger, 2020). In the present study, we identified 23 Mb of inversions (representing 4.3% of the genome) and 73 Mb of specific insertions (representing 13.8% of the genome) in 'Z1' compared to 'Chiifu'. These results show the large unexplored genomic diversity present in only two accessions of B. rapa.
Although we cannot directly address their role in phenotypic diversification within the species, these insertions include 439 coding regions for which almost half of them cannot find an orthologous copy in the 'Chiifu' genome. These genes might be part of the dispensable gene pool as explained in a study gathering ten accessions of B.
Besides the functional consequences of such variants, the whole genome structure could be particularly impacted. The variation in genome sizes between our focal accessions has been previously addressed with estimations varying between 529 Mb and 443 Mb for 'Z1' and 'Chiifu' respectively (Belser et al., 2018;Zhang et al., 2018). However, the disparate methods used: either by flow cytometry or kmer analysis can reveal wide differences (see Zhang et al., 2018). Here, using internal control and repeated evaluations of DNA content by flow cytometry, we estimated an increase of ca. 6% of genome size between 'Z1' and 'Chiifu'. In addition, we were able to validate the difference of the A06 chromosome lengths between the focal accessions using a ribosomal DNA probe in a BAC-FISH experiment showing first hand the possible additional impact of these elements on chromosome and genome size.
Extending our analysis to the B. rapa core-collection, we explored the variation in genome sizes among different accessions of B. rapa representatives of the phenotypic diversity in the species. Interestingly, up to 16% variation in genome size was observed among the 12 accessions without visible structuration according to their phylogenetic relationships. These differences are probably due to intraspecific variation in repeat content and insertion/deletion of sequences as their expansion correlates with the diversification of morphotypes (ca. 1mya, Zhao et al., 2013).
Very few studies tried so far to identify intra-specific genome size variation in Brassicaceae (with same ploidy level and same number of chromosomes). In Johnston et al., (2005), B. rapa species (including unknown accessions) exhibited wider standard error than any other Brassicaceae species. By comparison in Arabidopsis thaliana, the analyses of DNA content in ten ecotypes showed little variation with less than 1% among the different accessions (Johnston et al., 2005) which seemed to be corroborated with recent assemblies of various A. thaliana accessions with low genome size variance (Jiao and Schneeberger, 2020) .

Contrasted repeat landscapes between the focal accessions
Usually, genome size variation is often associated with TE content and ribosomal repeats as these elements could be highly dynamic. The large specific insertions identified in 'Z1' allowed to improve our understanding of centromeric and pericentromeric regions that are usually difficult to sequence and assemble in complex genomes such as plant genomes, due to their high density in repeats.
Interestingly, apart from some dispensable genes, specific 'Z1' insertions exhibited mostly both rDNA copies and LTR elements.
In Brassica species, the number of 45S rDNA loci is variable. For example, B.
oleracea, B. nigra and B. rapa have four, six and ten rDNA loci, respectively (Hasterok et al., 2001). However, the number of 45S (and 5S) rDNA loci varies among varieties within each species. Hasterok et al., (2006)  varieties (Hasterok et al., 2006). However, based on the signal intensity, as shown by (Xiong and Pires, 2011) for the chromosome A05 of B. rapa 'Chiifu' vs. A05 of 'IMB218', the loci on the chromosome A06 in B. rapa 'Z1' had low signal intensity but showed a decondensation while the loci on the chromosome A06 in 'Chiifu' are fully condensed. As the A03 locus has been shown to be the only one carrying transcriptionally active ribosomal genes in the nucleolar organizer region (Hasterok and Maluszynska, 2000), it could be of interest to test if the 45S rDNA genes localized on the A06 in 'Z1' are transcribed and active.
Although the number of loci between 'Z1' and 'Chiifu' is identical, the number of 45S rDNA copies is different between these two varieties. The number of copies estimated in 'Z1' (1,750 copies) is highly similar to the number of rDNA copies previously estimated in this variety (1,709 copies, Sochorová et al., 2017) and is more abundant than the number of copies identified in 'Chiifu' (1,416 copies). This difference could be the result of rearrangements with loss of rDNA copies in B. rapa 'Chiifu', as demonstrated in the polyploid species such as B. napus or Nicotiana tabacum (Lim et al., 2000;Książczyk et al., 2011;Sochorová et al., 2017). However, rDNA phylogeny of 'Chiifu' and 'Z1' (Figure 4, A) has shown that among the three clades identified, two included only 'Z1' sequences (163 sequences). Thus, the most parsimonious scenario would indicate that these two clades most probably result from duplication events in 'Z1' after the divergence between the two focal accessions and not from loss of copies in 'Chiifu'.
Regarding TE content, assembly of the 'Z1' genome has allowed a first comparative exploration of the repetitive compartment between 'Z1' and 'Chiifu' (Belser et al., 2018), however differences were mainly attributed to methodological dissimilarities between genome assemblies. Here, using a consistent analysis pipeline for TE detection and annotation in these two accessions, we can reveal the impact of TEs on B. rapa intraspecific diversity taking into account the structural variations.
Interestingly, a tangible difference in TE content is consistently observed between these two accessions and we proposed that this pattern is not widely distributed along the genome but focused on specific chromosomes and identified structural variants.
The detailed analysis of the structural variants in 'Z1' showed that while inversions do not significantly sustain a higher proportion of TEs compared to the rest of the genome, specific insertions in 'Z1' carry a large proportion of LTR Gypsy and LINEs.
TE proportions could be underestimated in these regions as a large part of the inserted sequences were unsequenced data, we are nonetheless confident on the length of these sequences thanks to reliable optical mapping scaffolds.
TE intraspecific diversity has been previously shown to be driven by an expansion of recent LTR retrotransposons in B. rapa with rapid nucleotide substitution (Zhao et al., 2013). However, we do not observe such patterns using a phylogenetic approach. Instead, all LTR Gypsy clades seem to be represented in the inserted regions. This tendency seems to be particularly exacerbated in B. rapa and by extension in the A subgenome of B. napus compared to B. oleracea and the C subgenome of B. napus as demonstrated in Song et al., (2020). In addition, the inserted LTR retrotransposons do occur in clusters on chromosomes A05, A06 and A09 and their spreading predates the divergence between B. rapa and B. oleracea (Song et al., 2020).

CONCLUSION and PERSPECTIVES
Our results unveil the necessity for long-read sequencing strategies and developing various whole-genome references for agronomically important species (Danilevicz et al., 2020). This recent endeavour in constructing pangenomes is revealing the importance of structural variants and the diversity in repeat content within species.
This is especially relevant in species showing broad phenotypic diversity such as Only alignment blocks with greater than 95% of identity and a length greater than 4,000 bp were shown.   Gypsy sequences from the Gypsy database for identifying major clades of TEs.