Genome-wide analysis of long terminal repeat retrotransposons from the cranberry Vaccinium macrocarpon

BACKGROUND Long terminal repeat (LTR) retrotransposons are widespread in plant genomes and play a large role in the generation of genomic variation. Despite this, their identification and characterization remains challenging, especially for non-model genomes. Hence, LTR retrotransposons remain undercharacterized in Vaccinium genomes, although they may be beneficial for current berry breeding efforts. OBJECTIVE Exemplarily focusing on the genome of American cranberry (Vaccinium macrocarpon Aiton), we aim to generate an overview of the LTR retrotransposon landscape, highlighting the abundance, transcriptional activity, sequence, and structure of the major retrotransposon lineages. METHODS Graph-based clustering of whole genome shotgun Illumina reads was performed to identify the most abundant LTR retrotransposons and to reconstruct representative in silico full-length elements. To generate insights into the LTR retrotransposon diversity in V. macrocarpon, we also queried the genome assembly for presence of reverse transcriptases (RTs), the key domain of LTR retrotransposons. Using transcriptomic data, transcriptional activity of retrotransposons corresponding to the consensuses was analyzed. RESULTS We provide an in-depth characterization of the LTR retrotransposon landscape in the V. macrocarpon genome. Based on 475 RTs harvested from the genome assembly, we detect a high retrotransposon variety, with all major lineages present. To better understand their structural hallmarks, we reconstructed 26 Ty1-copia and 28 Ty3-gypsy in silico consensuses that capture the detected diversity. Accordingly, we frequently identify association with tandemly repeated motifs, extra open reading frames, and specialized, lineage-typical domains. Based on the overall high genomic abundance and transcriptional activity, we suggest that retrotransposons of the Ale and Athila lineages are most promising to monitor retrotransposon-derived polymorphisms across accessions. CONCLUSIONS We conclude that LTR retrotransposons are major components of the V. macrocarpon genome. The representative consensuses provide an entry point for further Vaccinium genome analyses and may be applied to derive molecular markers for enhancing cranberry selection and breeding.

activity, we suggest that retrotransposons of the Ale and Athila lineages are most promising 25 to monitor retrotransposon-derived polymorphisms across accessions. 26 CONCLUSIONS: We conclude that LTR retrotransposons are major components of the V. 27 macrocarpon genome. The representative consensuses provide an entry point for further 28 Vaccinium genome analyses and may be applied to derive molecular markers for enhancing 29 cranberry selection and breeding. 30 Keywords: cranberry, Vaccinium macrocarpon, repetitive DNA, LTR retrotransposon, Ty1-31 copia, Ty3-gypsy 32

INTRODUCTION 33
The genus Vaccinium L. belongs to the family Ericaceae Juss. comprising approx. 450 34 species distributed all over the world [1]. Available molecular phylogenies suggest that the 35 genus is monophyletic, whereas intrageneric relationships are more difficult to resolve [2,3]. 36 Dependent on different systematic treatments, the genus is divided into approximately 30 37 sections and subgenera [1][2][3][4]. Vaccinium macrocarpon Aiton, also known as the American 38 cranberry, is native to North America [5,6] and is one of the most commonly cultivated 39 Vaccinium species, along with the highbush blueberry (V. corymbosum L.) [5]. The small 40 berry fruits produced from wild and cultivated species of the genus are rich in nutritious 41 secondary plant metabolites, including some beneficial for human health with anticancer, 42 antioxidant, antidiabetic, and many other properties [7,8]. The American cranberry and the 43 European cranberry (a close relative, V. oxycoccus L.) belong to the same section or 44 subgenus, named Oxycoccus [1,4]. Although these two species differ in geographical 45 distribution, they can produce fertile offspring upon hybridization [5,6]. 46 Vaccinium species have ploidy levels ranging from diploid to hexaploid, with a base 47 chromosome number x = 12 [5]. As cranberry (V. macrocarpon) is a diploid and self-fertile 48 species with a relatively small genome size (approximately 470 Mb), it has become a 49 valuable reference to study the genetics and genomics of the genus [6,[9][10][11]. So far, whole 50 genome sequence data, a reference genome assembly, and transcriptome datasets have been 51 published for the diploid cranberry and for both the diploid and the tetraploid blueberries 52 [11][12][13]. In addition, several high-throughput genetics, genomics, and transcriptomics 53 datasets as well as other important breeding information of different species of the genus 54 Vaccinium are either published or underway, accessible through the web platform 55 was composed of TEs [10]. Very recently, a TE annotation was also made available for the 81 genome of tetraploid blueberry (V. corymbosum) [13]. However, the contribution of 82 individual TE families to these genomes is still not well-studied [10][11][12]. To representatively survey the cranberry LTR retrotransposon landscape, we used publicly 92 available whole genome paired-end Illumina reads of Vaccinium macrocarpon (NCBI 93 BioProject PRJNA245813) of cranberry cultivar 'Ben Lear' (accession number CNJ99-125-94 1) [10]. Before clustering, we pre-treated the reads by quality filtering, adapter trimming, 95 and processing of read length: Illumina Truseq adapters were removed using Trimmomatic 96 with the parameters ILLUMINACLIP:TruSeq3-PE-2.fa:2:30: 10 [37]. Fastx_trimmer and 97 FASTx quality filter from FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit) were 98 used to trim all reads to 150 bp length. Shorter sequences remaining after quality filtering 99 and trimming, were removed with seqtk (http://github.com/lh3/seqtk). This was followed by 100 interlacing of paired-ends with the FASTX-Toolkit. We Table S1). 239 For a comprehensive overview of the LTR retrotransposon population, we identified the 240 reverse transcriptase (RT) domains in the V. macrocarpon reference genome assembly. We 241 queried the V. macrocarpon assembly and identified 181 Ty1-copia and 294 Ty3-gypsy RT 242 instances. The highest number of RT sequences belonged to the Ale-(n = 110, Ty1-copia) 243 and Tat-(n = 184, Ty3-gypsy) types (Table S3). 244 Maximum-likelihood analyses allowed us to verify the classification of the reference full-245 length elements for both superfamilies. For the Ty1-copia retrotransposons, we noticed a 246 close relationship between five well-defined lineages: Ikeros, Bianca, Tork, TAR, and Page 13 Angela. A close relationship of Ivana and SIRE sequences was also observed. Although 248 Alesia only harbors two members, it forms a single, well-separated branch. Finally, the 110 249 Ale RT sequences are polyphyletic and strongly diversified (Fig. 1A). In the superfamily 250 Ty3-gypsy (Fig. 2), the non-chromoviruses and chromoviruses are similarly organized. The 251 non-chromovirus retrotransposons of the Tat type were the most abundant and the most 252 diversified. The chromovirus lineage produced four branches representing the clades CRM, 253 Reina, Galadriel, and Tekay ( Fig. 2A), with a clade of non-chromovirus Athila as sister. 254 The separation of the chromoviruses into clades could be verified by reconstructing a 255 dendrogram based on their signature chromodomain (CHD) sequences (Fig. S2). Although 256 full-length Reina elements were not found in the RepeatExplorer-based reconstructed 257 retrotransposon sequences (Table S4), the CHD search along the genome assembly of the 258 same V. macrocarpon accession, reveals that CHD protein sequences are indeed present 259 from all four chromovirus clades, i.e., CRM, Galadriel, Tekay, and Reina (Fig. S2). 260 Although branch lengths in the dendrogram provide information on the relative divergence 261 between retrotransposon groups, the boxplot visualizations allow a direct comparison of the 262 respective RT diversities (Figs. 1B, 2B). For this, pairwise identities from the RT amino acid 263 multiple sequence alignments were utilized. In Ty1-copia lineages, Ale RT sequences have 264 the highest diversity, as they exhibit the widest spread with a maximum of 90 %, a minimum 265 of 28 %, and a median pairwise identity of 59 %. The lowest median identity (57 %) is 266 observed for Ivana, whereas the highest conservation is observed for Bianca with a median 267 pairwise identity of 84 % (Fig. 1B). Generally, Ty1-copia RT sequences are more diverse 268 than those from Ty3-gypsy. Regarding the median RT identity, the most diverse clade in the 269 Ty3-gypsy superfamily is Tat (61%), whereas Reina is the least diversified (73 %). The 270 remaining clades of the Ty3-gypsy superfamily (Athila, CRM, Galadriel, and Tekay) have similarly diverse RT sequences, with median RT identities ranging between 65 % and 74 % 272 ( Fig. 2B). 273

Structural diversity of Ty1-copia and Ty3-gypsy LTR retrotransposons 275
Based on the clustered short reads, we reconstructed and characterized 26 Ty1-copia and 28 276 Ty3-gypsy full-length retrotransposons, representative for 54 repeat families in the genome 277 of V. macrocarpon ( Table 1). The majority of these full-length elements (40 out of 54) was 278 represented by a single, circular RepeatExplorer cluster in our analysis (Fig. S3). Reads from 279 the remaining Ty1-copia and Ty3-gypsy elements were split into smaller clusters. Due to 280 overlapping read pairs, we were able to connect these smaller clusters to superclusters 281 spanning retrotransposons of full length (Table S1) Table 1; Table S5). 287 To illustrate the structure of the individual Ty1-copia lineages, we selected eight 288 representative elements ( Fig. 3; Fig. S4). Among all the Ty1-copia lineages, we 289 reconstructed seven full-length Tork, five Ale, two Alesia, one Bianca, three Ivana/Oryco, 290 and one Sireviruses/Maximus (Table 1; Table S5): The longest consensus element belongs to 291 the latter group and is about 7 kb, it is partially incomplete, containing only the 5' LTR 292 (1545 bp). Other lineages have retrotransposon lengths ranging between 4.8-6.8 kb with 293 LTR lengths between 127-994 bp ( Table 1; Table S5). While the highest genome proportion 294 among the reconstructed Ty1-copia elements belongs to the elements SCL22_Ale (0.57 %), the lowest genome proportion is observed for SCL344_Alesia (0.01 %) (Table S5). Although 296 all full-length in silico Ty1-copia elements contain uninterrupted gag and pol protein 297 domains, some characteristic retrotransposon features were absent. For example, while most 298 of the in silico elements that represent Ale, Angela and Tork retrotransposons harbored the 299 PBS, PPT, and both 5' and 3' LTRs, one of these structural motifs was missing in the 300 representative in silico sequences of Bianca, Ivana/Oryco, TAR, and Sireviruses/Maximus 301 (Table 1; Table S5). The most common PBS-type detected in Ty1-copia lineages 302 corresponds to the Met-type. For Ale and TAR retrotransposons, however, a PBS of the 303 Asn/Met-type (Ale) or Glu/Ser/Met/Tyr/Trp-type (TAR) was more common (Table 1; Table  304 S5). 305 Six reference elements were selected to represent the structural variability of each Ty3-gypsy 306 lineage (Fig. 4). The most diversified and abundant Ty3-gypsy elements in the V. 307 macrocarpon genome belong to the Tat group (Table 1). A total of 13 full-length elements 308 have been characterized, with genome proportions ranging between 0.12 and 1.34 % (Table  309 1). In contrast, the other two groups only contribute 0.01-1.92 % (Athila) and 0.02-0.23 % 310 (chromoviruses) to the genome (Table 1; Table S5). Although gag and pol genes were 311 detected for all full-length elements, structurally complete elements with all motifs typical 312 for LTR retrotransposon (PBS, PPT and both 3' and 5' LTR regions) were mostly present in 313 the Galadriel clades (chromoviruses). Most elements of the other groups (chromovirus 314 Tekay, CRM and non-chromoviruses Tat and Athila) lack at least one of these structural 315 motifs. Moreover, some Ty3-gypsy elements have additional protein domains as compared to 316 Ty1-copia elements. For instance, the pol genes of the non-chromovirus Tat clade have dual 317 ribonuclease H domains (gRNH and aRNH), whereas the chromoviruses harbor an 318 additional chromodomain (CHD) region ( Fig. 4; Table S5). A canonical CHD was only 319 found for the elements of the Tekay and the Galadriel clades, but surprisingly not within the Page 16 elements of the chromovirus CRM clade ( Fig. 4; Fig. S4; Table S5). SCL16_Ogre is the 321 longest Ty3-gypsy Tat element, with a total length of 19 kb and 1251-1326 bp long LTRs. In 322 contrast, SCL80_CRM is the shortest element (5.2 kb) with LTRs ranging between 361-556 323 bp (Table S5). A Met-type PBS was only found in the CRM and Galadriel chromoviruses, 324 whereas other Ty3-gypsy elements were more variable regarding their PBS (Table 1). 325 326 Cranberry retrotransposons are often associated with short tandem repeats 327 Although tandem repeats, i. e. satellite or ribosomal DNAs, represent major repetitive 328 genome fractions that are distinct from the more dispersed transposable elements, tandemly 329 repeated motifs can also be associated with retrotransposons. In V. macrocarpon, we 330 detected tandem repeats in the reconstructed full-length Ty1-copia and Ty3-gypsy 331 retrotransposon sequences (Table 2; Table S6). These tandem repeats were either localized 332 within the 3' or the 5' LTRs, within the untranslated regions (UTRs) adjacent to either LTR, 333 or in some cases within the UTRs adjacent to the gag gene. Tandem repeats found in 334 different retrotransposons vary in sequence composition, period size, and number of 335 repetitions (Table 2; Table S6). For the Ty1-copia superfamily, we identified tandem repeats 336 in all lineages, embedded in either the LTRs or UTRs, with consensus period lengths ranging 337 from 2 to 49 bp with about 2 to 8 repetitions (Table 2; Table S6). 338 In Ty3-gypsy LTR retrotransposons, tandem repeats were detected both in non-339 chromoviruses and chromoviruses, yet to a different extent. Non-chromovirus Tat and 340 chromoviruses harbored tandem repeats both within the LTRs and UTRs, whereas non-341 chromovirus Athila had tandem repeats only embedded within the UTR adjacent to the gag 342 gene (Table 2). Among all Ty3-gypsy LTR retrotransposons, Tat (including Ogre) harbor 343 tandem repeats with the largest monomer size (up to 93 bp) and the highest number of repetitions (up to 15 times) ( Table 2; Table S6). 345 Cis-regulatory elements are often associated with LTR sequences 347 As LTRs frequently harbor cis-regulatory elements, our targeted search revealed that almost 348 all reconstructed retrotransposons include typical promoter motifs, including TATA and 349 CAAT boxes within their LTRs (Fig. S5A-B; Table 2; Table S7) Table S7). 356 In addition to the promoter motifs, we detected other cis-regulatory elements for different 357 LTR retrotransposons (Table S7). These regions are mostly related to light, hormone, and 358 cell cycle responsiveness, as well as abiotic stress tolerance (Fig. S5A In an attempt to fine-scale the classification of the full-length retrotransposon elements of V. 366 macrocarpon, closely related elements were subjected to an all-against-all sequence comparison with publicly available references to reveal sequence similarities among them 368 (Table S2). 369 Overall, the V. macrocarpon full-length elements have higher similarity to other elements 370 within their protein coding regions, contrasting the variability within their UTRs and LTRs 371 ( Fig. S6A-E). However, the levels of similarity for coding and non-coding regions vary 372 depending on the lineage (Fig. S4). For instance, the RT similarities within the Ty1-copia 373 and within the Ty3-gypsy groups ranged between 57-74 % and 55-88 %, respectively. In 374 contrast, LTR similarities ranged between 17-47 % for Ty1-copia groups and between 11-375 55 % for Ty3-gypsy groups, respectively ( Fig. 6A-E; Fig. S4; Table 1). Ale and Tat 376 sequences are the most diversified, both within their RTs and their LTRs (Table 1). 377 378

Transcriptome analysis reveals transcriptional activity of LTR retrotransposons 379
We investigated whether the identified full-length LTR retrotransposons are transcribed into 380 RNA by analyzing the reference transcriptome database of V. macrocarpon from leaves and 381 shoot tips ( Fig. 5; Fig. S7; Table S8). We have mapped a total of 36.1 million transcript 382 sequences against the 54 reconstructed Ty3-gypsy and Ty1-copia sequences of V. 383 macrocarpon. Subsequently, we compared the genome proportion derived from the 384 RepeatExplorer analysis against the transcriptome proportion for each individual element 385 (Fig. 5). 386 All the identified full-length elements are transcribed to a different extent, thereby 387 contributing a total of 0.1 % by Ty1-copia retrotransposons and 0.03 % by Ty3-gypsy 388 retrotransposons to the used V. macrocarpon transcriptome reads ( Fig. 5; Fig. S7; Table S8). 389 Among all LTR retrotransposon lineages, Ale has the highest proportion of transcript 390 sequences ( Fig. 5; Table S8). In fact, the two reference elements SCL5_Ale and SCL22_Ale [57] maize [58], and oats [59]. 438

Cranberry LTR retrotransposons are diverse in sequence and structure 440
Overall, in the V. macrocarpon genome, Ty3-gypsy elements are on average longer and 441 contribute a higher genome proportion than Ty1-copia elements. This was also reported in 442 other plants, such as Avena sativa [59], Triticum sp. rather strictly regulated in cranberry [10]. 555  Lineages are color-coded (see internal legend). The stars indicate the position of the representative reconstructed full-length elements from V. macrocarpon genome shown in Fig. 3A. B) The boxplots illustrate the pairwise sequence identities of amino acid sequences of all Ty1-copia RTs. The lineage Alesia was excluded from the boxplots as only a single copy was detected in the V. macrocarpon genome. Colors in the boxplots correspond to panel (A). For each lineage, the total number of RT sequences (n) is provided, which includes the reference sequence.  Fig. 4A. B) The boxplots illustrate the pairwise sequence identities of amino acid sequences of all Ty3-gypsy RTs. Colors in the boxplots correspond to panel (A). For each clade, the total number of RT sequences (n) is provided, which includes the reference sequence    (Table S2). Transcriptome proportions were calculated from read mapping of the publicly available Illumina cDNA sequence reads of V. macrocarpon (accession number PRJNA246586). Major groups of the 54 elements are color-coded (see internal legend). Only the names of elements of the highest genome and/or transcriptome proportions are annotated. For details, see Table S2.