Centromere-associated retroelement evolution in Drosophila melanogaster reveals an underlying conflict

Centromeres are chromosomal regions essential for coordinating chromosome segregation during cell division. While centromeres are defined by the presence of a centromere-specific histone H3 variant rather than a particular DNA sequence, they are typically embedded in repeat-dense chromosomal genome regions. In many species, centromeres are associated with transposable elements, but it is unclear if these elements are selfish or if they play a role in centromere specification or function. Here we use Drosophila melanogaster as a model to understand the evolution of centromere-associated transposable elements. G2/Jockey-3 is a non-LTR retroelement in the Jockey clade and the only sequence shared by all centromeres. We study the evolution of G2/Jockey-3 using short and long read population genomic data to infer insertion polymorphisms across the genome. We combine estimates of the age, frequency, and location of insertions to infer the evolutionary processes shaping G2/Jockey-3 and its association with the centromeres. We find that G2/Jockey-3 is an active retroelement targeted by the piRNA pathway that is enriched in centromeres at least in part due to an insertion bias. We do not detect signatures of positive selection on any G2/Jockey-3 insertions that would suggest than individual copies are favored by natural selection. Instead, we infer that most insertions are neutral or weakly deleterious both inside and outside of the centromeres. Therefore, G2/Jockey-3 evolution is consistent with it being a selfish genetic element that targets centromeres. We propose that targeting centromeres helps active retroelements escape host defenses, as the unique centromeric chromatin may prevent targeting by the host silencing machinery. At the same time, centromeric TEs insertions may be tolerated or even beneficial if they also contribute to the transcriptional and chromatin environment. Thus, we suspect centromere-associated retroelements like G2/Jockey-3 reflect a balance between conflict and cooperation at the centromeres.


INTRODUCTION
located in centromeres; about 40% of all insertions are found outside of the centromere islands 93 [53]. Polymorphic G2/Jockey-3 insertions [58] and transcripts from published RNAseq datasets 94 [59][60][61] suggest this element may still be active in D. melanogaster [53]. G2/Jockey-3 is also 95 found at the centromeres in D. simulans, a close relative of D. melanogaster, suggesting that its 96 centromere association may be broadly conserved [53]. 97 98 It is unclear whether G2/Jockey-3 is an important centromere component in D. melanogaster or 99 if they are instead purely selfish TEs. If G2/Jockey-3 are important for centromere specification 100 or function, we may detect evidence for positive natural selection on centromeric copies. 101 short read datasets with >20X centromere coverage (Table S3)  in this paper. A heterochromatin-enriched assembly was originally used to discover centromere 166 islands in Chang et al. [53] and identify reference TE insertions. We used short reads from the 167 DGRP to identify reference and de novo insertions to study TE dynamics within a single 168 population from North Carolina (map source: R package usmaps https://cran.r-169 project.org/web/packages/usmap/). We generated genome assemblies for 18 samples from global 170 populations to uncover the centromere island sequences and de novo elements (map source: 171 www.outline-world-map.com). We combined the reference sequences with population 172 frequencies to infer selection with an age-of-allele test [70]. We also combined reference TE 173 sequences with de novo elements uncovered in our assemblies for phylogenetic analysis. and classified each insertion based on their presence (a "reference" insertion) or absence (a non-183 reference or "de novo" insertion, Figure 2) in our reference genome [53,73]. active families such as the F element, R1 and Roo. 197 Many active TEs are targeted by small, 24-30-nt long Piwi-interacting RNAs (piRNA) in the 198 germline [74,75]. These piRNAs target TEs for both transcriptional and post-transcriptional 199 silencing. In the D. melanogaster germline, piRNAs are produced from precursor transcripts that 200 originate from discrete loci in the genome that contain clusters of nested TEs [76] called piRNA 201 clusters [77]. Subsequent amplification of the piRNA signal comes from the piRNA-guided 202 cleavage of active TEs through the 'ping-pong' cycle [77,78] generating a signature of 10-nt 203 overlap between sense and antisense piRNAs. Consistent with G2/Jockey-3 being a recently 204 active element, we detect insertions of G2/Jockey-3 in the 38C1 (2 copies) and 38C2 (9 copies) 205 piRNA clusters, and G2/Jockey-3-derived piRNAs in the germline. G2/Jockey-3 is ranked in the 206 top 4% and top 18% of TE-derived piRNAs in testes and ovaries, respectively (Table S4) To understand if the observed enrichment of G2/Jockey3 at centromeres is due to insertion bias, 232 we studied insertion frequencies across the genome. We reasoned that an enrichment of 233 putatively young G2/Jockey-3 at centromeres may reflect an insertion preference. Young, 234 recently inserted TEs are expected to occur at lower frequencies than older TEs. We thus 235 examined the distribution of rare insertions (<10% population frequency) of G2/Jockey-3 and 236 other TE families in the DGRP. We found that rare G2/Jockey-3 insertions are significantly 237 enriched in centromeres compared to other regions relative to their size (19 centromere 238 insertions; Permutation tests: nperm = 10,000, P < 0.0001; Z-score = 26.736 for whole genome; P 239 < 0.0001; Z-score = 23.895, excluding heterochromatin windows; P < 0.0001; Z-score = 17.354, 240 excluding euchromatin windows; Figure 3). R1 elements are the only other TE enriched in 241 centromeres (4 centromere insertions; Permutation tests: nperm = 10,000, P < 0.0001; Z-score = 242 6.892 for whole genome; P < 0.0001; Z-score = 5.92, excluding heterochromatin windows; P = 243 0.0187; Z-score = 2.605, excluding euchromatin windows; Figure 3), but to a lesser degree than 244 G2/Jockey-3. Among rare insertions, G2/Jockey-3 is by far the most enriched TE in centromeres 245 (39% of all rare insertions). for several TE families in DGRP lines. Significant differences are noted as: * P < 0.05, *** P < 259 0.0001 in the permutation test. 260

261
We confirmed these patterns using an orthogonal approach with long-read genome assemblies of 262 flies from a wide geographic distribution (i.e., founder DSPR lines, GDL lines, and our reference 263 genome; Table S1). Here we identified putatively young G2/Jockey-3 insertions based on 264 sequence divergence instead of frequency. We classified young insertions as those with < 1% 265 divergence from the consensus sequence, or with < 2% divergence from the consensus and found 266 within a single genome assembly. Insertions classified as young were found in less than 50% of 267 the assemblies, had fewer unique substitutions, and were more likely to have a complete ORF 268 than older fragments (Figure 4, FET P = 3.93E-12, File S6, File S7). Consistent with our short-269 read DGRP findings (Figure 3), we found significant relative enrichment in centromeres 270 (Permutation tests nperm = 10,000: P < 0.0001; Z-score = 15.633 for whole genome; P < 0.0001; 271 Z-score = 7.505 excluding euchromatin windows), even with some of the centromere island 272 sequences missing in the long-read assemblies (Table S2). 273

274
These results suggest that G2/Jockey-3 has had at least some insertion preference for 275 centromeres. As insertion bias can be associated with a target site sequence, we searched for 276 motif enrichment at insertions sites. Although we found an enrichment of AT-rich simple motifs 277 like "TAGTTTT" and rDNA-associated sequences (e.g. 18S and non-transcribed spacers; Figure Figure S3). However, we did not observe differences 323 between the frequency spectra of the euchromatin and centromeres, suggesting most TEs may be 324 deleterious in these regions ( Figure S3). 325

326
The frequency spectra of G2/Jockey3 does not appear significantly different between 327 centromeres and heterochromatin. However, G2/Jockey-3 has more "common" insertions in 328 centromeres relative to other TEs, including Doc ( Figure 5; Figure S3). This could be due to 329 relaxed constraints allowing G2/Jockey-3 insertions to rise to higher frequencies over time or and euchromatin. Fisher's Exact Tests between regions, Bonferroni adjusted two-tailed P-value * 338 P < 0.05; *** P < 0.001; ns is not significant. 339

G2/Jockey-3 ages suggest a historical burst of activity 340 341
Centromeric G2/Jockey-3 copies are generally older-within a range of 2-7% divergence from 342 the consensus-than those in heterochromatin (MWU, P < 0.001, Figure 6A), but not 343 significantly different than euchromatic copies (MWU, P > 0.5; Figure 6A). This pattern 344 suggests that most centromeric copies of G2/Jockey-3 resulted from a historical burst of activity. 345 This burst of activity is consistent with general patterns observed at other non-LTR retroelements 346 [72,88]. Other TE families show a similar age distribution between heterochromatin and the 347 centromeres ( Figure S4). These copies are also older and highly fragmented. In general, both the 348 lack of recombination and lack of strong purifying selection against TE insertions in centromeres 349 and heterochromatin may allow these insertions to persist over longer evolutionary time periods 350 [89]. 351

352
We can combine information about the TE's age and frequency to infer if, and how, its history is 353 consistent with natural selection. In the absence of selection, the frequency of a TE insertion 354 should correlate with its age: e.g., older TEs have more time to drift to higher population 355 frequencies than young TEs and thus accumulate more unique mutations. Natural selection can 356 cause deviations from expected frequencies based on the age. While negative selection might 357 cause TEs to have lower population frequency than expected given their age, positive selection 358 might increase their frequencies. We therefore used an age-of-allele test [70] to detect whether 359 G2/Jockey-3 frequencies deviate from our expectations. We estimated age as the number of 360 unique substitutions for each TE copy based on the reference genome and we estimated 361 frequency from the 28 high-coverage DGRP samples, controlling for demographic history [70]. 362 For this test, we need to compare insertions spanning a range of ages to identify individual 363 insertion frequencies that deviate from expectations [70]. We thus examine G2/Jockey-3 and 364 other TE families that have centromere insertions. We find that G2/Jockey-3 ( Figure 6B, File 365 S9), like the majority of TEs, segregates either at the expected frequencies or lower than 366 expected, given their age ( Figure S6, File S9). This suggests that many G2/Jockey-3 copies, both 367 within and outside the centromeres, are either neutral or weakly deleterious. Therefore 368 Whitney U tests *** P < 0.001, ns is not significant. B) The difference between observed and 391 expected frequency of G2/Jockey-3 copies in the DGRP. The P-values are calculated as the sum 392 of the probabilities that an insertion is less frequent than expected (if observed -expected < 0) or 393 more frequent than expected (observed -expected > 0). * P < 0.1, a stringent threshold given the 394 reduced power to detect deviations from neutrality as stated in Blumenstiel et al. [70]. 395

Centromere island haplotype structure not likely driven by G2/Jockey-3 insertions 396 397
To detect signatures of natural selection on centromere haplotypes and any associations with 398 G2/Jockey-3 copies, we examined patterns of nucleotide variation. Centromeres and 399 pericentromeric regions in D. melanogaster generally do not recombine by crossing over [91,92] 400 therefore the effects of linked selection in non-recombining regions should lead to low levels of 401 nucleotide diversity. We thus expect to see reduced genetic variation in the centromeres and high 402 levels of linkage disequilibrium. We quantified genetic variation in the centromere islands for 403 DGRP samples with high centromere coverage (> 20X, see Methods). As expected, nucleotide 404 diversity is greatly reduced in every centromere compared to the major chromosome arms 405 (MWU P < 2.0E-07; Figure S7, Table S5) except for chromosome 4, which does not recombine, 406 and its centromere and chromosome arms are similar (MWU p = 0.56; Figure S8, Table S5). We 407 find that Tajima's D, a summary of the site frequency spectrum, is also typically lower in 408 centromere islands ( Figure  The Y chromosome centromere is unique in that it has the largest centromere island containing 421 nearly half of all G2/Jockey-3 fragments, but no identifiable satellite sequences [53]. We cannot 422 use the genomic re-sequencing data to study polymorphic insertions on the Y chromosome, as 423 the data are either female-only or mixed sex and thus lower coverage. However, our high-quality 424 reference assembly provides an opportunity to infer the organization and evolutionary history of 425 Y-linked G2/Jockey-3. We inferred a phylogenetic tree of all Y-linked copies of G2/Jockey-3. 426 Based on this phylogeny, we classified centromeric Y-linked G2/Jockey-3 copies into four broad 427 classes ( Figure S12). Class 1 G2/Jockey-3 correspond to young and mostly long insertions, class 428 2 to short (< 160 bp) 3' end fragments, class 3 to old internal fragments missing the 5' and 3' 429 terminal ends, and class 4 to generally long and old 3' end fragments divided into two subclasses 430 (4 and 4A) based on position (Figure 9, Figure S12). We mapped these classifications onto our Y 431 chromosome assembly and estimated pairwise divergence between all G2/Jockey-3 copies in the 432 Y centromere (Figure 7) and elsewhere on the Y chromosome ( Figure S13). Each class seems to 433 undergo independent duplication events, however at a larger scale we do not identify higher 434 order repeat domains (Figure 7). This structure differs from other known centromeres (e.g. 435 primates), which are organized into higher order repeats, the youngest repeats are found in the 436 center and the older, more divergent repeats, are at the outer edges of the array [51,52]. By 437 contrast, in the D. melanogaster Y centromere we observe that regions enriched with young 438 G2/Jockey-3 copies punctuate regions rich with older copies (Figure 9), suggesting that recent, 439 local rearrangements affect centromere organization. We estimated gene conversion rates of G2/Jockey-3 on the Y centromere with respect to their 455 classes (Supplemental Text). Our estimates for gene conversion are on the order of 1E-6 per site 456 per generation (Table S6) (Table S4). 502

503
The G2/Jockey-3 insertion polymorphism patterns and piRNA observations are most consistent 504 with some underlying selfish behavior typical among TE families, except that this element can 505 target centromeres. Although we are unsure of the precise targeting mechanism, we suspect the 506 target is the centromere chromatin rather than any particular DNA sequence.  Figure 8A). First, insertions are unlikely to be strongly disfavored, 518 as the function of centromeres is unlikely to be disrupted by a single insertion event. Second, 519 centromeric insertions may escape host silencing mechanisms. Small RNAs target TEs to limit 520 potentially harmful TE activity by inducing H3K9 methylation and thus transcriptional silencing 521 (reviewed in [110][111][112]). However, CENP-A structure and posttranslational modifications differ 522 from H3 [109], and the host silencing machinery is not known to target CENP-A. Therefore, 523 centromeres may offer a 'safe haven' for some TEs not just because they can avoid removal by 524 purifying selection, but also because TEs may temporarily escape targeting by small RNAs (e.g. 525 piRNAs, siRNAs) and alternative host defenses (e.g. DNA methylation in pericentromeres [113], 526 hypomethylated DNA in humans [52,114] limited amount of H3K9me2/3 (e.g., [13]). We propose that insertion of active TEs like 550 G2/Jockey-3 may contribute to this unique CENP-A/H3 chromatin at the centromeres ( Figure  551 8A). While landing in CENP-A nucleosomes may allow the TE to temporarily escape silencing, 552 if an active TE lands in an H3 nucleosome region of the centromere, it may be targeted for 553 silencing through H3K9 methylation (e.g. piRNA-or siRNA-mediated silencing) and thus 554 contribute to H3K9me2/3 domains within the centromere. Third, TEs may help mitigate other 555 conflicts that happen at centromeres. Some centromeric sequences have the capacity to be selfish 556 and bias their transmission through female meiosis in a process known as meiotic drive [131]. 557 Driving centromeres generally have expansions of satellite DNA sequences that can recruit more 558 kinetochore proteins and cause biased segregation toward the egg pole [132, 133] ( Figure 8B). 559 Meiotic drive at the centromeres can lead to rapid evolution of both centromere DNA and 560 proteins that work to restore fair segregation [134,135]. There are parallel pathways to suppress 561 meiotic drive involving rebalancing kinetochore proteins and spreading heterochromatin to limit 562 the CENP-A domain [136]. While centromere drive and suppression models predict that the 563 flanking heterochromatin domains are the main participants in the conflict [136], the 564 H3K9me2/3 domains within the centromeres might also contribute. Active centromeric TEs that 565 are targeted by small RNA pathways, like G2/Jockey-3, could lead to spreading of H3K9me2/3 566 heterochromatin from within the centromere ( Figure 8C). In this way, the centromeric TEs and 567 'internal' centromeric H3K9 domains may also be a substrate for suppressors of centromere 568 drive to stop the unchecked spread of CENP-A caused by meiotic drivers. 569 Even if TEs cooperate with centromeres, this may be an uneasy alliance, as there could still be an 570 underlying conflict. For example, the invasion of H3K9me2/3 heterochromatin in centromeres is 571 potentially dangerous, as it may evict too much CENP-A and silence a centromere [5]. 572 Therefore, the centromeric chromatin, and the relative abundance of CENP-A and H3 573 nucleosomes, may mediate the balance between conflict and cooperation between active TEs and 574 centromeres. 575 576 In many species across the tree of life, centromeres are associated with TEs [18,[23][24][25][26][27][28][29][30][31][32][33][34]53]. The 577 connection between TEs and centromeres is supported not just by TE sequences embedded in 578 and around present-day centromeres, but also in the origins of some proteins that function in 579 centromere biology (e.g. similarity between centromere protein CENP-B and Pogo family 580 transposons [137]). This relationship may represent a delicate balance between conflict and 581 cooperation between the centromeres and the repetitive selfish genetic elements that occupy 582 these genome regions. We expect this balance to vary depending on genomic context, level of 583 TE activity, and presence and frequency of centromere meiotic drive. If TEs contribute 584 positively to centromere function, many types of elements may be sufficient to provide the right 585 'environment'. In this case, we would not necessarily expect to see a signature of selection on 586 any one insertion or type of insertion. While G2/Jockey-3 is currently the principal element 587 occupying the centromere niche in D. melanogaster, it may be replaced over time by another 588 type of repeat. In our model, any repeat that can be transcribed and can recruit H3K9me2/3 (e.g. 589 is targeted by small RNAs) can contribute to centromere function. Detecting the mode, tempo, 590 and target of selection in non-recombining and dynamic environments like the centromeres is 591 difficult. Exploring the evolution of centromere composition in closely related species will 592 provide additional insights necessary to test these hypotheses. 593 To detect piRNAs corresponding to G2/Jockey-3, we trimmed the small RNA-seq reads for 645 adaptors and excluded the reads that align to rRNA or tRNA. We aligned the remaining reads to 646 the heterochromatin-enriched genome assembly [53,73] using Bowtie [144], and then used a 647 customized python script as described in Wei et al. [83] (Table S3). We retained samples 671 which had a centromere coverage greater than 20X for our analyses (see Supplemental Text for 672 more information). 673

674
We used the same short-read FASTQ files from the DGRP to quantify genetic diversity between 675 the centromere islands and the rest of the genome using a best practices GATK pipeline v4. Drosophila TE library with the updated G2/Jockey-3 consensus sequence [138] (File S10). We 710 identified centromere island sequences in assemblies using a combination of methods to find 711 sequences corresponding to the centromere islands identified in our reference genome [53]. 712 These include aligning assemblies to the centromere islands with MUMMER v.3.23 nucmer 713 [157], identification of similar sequences with BLAST 2.10.0+ [158], and identifying contigs 714 enriched in G2/Jockey-3 or satellite DNA repeats enriched around the centromere islands [53]. 715 We also used MEME [159] to detect motifs enriched in the flanking regions of young insertions 716 in our assemblies. Additional details of our analysis with long-read data, the results, and the 717 MEME motif analysis are in the Supplemental Text. 718 719 Insertion polymorphism validation 720 721 We validated TE insertion predictions in three ways: 1) we used short reads to confirm TE 722 predictions in long read assemblies (see Supplemental Text); 2) we used long read assemblies to 723 confirm TE predictions in short read data (see Supplemental Text); and 3) we selected a subset to 724 validate with PCR (File S3). We used the results from these analyses to establish criteria for 725 including a predicted insertion in our analyses. For PCR, we selected a subset of 59 G2/Jockey-3 726 copies to confirm in four GDL lines; B59, I23, T29A, and ZH26. We included copies from each 727 chromosome (X, 2, 3, 4), each chromatin region (euchromatin, heterochromatin, and 728 centromere), and both reference and non-reference copies. We were able to detect 33 copies 729 including 13 copies in the centromere and seven de novo insertions (File S3). 730 We used permutation tests to test for significant differences in density of recent G2/Jockey-3 731 insertions between genome regions. For these tests, the genome was divided into 90-kb windows 732 (average size of a centromere, excluding the Y chromosome and 2 nd chromosome) and 733 categorized as centromere, heterochromatin, or euchromatin. We estimated G2/Jockey-3 counts 734 in each window based on final insertion polymorphisms detected by short-read and long-read 735 datasets separately. We excluded the Y chromosome from these analyses. To test for enrichment 736 of G2/Jockey-3 counts within categories, we permuted the datasets 10,000 times using regioneR 737 We extracted all G2/Jockey-3 sequences from the annotated assemblies using BEDTools v2.29.2 744 [139] and classified the insertions as reference or non-reference (de novo). We classified 745 insertions as homologous to those in our reference if they were both fragments of similar size 746 (+/-10 bp) and within +100 bp of the same location in the reference genome. We identified 747 corresponding locations by aligning the assemblies to the reference genome with nucmer within 748 MUMMER v.3.23 [157]. Insertions were classified as de novo if they differed in fragment size 749 length and position of reference insertions, found outside the reference fragment position range, 750 and located within breaks in the nucmer alignment to the reference genome. 751 752 We performed phylogenetic analyses for candidate G2/Jockey-3 sequences identified in our 753 genome assemblies. We used G2/Jockey-3 sequences from the reference and the non-reference 754 insertions discovered in the DSPR and GDL assemblies. If a non-reference insertion was found 755 in multiple assemblies, we randomly selected an assembly from which to extract that insertion to 756 include in our analysis. Prior to phylogenetic analysis, we filtered sequences to exclude 757 fragments shorter than 400 bp or ~10% of the length of a full G2/Jockey-3 element. We aligned 758 the sequences with MAFFT [161] with the D. simulans G2/Jockey-3 consensus as the outgroup 759 and inferred a maximum likelihood phylogenetic tree with RAxML v8.2.11 [162]. To study the 760 evolution of G2/Jockey-3 on the Y chromosome, we inferred a maximum likelihood phylogeny 761 of all fragments exclusive to the Y chromosome regardless of size. We determined the relative 762 age of individual insertions using divergence from the consensus from the Repeatmasker output 763 and the number of substitutions unique to a fragment divided by fragment length. "Young" 764 insertions were either < 1% diverged from the consensus sequence or de novo insertions 765 exclusive to a single genome assembly and < 2% diverged from the consensus sequence. We The probability distribution is also conditioned on population demography of North American D. 798 melanogaster populations outlined in the original paper [70]. North American D. melanogaster 799 populations experienced two bottlenecks in their migration from Africa [167,168] that led to 800 non-equilibrium TE evolution [70]. We compared the probability distribution to the frequency of 801 insertions detected in the 28 DGRP samples with >25X average coverage of the centromere 802 islands. We excluded insertions in the Y chromosome given the high rate of duplication. The 803 expected frequency of each insertion in the population is the probability of observing an insertion 804 as frequent or more frequent within the population multiplied by the observed number of 805 insertions in our population sample. We calculated significant deviations from neutrality as 806 probability of observing as many or fewer, or as many or greater, copies < 0.1.