A burst of transposon expression accompanies the activation of Y chromosome fertility genes during Drosophila spermatogenesis

Transposable elements (TEs) must replicate in germline cells to pass novel insertions to offspring. In Drosophila melanogaster ovaries, TEs can exploit specific developmental windows of opportunity to evade host silencing and increase their copy numbers. However, TE activity and host silencing in the distinct cell types of the Drosophila melanogaster testis are not well understood. We reanalyzed publicly available single-cell RNA-seq datasets to quantify TE expression in the distinct cell types of the Drosophila testis. We developed a novel method for identification of TE and host gene expression programs and find that a distinct population of early spermatocytes expresses a large number of TEs at much higher levels than other germline and somatic components of the testes. This burst of TE expression coincides with the activation of Y chromosome fertility factors and spermatocyte-specific transcriptional regulators, as well as downregulation of many components of the piRNA pathway. The TEs expressed by this cell population are enriched on the Y chromosome and depleted on the X chromosome relative to other active TEs. These data suggest that some TEs may achieve high insertional activity in males by exploiting a window of opportunity for mobilization created by the activation of spermatocyte-specific and Y-chromosome-specific transcriptional programs.

have been identified (Cosby, Chang, and Feschotte 2019). In the Drosophila ovary, there is evidence that 36 some TEs propagate in permissive nurse cells and hijack the host's mRNA transport pathway to move to the 37 developing oocyte, which is more recalcitrant to TE expression (Wang et al. 2018). In another study, Dufourt et 38 al. identified a small region of mitotically dividing germline cysts where the piRNA pathway gene Piwi is 39 depleted and TE silencing is much weaker than in the surrounding cells. They termed this region the "piwiless 40 pocket" and proposed that TEs may take advantage of this niche to replicate in the Drosophila germline 41 (Dufourt et al. 2014). 42 TE replication and host silencing have been extensively studied in the Drosophila ovary, however surprisingly 43 little is known about these same phenomena in the testes. Several previous observations suggest that there 44 may be substantial differences between ovaries and testes with respect to both TE activity levels and host 45 silencing pathways. For example, multiple TE families are known to exhibit strong sex biases: The I-element, 46 P-element, and gypsy TE families are all expressed at higher levels in the female germline (Busseau et al. 47 1994; Pélisson et al. 1994; Roche, Schiff, and Rio 1995) whereas the opposite is true for the copia, micropia, 48 1731, and 412 TE families (Lankenau, Corces, and Lankenau 1994;Haoudi et al. 1997; Pasyukova et al. 1997; 49 Borie et al. 2002). The piRNA pathway is active in both somatic and germline cells in the ovary and piRNAs 50 bound by Aub and Ago3 undergo robust ping-pong amplification in the ovarian germline. In the testes, TE-51 derived piRNAs are produced in germ cells, however the vast majority (~75%) arise from the suppressor of 52 stellate [Su(Ste)] and AT-chX satellite repeats, rather than the canonical piRNA clusters that have been 53 identified in ovaries (Quénerch'du, Anand, and Toshie 2016; P. Chen et al. 2020). Furthermore, many TE 54 families show large differences in piRNA abundance between ovaries and testes (P. Chen et al. 2020) and TE-55 derived piRNAs only show a weak signature of ping-pong amplification in spermatocytes, likely due to low 56 levels or absence of Ago3 (Quénerch'du, Anand, and Toshie 2016). 57 Here we have analyzed TE expression at single-cell resolution in order to gain insight into the dynamics of TE 58 activity in Drosophila testes. We develop a novel approach for identification of TE and host gene expression 59 programs and find that a subset of primary spermatocytes expresses a diverse group of TEs at high levels 60 relative to other cell types. These TEs are co-expressed with Y-linked fertility factors and we find evidence that 61 they are more active in males compared to females. These data suggest some TEs may exploit spermatocyte-62 specific transcriptional programs and Y chromosome activation to remain active in the Drosophila 63 melanogaster genome. 64 65

67
Data processing and cell type identification 68 We reanalyzed 10x Genomics 3' single-cell expression data from a recent study examining sex chromosome 69 gene expression in D. melanogaster larval testes (Mahadevaraju et al. 2020 transposons and generated a nearest neighbors graph. We applied the Leiden algorithm (Traag,Waltman,and 83 van Eck 2019) to reveal 10 clusters, including several highly distinct clusters and several clusters with high 84 degrees of similarity, as indicated by adjacency in the UMAP embedding ( Figure 1A). We classified each of 85 these clusters using garnett (Pliner, Shendure, and Trapnell 2019) and a collection of curated markers (see 86 Methods, Supplementary Table 1) to assign each cell to a known testis cell type ( Figure 1B). 87 Our filtering approach is more conservative than applied to these data in their initial study, yielding a final 88 dataset with fewer cells than originally published (Supplementary Figure 1D). To assess the similarity of our 89 clusters with previously published clusters, we generated mean expression values per cluster for each gene 90 and computed Spearman's rank correlation for each pairwise combination of clusters from each study 91 (Supplementary Figure 1C). There is a strong correspondence between our clusters and those previously 92 identified from these data, with high pairwise correlations for every cluster previously reported, though minor 93 differences are apparent. Notably, we identify fewer distinct cyst cell clusters but more distinct spermatocyte 94 clusters than reported in the original study. 95 Similar to the previously described analysis of these data (Mahadevaraju et al. 2020), we identify distinct 96 somatic and germline clusters ( Figure 1A). We identify cyst cells (clusters 7 and 8) which express tj and wnt4 97 at high levels. Hub and terminal epithelial cells (cluster 10) are defined largely by Fas3 expression, and 98 pigment cells (cluster 9) express Sox100B ( Figure 1B). 99 The remaining cells comprise the germline components of these data. Cluster 1 contains germline stem cells 100 and early spermatogonia, marked by vasa and spn-E ( Figure 1B). A second spermatogonial cluster 101 (2/Spermatogonia) expresses spermatogonial markers such as bam and spermatocyte markers such as aly, 102 which respectively are required for GSC differentiation and initiation of a primary spermatocyte transcription 103 program. They are most transcriptionally similar to the G spermatogonia cluster identified by Mahadevaraju et 104 al. but mean normalized UMI counts for this cluster also correlate well with that study's E1 early spermatocyte 105 cluster (Supplementary Figure 1C). This observation suggests that our cluster 2 may represent spermatogonia 106 just beginning the transition to meiotic prophase or the very early spermatocytes. 107 The final four clusters (3, 4, 5, and 6) represent the majority of filtered cells ( Figure 1C) and express aly as well 108 as sa and can, which are effectors of the primary spermatocyte expression program (Beall et al. 2007; White-109 Cooper et al. 1998) ( Figure 1B). These clusters are transcriptionally similar to primary spermatocytes identified 110 previously (Supplementary Figure 1C). Mean expression in clusters 3 and 4 correlates well with the previously 111 reported early primary spermatocytes while expression in 5 and 6 correlates most highly with previously 112 reported middle and late primary spermatocyte clusters (Supplementary Figure 1C). Taken together, these 113 observations suggest that the germline clusters may be ordered from earliest to latest differentiation state by 114 the cluster numbers reported here. However, among the later putative spermatocyte clusters (4, 5, and 6) it is 115 challenging to definitively identify the differentiation order. 116 117 A spermatocyte subpopulation shows high expression of transposable elements 118 To quantify cell-type-specific TE expression in the Drosophila testis, we began by visualizing expression of all 119 TEs with at least 3 UMIs detected across all individual cells ( Figure 2). While individual somatic cyst cells and 120 pigment cells sporadically express a small number of TEs, there is no evidence for cell-type specific 121 upregulation of distinct TE families in these cells. On the other hand, a small number of TE families show high 122 expression specifically in the terminal epithelial or spermatogonia clusters. Most striking, however, are the cells 123 from cluster 3 spermatocytes, whose members uniformly express a relatively large number of TE families at 124 high levels ( Figure 2). In fact, cluster 3 spermatocytes have the most TE-derived UMIs per cell, for both depth-125 normalized and raw UMI counts (Supplementary Figure 2A, 2B). UMAP embedding and Leiden clustering was 126 performed on highly variable host genes only, suggesting that cluster 3 is transcriptionally distinct from other 127 spermatocytes independent of TE expression. 128 To verify that the detected TE expression pattern is not a technical artifact of 10X scRNA-seq, we aligned L3 129 larval testis poly-A selected RNA-seq reads generated alongside the single cell data (Mahadevaraju et al. 130 2020). Summarized pseudo-bulk expression estimates derived from the scRNA-seq data are highly concordant 131 with bulk expression both globally and with respect to TEs specifically (Supplementary Figure 1A,  approach clusters the results of many iterations of NMF to buffer the influence of outlier solutions yielded by 150 single runs of the algorithm. However, when we implemented this approach, we found that it yielded large 151 GEPs utilized by broad cell types. We therefore chose to use ICA factorization because this approach was able 152 to group genes into smaller GEPs expressed specifically by smaller cell populations. 153 We applied this consensus approach to ICA to address the issue of ICA solution randomness. We additionally 154 applied a grid search approach to choose the two most important parameters of our pipelinethe number of 155 components (k) to decompose the single cell expression matrix into and the appropriate cutoff (q) for 156 identifying the distinct members of each GEP (see Methods). 157 We ran our consensus ICA approach three times for combinations of selected k-values from 10 to 150 and q-158 value cutoffs from 0.001 to 0.16. We assessed the biological interpretability of the candidate solutions by 159 enrichment for Gene Ontology Biological Process (GO:BP) terms. We found that optimizing only for maximum 160 percentage of GO:BP-enriched GEPs yielded mostly large GEPs associated with very general biological 161 processes. Under the assumption that a maximally interpretable set of GEPs should capture a wide range of 162 biological processes and should favor discovery of minimally redundant GEPs, we then calculated two scores: 163 one based on the breadth of GO:BP enrichments in a given ICA solution and the other on the unique 164 assignments of GO terms to GEPs (see Methods). These scores are highly reproducible across replicate runs 165 of cICA (Supplementary Figure 2A). 166 We combined these two metrics into a joint score for each combination of k-and q-parameters. We selected 167 an optimal combination of k (90) and q (0.005) and used the GEPs identified from independent runs as our 168 working GEPs for this study. The GEPs identified using this approach range in size from 10 genes to over 600, 169 with 75% of identified GEPs containing 200 or fewer genes (Supplementary Figure 2B). Sixty-nine percent of 170 identified GEPs were enriched at p < 0.05 for a Biological Process GO term not enriched in any other GEP 171 (Supplementary Figure 3C). 172 Our method identified many GEPs with at least one TE included alongside host genes, but a single GEP (GEP-173 27) included over 70 transposons along with approximately 300 host genes ( Figure 3A, 3B). All major classes 174 of TEs are represented in this GEP, including LTR and non-LTR retroelements and DNA transposons. 175 Interestingly, we find that these TEs are enriched for elements located within the flamenco piRNA cluster, 176 which is involved in TE suppression in ovarian follicle cells ( on the UMAP projection and observed that it is expressed exclusively by cells in cluster 3, in agreement with 183 our visual inspection of TE expression across the dataset ( Figure 3C). These results suggest that a burst of TE 184 expression occurs in a distinct subcluster of primary spermatocytes in the larval testes. 185 We identified EAChm, a host gene TEP member highly expressed in the TEP-expressing population ( Its role in spermatogenesis is currently unknown. We next performed multiplexed RNA-FISH in whole mount 189 L3 testes for EAChm and two TEP-TEs, ACCORD2 and QUASIMODO2 ( Figure 4A) to confirm co-expression 190 of TEP-TEs with host genes in 3/Spermatocyte cells. We find that EAChm, ACCORD2, and QUASIMODO2 191 show similar spatial patterns of expression, consistent with their membership in the same gene expression 192 program ( Figure  Given that the TEP-TEs are co-expressed with Y chromosome fertility genes, we hypothesized that their 227 upregulation is due to activation of Y-linked copies of these TEs. To address this hypothesis, we first 228 investigated whether TEP-TEs do indeed have copies that are located on the Y chromosome. 229 We first used RepeatMasker to identify transposon insertions in a recently published Drosophila melanogaster 230 Iso1 strain genome assembly with improved Y chromosome content (Chang and Larracuente 2019) compared 231 with the current D. melanogaster Release 6 reference sequence. We found that 70% of TEP-TEs have at least 232 one full-length copy located on a known Y-linked scaffold (Supplementary Figure 8D) and a significantly larger 233 percentage of TEP-TE insertions are found on the Y chromosome compared with other expressed TEs (Chi-234 square test P= 2.29e-292, Figure 5A). We also estimated male-specific TE copy numbers by performing 235 Illumina whole-genome sequencing (WGS) of males and females from strain w1118. We found that TEP-TEs 236 have significantly elevated copy numbers in males, compared to females, as expected if these TEs have 237 insertions located on the Y chromosome (Wilcoxon Rank-Sum test P=0.0035, Figure 5B). 238 Given that TEP-TEs are enriched on the Y chromosome, we next assessed whether their Y-linked copies are 239 over-expressed in testes relative to their autosomal and X-linked copies. We used male and female WGS 240 reads from w1118 to identify male-specific (i.e. Y-linked) single-nucleotide variants in TEP-TEs. We then 241 compared the relative abundance of each male-specific variant in testes RNA-seq data to its relative 242 abundance in male WGS data (see Methods). A ratio larger than 1 indicates the presence of one or more Y -243 linked TE insertions that are expressed more highly than total-copy number alone would explain. For each 244 expressed TE, we found the site of the male-specific allele most overexpressed relative to WGS depth. At 245 these sites, male-specific TEP-TE alleles have significantly higher ratios of relative RNA to DNA coverage 246 compared to male-specific alleles from non-TEP-TEs (Wilcoxon Rank-Sum test P=1.7e-07, Figure 5C). To 247 confirm that this effect is specific to Y-linked TEP-TEs, we repeated our analysis using reference sites as well 248 as autosomal (i.e. present in both males and females) variants. Contrary to the Y-linked TEP-TEs, these were 249 expressed proportionately to WGS depth with no difference in expression proportion between TEP-TEs and 250 other TEs (Wilcoxon Rank-Sum test P=0.24, Supplementary Figure 8F). TEs. We found that non-TEP-TEs exhibit similar X and autosomal insertion rates across the DGRP lines 258 whereas TEP-TEs exhibit a significantly reduced frequency of X-linked insertions relative to autosomal 259 insertions, consistent with male-biased activity (Wilcoxon Rank-Sum test P= 0.029, Figure 5D). Yamashita 2019). The transcription of three of these genes, kl-5, kl-3, and ks-1, is associated with the 297 formation of large Y chromosome lampbrush loops (Bonaccorsi et al. 1988 Interestingly, another 20 TEP-TEs, including gypsy, have insertions located within the flamenco piRNA cluster. Quénerch'du, Anand, and Toshie 2016). Our analysis of scRNA-seq data confirm these findings ( Figure 5E). 319 Why do spermatocytes show a weakened piRNA response at a developmental timepoint when the TE-rich Y 320 chromosome is de-repressed? One possibility is related to intragenomic conflict. Single cell RNA-seq data was downloaded from PRJNA548742 and PRJNA518743. We used 10X Genomics 352 cellranger software to align and quantify the data (Zheng et al. 2017). We generated a cellranger index from 353 the previously described custom reference sequence using cellranger's "mkref" command with default 354 parameters. We aligned scRNA-seq reads using cellranger's "count" command with default parameters. We 355 used cellranger's filtered count matrices for further analysis. 356 We first summed counts assigned to the LTR and internal sequences of class I LTR retrotransposons. For 357 each scRNA-seq replicate, we next applied scrublet v0.2.1 (Wolock, Lopez, and Klein 2019) to these 358 unnormalized count matrices to identify and filter putative heterotypic doublets. We used scanpy v1.6.0 (Wolf,359 Angerer, and Theis 2018) to retain genes detected in at least 3 cells and then cells with at least 250 and fewer 360 than 5000 detected genes. We removed cells with more than 5% of remaining UMIs assigned to 361 mitochondrion-encoded genes. We normalized UMI counts to 10000 per cell and applied log transformation 362 with a pseudo-count of 1. We identified highly variable genes using scanpy's "highly_variable_genes" method 363 with default parameters. We next scaled counts using scanpy's "scale" method and applied scanpy's 364 "regress_out" method to remove count variance associated with cell cycle and mitochondrial UMI counts. 365 For each replicate, we used scanpy to perform principal component analysis on highly variable host genes and 366 calculate nearest neighbor graphs using 15 principal components and 25 neighbors. We called cell clusters 367 using the Leiden algorithm (Traag, Waltman, and van Eck 2019) via scanpy with a resolution parameter of 368 0.35. We combined all three larval scRNA-seq replicates using scanpy's "ingest" method. 369 Automated cell type assignment was performed using Garnett v0.2.17 (Pliner, Shendure, and Trapnell 2019) 370 and a set of curated marker genes (Supplementary Table 1). 371

Consensus ICA for GEP Detection 372
We chose ICA to identify gene expression programs because it performs highly with respect to recovering 373 known functional gene modules and because it is easily adaptable to finding partially overlapping modules 374 (Saelens, Cannoodt, and Saeys 2018). We standardized the normalized, log-transformed expression matrix to 375 have zero mean and unit variance. Standardized scores were clipped to a maximum absolute value of 10. 376 To generate stable modules that are robust to stochastically varying ICA solutions, we applied a consensus 377 approach previously applied to non-negative matrix factorization gene module detection (Kotliar et al. 2019).

378
We used FastICA via sklearn (Pedregosa et al. 2012) to decompose the standardized expression matrix into 379 90 components 100 times, then concatenated the resulting gene x module matrices, partitioned all modules 380 into 90 clusters using k-means clustering, and averaged the per-cell scores within each partition to yield a 381 consensus cell x module matrix. Within the same partitions, we averaged per-gene scores from cell x module 382 matrices to generate a consensus cell x module matrix. 383 We assigned genes to each program by applying fdrtool (Strimmer 2008) to the vector of gene weights for 384 each module. Genes with FDR q-values less than 0.005 for each module were considered members of the 385 module. 386

GEP parameter optimization 387
Use of ICA or other matrix decomposition approaches for gene program detection requires a priori 388 assumptions about the optimal number of components (k) to request from the decomposition algorithm. 389 Additionally, generation of discrete gene lists for each gene program requires application of arbitrary score 390 cutoffs to determine program membership for each gene. 391 To reduce bias and use of arbitrary cutoffs, we used a grid search approach to choose k and the q-value cutoff 392 for membership. Briefly, we ran consensus ICA in triplicate for combinations of q-value cutoffs between 0.005 393 and 0.1 and k between 20 and 120. We performed pathway enrichment analysis for each program discovered 394 in each consensus ICA replicate and for each run calculated the percentage of GO:BP terms with significant 395 enrichment as well as the percentage of programs in each run that show a unique significant enrichment. We 396 then rescaled these scores to a maximum of 1 and calculated a joint score by multiplying them together. For 397 our final set of gene programs, we ran consensus ICA a final time with the k and q-value that maximized the 398 average joint score across all three test replicates. lacking supporting female reads. 429

Total RNA-seq library preparation 430
We used approximately 100 pairs of testes from 3-5-day old mated w1118 males. The testes were dissected 431 in 1X PBS and transferred into 200 µL RNAlater Solution. Tissue was pelleted by centrifuging at 5000g for 1 432 min at 4 °C. Supernatant was removed and 300 µL 1x DNA/RNA Shield was added before homogenization 433 with an electric pestle. Homogenized tissue was digested with Proteinase K at 55 °C for at least 30 min. RNA 434 was purified with the Zymo Quick-RNA Plus Kit (R1057). 435 Using up to 5 µg total RNA, ribosomal RNAs were removed suing iTools rRNA depletion Kit from Galen 436 Laboratory Supplies (dp-P020-000007) and Thermo Fisher MyOne Streptavidin C1 Dynabeads (#65001    Tallies of TE classes found in each GEP containing at least 1 TE. GEP27 contains almost fourfold more TEs than the next most TErich GEP and is predominately composed of LTR retrotransposons (59%%), LINE (29%) and DNA (9%) elements. B) GEP27 contains over 300 total features, the vast majority of which are either proteincoding genes (68%), TEs (18.6%), or noncoding RNAs (12%). C) UMAP projection colored by GEP27 expression score. Expression score is derived from the consensus (averaged) ICA source matrix.