ABSTRACT
Transcriptomes are dynamic, with cells, tissues, and body parts expressing particular sets of transcripts. Transposons are a known source of transcriptome diversity; however, studies often focus on a particular type of chimeric transcript, analyze single body parts or cell types, or are based on incomplete transposon annotations from a single reference genome. In this work, we have implemented a method based on de novo transcriptome assembly that minimizes the potential sources of errors while identifying a comprehensive set of gene-TE chimeras. We applied this method to head, gut and ovary dissected from five Drosophila melanogaster natural populations, with individual reference genomes available. We found that 18.6% of body part specific transcripts are gene-TE chimeras. Overall, chimeric transcripts contribute a median of 38% to the total gene expression, and they provide both DNA binding and catalytic protein domains. Our comprehensive dataset is a rich resource for follow-up analysis. Moreover, because transposable elements are present in virtually all species sequenced to date, their relevant role in spatially restricted transcript expression is likely not exclusive to the species analyzed in this work.
INTRODUCTION
In contrast to the genome, an animal’s transcriptome is dynamic, with cell types, tissues and body parts expressing particular sets of transcripts1–4. The complexity and diversity of the transcriptome arises from the combinatorial usage of alternative promoters, exons and introns, and polyadenylation sites. A single gene can, therefore, encode a rich repertoire of transcripts that can be involved in diverse biological functions, and contribute to adaptive evolution and disease (e.g., 5–8). The potential contribution of transposable element (TE) insertions to the diversification of the transcriptome was analyzed soon after the first whole-genome sequences were available9–13. TEs are present in virtually all genomes studied to date, are able to insert copies of themselves in the genome and, although their mutation capacity is often harmful, they also represent an important source of genetic variation14–17. While transposable elements are a known source of transcriptome diversity, the majority of studies so far rely on incomplete transposon annotations from a single reference genome (e.g., 12). Moreover, methodologies are often specifically designed for particular types of chimeric gene-TE transcripts, e.g. TE-initiated transcripts18, or particular types of TEs, e.g. L1 chimeric transcripts19, or have been applied to individual cell types or body parts (e.g., 20,21). As such, our knowledge of the contribution of TEs to gene novelty is still partial.
Two of the most studied mechanisms by which TEs can generate chimeric transcripts are by providing alternative promoters and protein domains. In human and mouse, 2.8% and 5.2% of the total transcript start sites occurred within retrotransposons22. In D. melanogaster, over 40% of all genes are expressed from two or more promoters, with at least 1,300 promoters contained in TEs23. As well as individual examples of TEs providing protein domains24–26, a comparative genomic analysis of tetrapod genomes revealed that capture of transposase domains is a recurrent mechanism for novel gene formation27. There is also evidence for the retrotransposon contribution to protein novelty. Approximately 9.7% of endogenous retrovirus open reading frames across 19 mammalian genomes evolve under purifying selection and are transcribed, suggesting that they could have been co-opted as genes28. Across insects, and depending on the methodology used, the percentage of newly emerged domains (<225 mya) due to TEs was estimated to be 1.7% to 6.6%29. However, studies that identify and characterize a comprehensive set of gene-TE chimeras to provide a complete overview of their contribution to both transcriptome and protein diversification are still missing.
Besides describing the diverse contributions of TEs to the transcriptome, analyzing the relative contribution of gene-TE chimeras to the total gene expression is highly relevant, as it is informative of the potential functional relevance of the transcripts identified. Studies performed so far suggest that this contribution is related to the position of the TE in the transcript. Transcripts with a TE inserted in the 5’UTR or internal coding exons show significantly lower mean levels of expression compared with non-chimeric transcripts20. TEs inserted in 3’UTRs were associated with reduced gene expression both in humans and mice, but with increased gene expression in human pluripotent stem cells20,22. In addition, whether specific TE types contribute to tissue-specific expression has been explored in mammals, where retrotransposons were found to be overrepresented in human embryonic tissues22,30. In D. melanogaster, the contribution of TEs to tissue specific expression has only been assessed in the head, with 833 gene-TE chimeric genes described21. Thus, whether the contribution of chimeric gene-TE transcripts is more relevant in the D. melanogaster head compared with other body parts is still an open question.
Within genes, TEs could also affect expression by changing the epigenetic status of their surrounding regions. In Drosophila, repressive histone marks enriched at TEs spread beyond TE sequences, which is often associated with gene down-regulation31. However, there is also evidence that TEs containing active chromatin marks can lead to nearby gene overexpression32. Genome-wide, the joint assessment of the presence of repressive and active chromatin marks has been restricted so far to the analysis of four TE families33 and has never been carried out in the context of chimeric gene-TE transcripts.
In this work, we performed a high-throughput analysis to detect, characterize, and quantify chimeric gene-TE transcripts in RNA-seq samples from head, gut, and ovary dissected from the same individuals belonging to five natural strains of D. melanogaster (Figure 1A)34. We implemented a method based on de novo transcriptome assembly that (i) minimizes the potential sources of errors when detecting chimeric gene-TE transcripts; and (ii) allows the identification of a comprehensive dataset of transcripts rather than focusing on particular types (Figure 1B)35. Additionally, we assessed the coding potential and the contribution of chimeric transcripts to protein domains and gene expression as proxies for their integrity and functional relevance. Finally, we took advantage of the availability of ChIP-seq data for a repressive and an active histone mark, H3K9me3 and H3K27ac, respectively, obtained from the same biological samples, to investigate whether the TEs that are incorporated into the transcript sequences also affect their epigenetic status.
RESULTS
10% of D. melanogaster transcripts, across body parts and strains, are gene-TE chimeras
We performed a high-throughput analysis to detect and quantify chimeric gene-TE transcripts in RNA-seq samples from head, gut, and ovary, in five D. melanogaster strains collected from natural populations (Figure 1A). The three body parts were dissected from the same individuals, and an average of 32x (22x to 43x) per RNA-seq sample was obtained (3 replicates per body part and strain, Table S1)36. We de novo assembled transcripts in which we annotated TE insertions using the new D. melanogaster manually curated TE library34. We only considered de novo transcripts that overlap with a known transcript obtained from a reference guided assembly (Figure 1B). We then used the reference genome of each strain to define the exon-intron boundaries of each transcript and to identify the position of the TE in the transcript (Figure 1B). The alignment with the reference genome and the accurate TE annotation also allowed us to discard single-unit transcripts, indicative of pervasive transcription, and TE autonomous expression, which are two important sources of errors when quantifying the contribution of TEs to gene novelty (Figure 1B)35.
Overall, considering all the transcripts assembled in the three body parts and the five strains, we identified 2,169 chimeric gene-TE transcripts belonging to 1,250 genes (Table S2A). Thus, approximately 10% (2,169/21,786) of D. melanogaster transcripts contain exonic sequences of TE origin. In individual strains, this percentage ranged from 5.4% to 6.7% (842-1,013 chimeric transcripts per genome), indicating that most of the chimeric gene-TE transcripts are strain-specific, as expected given that the majority of TEs are present at low population frequencies (Figure 1C)34. While the overall contribution of TEs to the transcriptome is 10%, TEs contribute 18.6% (1,295/6,959) of the total amount of body part specific transcripts (Figure 1C).
We identified two groups of chimeric gene-TE transcripts (Figure 1D). The first group contains chimeric transcripts which have a TE overlapping with the 5’UTR, the 3’UTR, or introducing alternative splice (AS) sites (overlap and AS insertions group: 977 chimeric transcripts from 655 genes). While TEs have been reported to introduce non-canonical splice motifs21, we found that the majority of the TEs in the overlap and AS insertions group were adding a canonical AS motif (65.2%: 172/264) (Table S2B). The second group contains chimeric gene-TE transcripts in which the TE is annotated completely inside the UTRs or internal exons (internal insertions group: 1,587 transcripts from 890 genes) (Figure 1D). We hypothesized that this group could be the result of older insertions that have been completely incorporated into the transcripts. Indeed, we found that TEs in this group are shorter than those of the overlap and AS insertion group, as expected if the former are older insertions (75.99% vs. 23.75%; test of proportions, p-value < 0.001; Figure S1; see Methods). Additionally, while the majority of gene-TE transcripts in the overlap and AS insertions group were strain-specific, we found more transcripts shared between strains than strain-specific in the internal insertions group (test of proportions, p-value < 0.001; Figure S2A and Table S2C). This observation is also consistent with this group being enriched for older insertions, and remained valid when we removed the shorter insertions (test of proportions, p-value < 0.001; Table S2C).
To test whether the overlap and AS insertions and the internal insertions groups contribute differently to the diversification of the transcriptome, we performed all the subsequent analyses considering all the chimeric transcripts together, and the two groups separately. In addition, because shorter insertions might be enriched for false positives, i.e., not corresponding to real TE sequences due to the difficulty of annotating these repetitive regions, we also performed the analysis with the subset of chimeric gene-TE transcripts that contains a fragment of a TE insertion that is ≥120bp (831/977 and 628/1587 for the overlap and AS insertions and the internal insertions groups, respectively; see Methods).
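The distinction between the two groups and the ≥120bp filter can be illustrated with a minimal sketch. It assumes, for illustration only, that each exon and TE annotation is reduced to a half-open coordinate interval on the same transcript; the names `Interval`, `classify_te`, and `te_fragment_length` are hypothetical and do not correspond to the pipeline described in Methods.

```python
from dataclasses import dataclass

MIN_TE_FRAGMENT_BP = 120  # length threshold used for the stricter subset


@dataclass(frozen=True)
class Interval:
    start: int  # 0-based, inclusive
    end: int    # exclusive


def te_fragment_length(exon: Interval, te: Interval) -> int:
    """Number of exonic bases covered by the TE annotation (0 if disjoint)."""
    return max(0, min(exon.end, te.end) - max(exon.start, te.start))


def classify_te(exon: Interval, te: Interval) -> str:
    """Hypothetical rule: 'internal' if the TE lies completely inside the
    exon/UTR, 'overlap' if it only partially overlaps it, 'none' otherwise."""
    if te_fragment_length(exon, te) == 0:
        return "none"
    if exon.start <= te.start and te.end <= exon.end:
        return "internal"
    return "overlap"


exon = Interval(100, 500)
print(classify_te(exon, Interval(200, 300)))   # TE fully inside the exon
print(classify_te(exon, Interval(50, 150)))    # TE crossing the exon boundary
print(te_fragment_length(exon, Interval(50, 150)) >= MIN_TE_FRAGMENT_BP)
```

A real implementation would also need to handle splice-site-introducing insertions and multi-exon transcripts, which this sketch deliberately ignores.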
Gene-TE chimeric transcripts are more abundant in the head
Using high-throughput methodologies, 833 chimeric genes were identified in the D. melanogaster head21; however, the relative amount of chimeric gene-TE transcripts across body parts has never been assessed before. We found that the majority of the assembled chimeric gene-TE transcripts across the five strains analyzed were body part specific (60%: 1,295/2,169), with only 17% (368) shared across all three body parts (Figure 2A and Table S3A). The same pattern was found for the overlap and AS insertions group and for the internal insertions group, when considering all insertions and those ≥120bp (Figure S2B and Table S3A).
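The body part specific versus shared breakdown reported above amounts to counting in how many body parts each transcript is detected. The following is a minimal sketch under the assumption that expression calls are available as one set of transcript identifiers per body part; the function name and the toy identifiers are hypothetical.

```python
def body_part_breakdown(expressed: dict[str, set[str]]) -> dict[str, int]:
    """Count transcripts detected in one, some, or all body parts."""
    n_parts = len(expressed)
    all_ids = set().union(*expressed.values())
    counts = {"specific": 0, "shared_some": 0, "shared_all": 0}
    for transcript_id in all_ids:
        n = sum(transcript_id in ids for ids in expressed.values())
        if n == 1:
            counts["specific"] += 1
        elif n == n_parts:
            counts["shared_all"] += 1
        else:
            counts["shared_some"] += 1
    return counts


# Toy input (made-up identifiers, not data from this study)
toy = {
    "head": {"t1", "t2", "t3", "t4"},
    "gut": {"t3", "t4", "t5"},
    "ovary": {"t4", "t6"},
}
print(body_part_breakdown(toy))

# Percentages quoted in the text, recomputed from the reported counts
print(round(100 * 1295 / 2169), round(100 * 368 / 2169))
```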
Head was the body part expressing the most chimeric transcripts (1,459) followed by gut (1,068) and ovary (884) (Figure 2A and Table S3A). Note that 208 of the chimeric transcripts identified in this work were previously described by Treiber and Waddell (2020)21. After accounting for differences in the total number of transcripts assembled in each body part, we still observed that the head was expressing more chimeric transcripts compared to gut and ovary (8.54% head vs. 6.61% gut and 7% ovary; test of proportions, p-value = 3.89×10⁻¹¹ and 2.14×10⁻⁷, respectively; Table S3B). On the other hand, the proportion of total transcripts that are chimeric was similar between gut and ovary (test of proportions, p-value = 0.337) (Table S3C). A higher proportion of chimeric transcripts in head compared with gut and ovary was also found when the overlap and AS insertions and the internal insertions groups were analyzed separately, although in this last group the proportion across body parts is similar if we focus on ≥120bp insertions (Figure 2B and Table S3C). Overall, the same patterns were also found at the strain level, except for JUT-011 and MUN-016, where some comparisons were not significant (Table S3C).
Finally, the head was also the body part that expressed the most body part specific chimeric transcripts (48% head vs. 29% gut; test of proportions, p-value < 0.001, and vs. 30% ovary, p-value < 0.001), while no differences were found between gut and ovary (30% ovary vs. 29% gut; test of proportions, p-value = 0.7; Figure 2A). In the three body parts, these proportions were higher than the total proportion of body part specific transcripts (21.3%, 13.1% and 9.4%, for head, gut and ovary respectively; test of proportions, p-values < 0.001 for all comparisons; Table S3B).
Most chimeric transcripts contain TE insertions in the 3’UTRs
Chimeric gene-TE transcripts are enriched for TE insertions located in the 3’UTRs in D. melanogaster and in mammals12,13,20. Consistently, we also found that most of the chimeric gene-TE transcripts contain a TE in the 3’UTR (1,084 transcripts from 662 genes) followed by internal exons (924 transcripts from 529 genes) and insertions in the 5’ UTRs (703 transcripts from 499 genes). Note that 34 of the 5’ UTR insertions detected in this work were experimentally validated in a previous analysis that estimated the promoter TE usage across developmental stages in D. melanogaster23. Indeed, the number of chimeric genes with a TE inserted in the 3’ and 5’ UTRs is higher than expected when taking into account the proportion of the genome that is annotated as UTRs, while there is a depletion of TEs in internal exons (test of proportions, p-value < 0.001 in the three comparisons; Table S4A). It has been hypothesized that the higher number of insertions in 3’ UTRs could be explained by lack of selection against insertions in this gene compartment11,12. We thus tested whether 3’UTR chimeric transcripts were enriched for TE insertions present in more than one genome. However, we found an enrichment of unique insertions in 3’UTR chimeric transcripts suggesting that they might be under purifying selection (test of proportions, p-value = 0.033; Figure 3A and Table S4B).
While in the overlap and AS insertions group, TE insertions were also mainly located in the 3’ UTRs (53.4%: 260/487), in the internal insertions group there were more chimeric transcripts with TE insertions found in internal exons than in the 3’UTR (448 vs. 343; test of proportions, p-value < 0.001). This pattern still holds when we only consider ≥120bp insertions (166 vs. 125; test of proportions, p-value = 0.047; Table S4C). Figure 3B shows the number of chimeric gene-TE transcripts globally and by insertion group, body part and strain (Table S4D) where it can be observed that, overall, the previous patterns hold at the body part level.
Chimeric gene-TE transcripts are enriched for retrotransposon insertions
We assessed the contribution of TE families to chimeric gene-TE transcripts. We found that the majority of TE families, 111/146 (76%), were detected in chimeric gene-TE transcripts, as has been previously described in head chimeric transcripts (Table S5A)21,34. Although retrotransposons are more abundant than DNA transposons (61% on average in the five genomes analyzed34), the contribution of retrotransposons to the chimeric gene-TE transcripts was higher than expected (81%: 90/111; test of proportions, p-value < 0.001; Table S5B). There were slightly more families contributing to the overlap and AS insertions group than to the internal insertions group (98 vs. 82, respectively, test of proportions, p-value = 0.01), but both groups were enriched for retrotransposons (test of proportions, p-value < 0.001 and p-value = 0.0179, respectively; Table S5C). More than half of these families (64: 57.7%) contribute to chimeric transcripts in all body parts, while 24 families were body part-specific, with 12 being head-specific, 6 gut-specific and 6 ovary-specific (Table S5A).
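As an illustration of the kind of comparison behind a test of proportions, the sketch below applies a one-sample z-test to the retrotransposon counts given above (90 of 111 families) against the 61% genomic abundance. This is a simplification: treating 61% as a fixed null proportion and using a normal approximation without continuity correction are assumptions made here, and the exact test variant used in the paper is specified in Methods, so the resulting p-value need not match the reported one exactly.

```python
import math


def one_sample_proportion_z(successes: int, n: int, p0: float) -> tuple[float, float]:
    """Normal-approximation z-test of an observed proportion against p0.

    Returns the z statistic and the two-sided p-value.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


# 90 of the 111 TE families found in chimeras are retrotransposons;
# 0.61 is the average genomic retrotransposon fraction quoted in the text.
z, p_value = one_sample_proportion_z(90, 111, 0.61)
print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```

Under these assumptions the p-value falls well below 0.001, in line with the enrichment stated in the text.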
The most common TE families found were roo (33.2%) and INE-1 (25.8%) (Figure 4). Indeed, these two families were over-represented in the chimeric transcripts dataset when compared to their abundance in the genome: roo in the five strains (test of proportions, p-value < 0.0001 for all comparisons) and INE-1 in AKA-017 and SLA-001 (test of proportions, p-value = 0.004, and p-value < 0.0001, respectively) (Table S5D). Roo and INE-1 were also the most common families both in the overlap and AS insertions group (16.3% and 24.4%, respectively) and in the internal insertions group (44.8% and 29.4%, respectively). The same pattern was found when we analyzed only those chimeric transcripts with TEs ≥120bp (Figure S3 and Table S5E).
Because roo insertions were enriched in all the strains analyzed, we further investigated these TE sequences. We found only two types of roo insertions: solo LTRs (23 insertions), which all belong to the overlap and AS insertions group, and a short (45bp-217bp) low complexity sequence mapping to positions 1,052-1,166 of the canonical roo element (see Methods). This short roo sequence is more common in the internal insertions group than in the overlap and AS insertions group (911 vs. 61 insertions, respectively). Note that a recent analysis by Oliveira et al.43 also found this same region of the roo consensus sequence to be the most abundant in chimeric gene-TE transcripts across four D. melanogaster strains. The authors evaluated whether these short sequences were widespread repeats across the genome. They found that the majority of the roo fragments they identified (97.45%) have only one blast hit in the genome, suggesting that they are not. We reasoned that if these low complexity regions have a roo origin, at least some of them should also have a blast hit with a roo insertion. To test this, we used less strict blast parameters compared with Oliveira et al.43 and found that 57 of the low complexity regions have a roo element insertion as the second best hit and 148 have a roo insertion in the top 5 hits, suggesting that indeed some of these sequences have a clear roo origin (Table S5F). Furthermore, we also tested whether this low complexity region was present in the roo consensus sequence from a closely related species, D. simulans, and found that this was the case, strongly suggesting that this low complexity sequence is an integral part of the roo element.
We further investigated why this roo low complexity region was incorporated into genes. Because TEs can contain cis-regulatory DNA motifs, we performed a motif scan of the low complexity sequence from the canonical roo element. We found a C2H2 zinc finger factor motif repeated six times in this region. Note that this motif is only found once in the roo consensus sequence outside the low complexity region. A scan in the roo sequences from the chimeras revealed that 78% (753/972) of the transcripts with the low complexity roo sequence contain at least one copy of this zinc finger motif, with 26% (196/753) containing 3 or more (Table S5G).
Chimeric gene-TE transcripts contribute a median of ~38% of the total gene expression
Besides identifying and characterizing chimeric gene-TE transcripts, we quantified the level of expression of both chimeric and non-chimeric transcripts genome-wide. We focused on transcripts with ≥1 TMM in at least one of the samples analyzed (1,779 out of 2,169 chimeric transcripts, corresponding to 86% (1,074/1,250) of the genes; see Methods). We found that chimeric gene-TE transcripts have lower expression levels than non-chimeric transcripts (17,777; Wilcoxon’s test, p-value < 0.001, Figure 5A). This is in contrast with previous observations in human pluripotent stem cells that reported no differences in expression between chimeric and non-chimeric transcripts20. We dismissed the possibility that the lower expression of chimeric gene-TE transcripts was driven by the roo low complexity region identified in 995 of the chimeric transcripts (Wilcoxon’s test, p-value < 0.0001; Figure 5A). Lower expression of the chimeric gene-TE transcripts was also found at the body part and strain levels and when we analyzed the overlap and AS insertions and internal insertions groups separately (Wilcoxon’s test, p-value < 0.001 for all comparisons; Figure 5A and Table S6A).
We further tested whether TEs inserted in different gene locations differed in their levels of expression compared with the non-chimeric transcripts. We found that chimeric transcripts had significantly lower expression than non-chimeric transcripts regardless of the insertion position (Wilcoxon’s test p-value < 0.001 for all comparisons; Figure 5A). Furthermore, insertions in the 3’UTR appeared to be more tolerated than those in 5’UTR and internal exons, as their expression level was higher (Wilcoxon’s test, p-value < 0.005 for both comparisons; Figure 5A). Our results are consistent with those reported by Faulkner et al.22, who also found that 3’UTR insertions reduced gene expression.
If we focus on the chimeric genes, 24% of them (259 genes) only expressed the chimeric gene-TE transcript (in all the genomes and body parts where expression was detected). Most of these genes (70%) contain short TE insertions and accordingly most of them belong to the internal insertions group (93%) (test of proportions, p-value < 0.001). For the other 76% (815) of the genes, we calculated the average contribution of the chimeric gene-TE transcript to the total gene expression per sample. While for some genes the chimeric transcripts contributed only ~4% of the total gene expression, for others they accounted for >90% (median = 22.7%) (Figure 5B). The median contribution to gene expression of the internal insertions group is higher than that of the overlap and AS insertions group, when considering all the insertions (25% vs. 14.3%, respectively; Wilcoxon’s test, p-value < 0.001), and when analyzing only those transcripts with ≥120bp insertions (20% vs. 14.29%, respectively; Wilcoxon’s test, p-value = 0.0015). Considering only the transcripts that do not contain the roo low complexity sequence, the median contribution to gene expression of the internal insertions group was still 20%. Overall, taking all chimeric genes into account (1,074), the median contribution of the chimeric gene-TE transcripts to the total gene expression was 38%.
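The per-gene contribution described above can be sketched as the chimeric share of a gene's summed transcript expression, averaged over samples, followed by a median across genes. The data layout (one dictionary of transcript TMM values per sample) and the function name are assumptions made for this illustration, and the toy values are invented rather than taken from this study.

```python
from statistics import mean, median


def chimeric_contribution(samples: list[dict[str, float]],
                          chimeric_ids: set[str]) -> float:
    """Average, across samples in which the gene is expressed, of
    (summed TMM of chimeric transcripts) / (summed TMM of all transcripts)."""
    fractions = []
    for tmm_by_transcript in samples:
        total = sum(tmm_by_transcript.values())
        if total == 0:
            continue  # gene not expressed in this sample
        chimeric = sum(value for transcript, value in tmm_by_transcript.items()
                       if transcript in chimeric_ids)
        fractions.append(chimeric / total)
    if not fractions:
        raise ValueError("gene is not expressed in any sample")
    return mean(fractions)


# Toy genes with made-up TMM values
gene_a = [{"a_chim": 2.0, "a_ref": 6.0}, {"a_chim": 1.0, "a_ref": 3.0}]
gene_b = [{"b_chim": 5.0}]  # only the chimeric transcript is expressed

per_gene = [
    chimeric_contribution(gene_a, {"a_chim"}),
    chimeric_contribution(gene_b, {"b_chim"}),
]
print(per_gene, median(per_gene))
```

Genes such as `gene_b`, which express only the chimeric transcript, contribute a value of 1.0, which is why including them raises the overall median relative to the subset of genes that also express non-chimeric transcripts.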
Finally, we evaluated whether there are differences between the expression levels of body part-specific and body part-shared chimeric transcripts. The breadth of expression, measured as the number of tissues in which a gene is expressed, is significantly and positively correlated with the level of expression in Drosophila44 and humans45. Consistent with this, we found that body part-shared chimeric transcripts have significantly higher expression levels than chimeric transcripts expressed in only one body part (Wilcoxon’s test, p-value < 0.001; Table S6B), when considering the whole dataset and for chimeric transcripts with insertions ≥120bp (Wilcoxon’s test, p-value < 0.001; Table S6B). Since we observed that the head was expressing more chimeric transcripts (Figure 2A), we next assessed if head-specific chimeric transcripts were also expressed at higher levels. We observed that the median expression of head-specific chimeric transcripts was higher than that of gut-specific chimeric transcripts (median head = 5.18 TMM [n = 527], median gut = 3.8 TMM [n = 205]; Wilcoxon’s test, p-value = 0.0021), but lower than that of ovary-specific chimeric transcripts (median ovary = 8.52 TMM [n = 210]; Wilcoxon’s test, p-value = 1.35×10⁻⁵). However, this is similar to the expression level of genes in these tissues (median of gene expression in ovary>head>gut: 20.2>9.7>8.5).
Interestingly, strain-shared chimeric transcripts (expressed in the five strains) also have significantly higher expression levels than strain-specific chimeric transcripts (Wilcoxon’s test, p-value < 0.001; Table S6C).
11.4% of the TEs within chimeric gene-TE transcripts could also be affecting gene expression via epigenetic changes
We tested whether TEs that are part of chimeric transcripts could also be affecting gene expression by affecting the epigenetic marks. We used ChIP-seq experiments previously performed in our lab for the three body parts in each of the five strains analyzed for two histone marks: the silencing mark H3K9me346,47 and H3K27ac, related to active promoters and enhancers48,49. We focused on polymorphic TEs because for these insertions we can test whether strains with and without the insertion differed in the epigenetic marks (755 genes). For the majority of these genes (534), we did not observe consistent epigenetic patterns across samples with and without the TE insertion, and these genes were not further analyzed. Additionally, 86 genes did not harbor any epigenetic marks while 49 genes contained the same epigenetic mark(s) (H3K27ac, H3K9me3, or both marks) in strains with and without that particular TE insertion (Table S7). Overall, we observed a consistent change in the epigenetic status associated with the presence of the TE for only 11.4% (86/755) of the genes. This percentage is similar for the overlap and AS group and the internal insertion group (10.4% and 11.8%, respectively). The majority of TEs showing consistent changes in their epigenetic status were associated with gene down-regulation (50/86; Table 1). While 70% (534/755) of the genes analyzed were expressed in the head, only 57% (49/86) differed in their epigenetic marks (test of proportions, p-value = 0.03).
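The four outcomes described above (inconsistent patterns, no marks, same marks, consistent change) can be expressed as a simple decision rule. The sketch below is one possible formalization, assuming each strain is summarized by the set of histone marks detected at the gene and a flag for TE presence; the exact criteria used in the analysis are given in Methods, and the function name and toy strain labels are hypothetical.

```python
def epigenetic_status(marks_by_strain: dict[str, frozenset[str]],
                      has_te: dict[str, bool]) -> str:
    """Classify a gene by comparing marks in strains with and without the TE."""
    with_te = {marks for strain, marks in marks_by_strain.items() if has_te[strain]}
    without_te = {marks for strain, marks in marks_by_strain.items() if not has_te[strain]}
    # Every strain within a group must show the same set of marks
    if len(with_te) != 1 or len(without_te) != 1:
        return "inconsistent"
    (marks_with,) = with_te
    (marks_without,) = without_te
    if marks_with == marks_without:
        return "no marks" if not marks_with else "same marks"
    return "consistent change"


# Toy gene: the TE is present in strains s1 and s2 only (made-up data)
marks = {
    "s1": frozenset({"H3K9me3"}),
    "s2": frozenset({"H3K9me3"}),
    "s3": frozenset(),
    "s4": frozenset(),
}
te_presence = {"s1": True, "s2": True, "s3": False, "s4": False}
print(epigenetic_status(marks, te_presence))

# The four reported categories add up to the 755 genes analyzed
print(534 + 86 + 49 + 86)
```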
Gene-TE chimeric transcripts are enriched for DNA binding molecular functions involved in metabolism and its regulation, and development
To gain insight into the biological processes and molecular functions in which the gene-TE chimeric transcripts are involved, we performed a gene ontology (GO) clustering analysis50. We analyzed the chimeric genes detected in each body part separately, using as a background the total genes assembled in the corresponding body part. We found that chimeric genes are enriched in general cell functions, such as metabolism and its regulation, and development (Figure 6A and Table S8A). Some functions are particular to a body part, e.g. response to stimulus and signaling in the head, anatomical structure development and regulation, and signaling and communication in the gut, and cellular component organization in the ovary. Note that the overlap and AS insertions group is enriched for cellular component organization, and nucleosome and cilium assembly and organization, across tissues (Figure 6A and Table S8C).
Finally, regarding the molecular function, chimeric genes are enriched for DNA binding processes and RNA polymerase II transcription across body parts (Figure 6B and Table S8B), while in head they are also enriched for transmembrane transporter activity and in ovary for transcription factor activity.
Both DNA transposons and retrotransposons add functional protein domains
We next assessed whether TE sequences annotated in internal exons provided functional domains. We first confirmed, using the Coding Potential Assessment Tool (CPAT51) software, that the majority of chimeric protein-coding gene-TE transcripts that have a TE annotated in an internal exon have coding potential (95.12%: 858/902; Table S9A). Using PFAM52, we identified a total of 27 PFAM domains in 36 different chimeric transcripts from 29 genes (Table 2 and Table S9B). These 27 domains were identified in 24 TE families, with 16 TE families providing more than one domain. The size of these domains ranged from 9bp to 610bp (mean of 123.5bp; Table S9B). Note that 10 of these 29 chimeric genes have been previously described in the literature (Table 2). Most of the transcripts (67%: 24/36) belong to the overlap and AS insertions group. Finally, we found chimeric transcripts adding domains in the three tissues analyzed (Table 2), with an enrichment in ovary compared to head (test of proportions, p-value = 0.027).
The majority of TEs adding domains were retrotransposons (22/29) and most TEs provided a nearly-full domain (24/29, ≥50% coverage), including 9 TEs adding a full-size domain (Table 2). About 31% (9/29) of the chimeric genes are related to gene expression functions and 17% (5/29) are related to cell organization and biogenesis (Table S9C). The majority of these chimeric genes (21/29) have evidence of expression, ranging from 1.05 to 47.14 TMM (Table 2, median = 8.26 TMM). The median expression was higher for the transcripts with complete domains compared to those with partial or incomplete domains (median TMM 22.16 vs. 9.03), although the difference was not statistically significant (Wilcoxon’s test, p-value = 0.08). The majority of TEs for which the population TE frequency has been reported are fixed or present at high frequencies (12/22 TEs; Table 2).
We assessed if the domains detected in the TE fragment of the gene-TE chimera were also found in the consensus sequence of the TE family. Because most TE families were providing more than one domain, in total we analyzed 54 unique domains. We were able to find the domain sequence for 50 unique domains from 20 TE consensus sequences (Table S9D). Note that for five of these domains (from four TEs), we had to lower PFAM detection thresholds to detect them (see Methods). The four domains that were not identified in the consensus sequences were smaller than the average (ranging from 18bp to 101bp, mean: 62.25bp) and were not detected in the chimeric fragments as full domain sequences.
A PFAM domain enrichment analysis using dcGO53, considering domains annotated as nearly-full and found in transcripts expressed with a minimum of 1 TMM, found enrichment of the molecular functions nucleic acid binding (6 domains, FDR = 4.12×10⁻⁴) and catalytic activity, acting on RNA (4 domains, FDR = 4.12×10⁻⁴) (Table 3). All the enriched domains are found in retrotransposon insertions. Consistent with the enrichment of the molecular functions, these domains were enriched in the nuclear body and in regulation of mRNA metabolic process (Table 3).
DISCUSSION
TEs contribute to genome innovation by expanding gene regulation, both of individual genes and of gene regulatory networks, enriching transcript diversity, and providing protein domains (e.g., reviewed in Chuong et al.59 and Modzelewski et al.60). While the role of TEs as providers of regulatory sequences has been extensively studied, their contribution to transcriptome diversification and protein domain evolution has been less characterized. In this work, we have identified and characterized chimeric gene-TE transcripts across three body parts and five natural D. melanogaster strains, and we have quantified their contribution to total gene expression and to protein domains. While previous studies were hindered by the incomplete annotation of TEs in the genome studied12,21, in this work, we took advantage of the availability of high-quality genome assemblies and genome annotations for five natural strains to carry out an in depth analysis of gene-TE chimeric transcripts34. We found that TEs contribute 10% to the global transcriptome and 18% to the body part specific transcriptome (Figure 1). Contrary to other studies that mostly focus on a single type of chimeric gene-TE transcript, we investigated a comprehensive dataset of chimeras. Indeed, we found that besides insertions affecting the transcription start site, transcript termination, and adding spliced sites (overlap and alternative splicing insertions), we also identified a substantial number of TE sequences that were completely embedded within exons (internal insertions; Figure 1D). These two types of chimeric gene-TE transcripts shared many properties, e.g. they were enriched for body part specific transcripts and for retrotransposons (Figure S2B and Figure 4), and they showed lower expression levels than non-chimeric transcripts (Figure 5A), suggesting that they both should be taken into account when analyzing the contribution of TEs to gene novelty. 
The internal insertions group contributed more to total gene expression (Figure 5B); however, we dismissed the possibility that this increased expression was due to shorter TE insertions, which are more likely to be enriched for false annotations compared with longer insertions34. We found, based on both size and frequency, that the internal insertions group is likely to be enriched for older insertions. As such, a higher level of expression of these likely older TEs is consistent with previous observations in tetrapods suggesting that over time gene-TE chimeric transcripts often become the primary or sole transcript for a gene27. Overall, and taking into account only those gene-TE chimeric transcripts with evidence of expression, we found 155 (8.6%) insertions disrupting the coding capacity, 415 (22.9%) affecting the coding capacity, 314 (17.3%) and 591 (32.6%) affecting the 5’ and the 3’ end of the gene, respectively, while 338 (18.6%) affected multiple transcript positions.
Our finding that TEs contribute to the expansion of the head transcriptome supports the results of Treiber and Waddell (2020)21, suggesting that ~6% of genes produce chimeric transcripts in the head due to exonization of a TE insertion. However, because we also analyzed gut and ovary, we further show that TEs can significantly contribute to the expansion of the transcriptomes of other body parts as well (Figure 2). The observation that there are more chimeric transcripts in the head is consistent with a higher transcriptional complexity in the Drosophila nervous system tissues3. The fact that chimeric gene-TE transcripts tend to be tissue-specific could be especially relevant for adaptive evolution, as tissue-specific genes can free the host from pleiotropic constraints and allow the exploration of new gene functions45,61,62.
Finally, we identified a total of 27 TE protein domains co-opted by 29 genes (Table 2 and Table S24). Ten of these genes have been previously described as chimeric based on high-throughput screenings or individual gene studies, with some of them, e.g. CHKov1 and nxf2, having functional effects54–56 (Table 2). The majority of the domains were present in the TE consensus sequences (Table S9D). Furthermore, the 27 domains identified were enriched for nucleic acid binding and catalytic activity, acting on RNA molecular functions (Table 3). Although there is evidence for DNA binding domains being recruited to generate new genes, previous data come from a comparative genomic approach across tetrapod genomes that focused on DNA transposons as a source of new protein domains27. The available data on the genome-wide contribution of retrotransposons to protein domains are so far restricted to endogenous retroviruses in mammals28. In our dataset, which includes both DNA transposons and retrotransposons, the enrichment for DNA binding domains and for catalytic activity is indeed driven by the retrotransposon insertions (Table 2). Although most of the TEs providing protein domains identified in this work for the first time were present at low population frequencies, four were fixed and two were present at high population frequencies, and are thus good candidates for follow-up functional analysis (Table 2).
Although we have detected more chimeric transcripts than any prior D. melanogaster study to date, our estimate of the potential contribution of TEs to the diversification of the transcriptome is likely to be an underestimate. First, and as expected, we found that the contribution of TEs to the transcriptome is body part specific22,30 (60%, Figure S2B) and strain-specific34 (48%, Figure S2A). Thus, analyzing other body parts and increasing the number of genomes analyzed will likely identify more chimeric gene-TE transcripts. Second, although our estimate is based on the highly accurate annotations of TE insertions performed using the REPET pipeline34, highly diverged and fragmented TE insertions are difficult for any pipeline to annotate accurately and, as such, might go undetected63,64. Still, the combination of an accurate annotation of chimeric gene-TE transcripts with expression data across tissues and the investigation of protein domain acquisition carried out in this work not only significantly advances our knowledge of the role of TEs in gene expression and protein novelty, but also provides a rich resource for follow-up analysis of gene-TE chimeras.
MATERIAL AND METHODS
Fly stocks
Five D. melanogaster strains obtained from the European Drosophila Population Genomics Consortium (DrosEU) were selected according to their different geographical origins: AKA-017 (Akaa, Finland), JUT-011 (Jutland, Denmark), MUN-016 (Munich, Germany), SLA-001 (Slankamen, Serbia) and TOM-007 (Tomelloso, Spain).
RNA-seq and ChIP-seq data for three body parts
RNA-seq and ChIP-seq data for the five strains were obtained from 36. A full description of the protocols used to generate the data can be found in 36. Briefly, head, gut and ovary body parts of each strain were dissected at the same time. Three replicates of 30 females (4-6 days old) each were processed per body part and strain. RNA-seq library preparation was performed using the TruSeq Stranded mRNA Sample Prep kit from Illumina, and libraries were sequenced using Illumina 125 bp paired-end reads (26.4M-68.8M reads; Table S1). For ChIP-seq, libraries were prepared using the TruSeq ChIP Library Preparation Kit. Sequencing was carried out on an Illumina HiSeq 2500 platform, generating 50 bp single-end reads (22.2M-59.1M reads; Table S1).
Transcriptome assembly
Reference-guided transcriptome assembly
To perform reference-guided transcriptome assemblies for each body part and strain (15 samples), we followed the protocol described in Pertea et al.40 using HISAT239 (v2.2.1) and StringTie40 (v2.1.2). We used D. melanogaster r6.31 reference gene annotations65 (available at: ftp://ftp.flybase.net/releases/FB201906/dmel_r6.31/gtf/dmel-all-r6.31.gtf.gz, last accessed: October 2020). We first used the extract_splice_sites.py and extract_exons.py Python scripts, included in the HISAT2 package, to extract the splice sites and exon information from the gene annotation file. Next, we built the HISAT2 index using hisat2-build (argument: -p 12), providing the splice sites and exon information obtained in the previous step in the --ss and --exon arguments, respectively. We mapped the RNA-seq reads (from the fastq files, previously analyzed with FastQC66) with HISAT2 (using the command hisat2 -p 12 --dta -x). The output sam files were sorted and converted into bam files using samtools67 (v1.6). We then used StringTie to assemble the transcripts, with the optimized parameters for D. melanogaster provided in 68 to perform an accurate transcriptome assembly: stringtie -c 1.5 -g 51 -f 0.016 -j 2 -a 15 -M 0.95. Finally, stringtie --merge was used to join all the annotation files generated for each body part and strain. We used gffcompare (v0.11.2) from the StringTie package to compare the generated assembly with the reference D. melanogaster r6.31 annotation; the sensitivity and precision at the locus level were 99.7 and 98.5, respectively.
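The command sequence described above can be sketched as a small Python helper that only builds the argument lists, without running anything. This is an illustrative sketch, not the authors' script: the function names, file names and index prefix are hypothetical, the -1/-2/-S, -G and -o arguments are assumptions about how inputs and outputs were passed, and the extract_splice_sites.py, extract_exons.py and stringtie --merge steps are omitted.

```python
import shlex


def index_command(genome_fa, splice_sites, exons, index_prefix, threads=12):
    """Build the hisat2-build call with splice-site and exon information."""
    return ["hisat2-build", "-p", str(threads),
            "--ss", splice_sites, "--exon", exons,
            genome_fa, index_prefix]


def sample_commands(sample, index_prefix, reads_1, reads_2,
                    annotation_gtf, threads=12):
    """Build the mapping, sorting and assembly calls for one sample.

    Output file names are hypothetical and derived from the sample name.
    """
    sam, bam, gtf = f"{sample}.sam", f"{sample}.bam", f"{sample}.gtf"
    return [
        # Spliced alignment; --dta tailors the output for transcript assemblers.
        ["hisat2", "-p", str(threads), "--dta", "-x", index_prefix,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # Coordinate-sort and convert to BAM.
        ["samtools", "sort", "-o", bam, sam],
        # Assembly with the parameter values quoted in the text.
        ["stringtie", bam, "-G", annotation_gtf,
         "-c", "1.5", "-g", "51", "-f", "0.016",
         "-j", "2", "-a", "15", "-M", "0.95", "-o", gtf],
    ]


if __name__ == "__main__":
    for cmd in sample_commands("head_AKA-017", "dmel_index",
                               "reads_1.fastq", "reads_2.fastq",
                               "dmel-all-r6.31.gtf"):
        print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it straightforward to pass them to subprocess.run for each of the 15 samples while keeping the parameter values in one place.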
De novo transcriptome assembly
A de novo transcriptome assembly was performed using Trinity38 (v2.11.0) with the following parameters: --seqType fq --samples_file <txt file with fastq directory> --CPU 12 --max_memory 78G --trimmomatic. To keep reliable near full-length transcripts, we used blastn69 (v2.2.31) to assign each de novo transcript to a known D. melanogaster transcript obtained from the reference-guided transcriptome assembly. Next, the script analyze_blastPlus_topHit_coverage.pl from the Trinity toolkit was used to evaluate the quality of the BLAST results. We followed a conservative approach that only kept transcripts with a coverage higher than 80% with a known D. melanogaster transcript, thus keeping 144,099 transcripts across all samples.
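The coverage filter can be illustrated with a minimal Python sketch. It is not the Trinity script itself: it assumes one alignment per hit, takes the best hit per de novo transcript by bit score, and measures coverage as the fraction of the known transcript spanned by the alignment. The Hit record and all identifiers are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str       # de novo transcript identifier
    subject: str     # known (reference-guided) transcript identifier
    s_start: int     # 1-based alignment start on the subject
    s_end: int       # 1-based alignment end on the subject
    s_len: int       # full length of the subject transcript
    bitscore: float


def subject_coverage(hit):
    """Percentage of the known transcript spanned by the alignment."""
    span = abs(hit.s_end - hit.s_start) + 1
    return 100.0 * span / hit.s_len


def keep_near_full_length(hits, min_cov=80.0):
    """Keep the top hit per query if it covers more than min_cov percent."""
    best = {}
    for h in hits:
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {q: h for q, h in best.items() if subject_coverage(h) > min_cov}


hits = [
    Hit("denovo_1", "ref_A", 1, 900, 1000, 500.0),
    Hit("denovo_1", "ref_B", 1, 100, 1000, 50.0),
    Hit("denovo_2", "ref_C", 1, 500, 1000, 300.0),
]
kept = keep_near_full_length(hits)
```

In this toy input, only denovo_1 passes, because its top hit spans 90% of ref_A while the top hit of denovo_2 spans 50% of ref_C.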
Identification and characterization of chimeric gene-TE transcripts
We focused on the set of assembled de novo transcripts that passed the coverage filtering to identify putative chimeric gene-TE transcripts. We tried to minimize the possible sources of confounding errors by excluding transcripts that did not overlap a known transcript (tagged by StringTie as possible polymerase run-on or intergenic). To annotate TEs in the de novo assembled transcripts, we used RepeatMasker41 (v4.1.1) with parameters -norna -nolow -s -cutoff 250 -xsmall -no_is -gff and a manually curated TE library34. Note that RepeatMasker states that a cutoff of 250 will guarantee no false positives41. We excluded transcripts for which the entire sequence corresponded to a transposable element, indicative of the autonomous expression of a TE. To infer the exon-intron boundaries of the transcript, we used minimap242 (v2.17) with arguments -ax splice --secondary=no --sam-hit-only -C5 -t4 to align the transcript to the genome of the corresponding strain from which it was assembled. We excluded single-transcript unit transcripts, which could be indicative of pervasive transcription or non-mature mRNAs. With this process, we obtained the full-length transcript from the genome sequence.
We ran RepeatMasker again (same parameters) on the full-length transcripts to annotate the full TEs and obtain the length of the insertion. Finally, we used an ad hoc bash script to define the TE position within the transcript and define the two insertion groups: the overlap and AS insertions group and the internal insertions group. In the overlap and AS insertions group, the TE overlaps with the first (5’UTR) or last (3’UTR) exon, or overlaps with an exon-intron junction and thus introduces alternative splice sites (see Splice sites motif scan analysis). The internal insertions group corresponds to TE fragments detected inside exons.
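The grouping rule can be sketched in Python as an interval comparison. This is a simplified illustration, not the authors' bash script: it assumes sorted, half-open exon coordinates on a single strand, treats any TE touching the first or last exon as belonging to the overlap and AS group, and uses hypothetical label names.

```python
def classify_te(exons, te):
    """Assign a TE interval to an insertion group.

    exons: sorted list of (start, end) half-open intervals of one transcript.
    te:    (start, end) half-open interval of the annotated TE.
    """
    te_start, te_end = te
    # Indices of exons that share at least one base with the TE.
    hit = [i for i, (s, e) in enumerate(exons) if te_start < e and s < te_end]
    if not hit:
        return "no_exon_overlap"
    # The TE touches the first or the last exon of the transcript.
    first_or_last = any(i in (0, len(exons) - 1) for i in hit)
    # The TE extends beyond an exon boundary, i.e. spans an exon-intron junction.
    crosses_junction = any(
        te_start < exons[i][0] or te_end > exons[i][1] for i in hit
    )
    if first_or_last or crosses_junction:
        return "overlap_and_AS"
    return "internal"


exons = [(100, 200), (300, 400), (500, 600)]
groups = {
    "inside_middle_exon": classify_te(exons, (320, 380)),
    "inside_first_exon": classify_te(exons, (150, 180)),
    "across_junction": classify_te(exons, (380, 450)),
    "intronic": classify_te(exons, (220, 280)),
}
```

Under these assumptions, only a TE fully contained in a middle exon is labelled internal; how TEs fully contained in a terminal exon were handled in the original analysis is not specified here.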
TE insertion length
As mentioned above, for each chimeric gene-TE transcript, we obtained the length of the TE insertion from the TE annotation in the full-length transcript. We considered insertions shorter than 120 bp to be short insertions34.
Splice sites motif scan analysis
We followed the approach of Treiber and Waddell (2020)21 to detect the splice acceptor and splice donor sites in the alternative splice (AS) insertions subgroup of chimeric gene-TE transcripts. In brief, we randomly extracted 11-12 bp of 500 known donor and acceptor splice sites from the reference D. melanogaster r6.31 genome. Using the MEME tool70 (v5.3.0), we screened for the donor and acceptor motifs in these two sets of sequences, using default parameters. The obtained motifs were then searched in the predicted transposon-intron breakpoint positions of our transcripts using FIMO71 (v5.3.0), with a significance threshold of p < 0.05.
Roo analyses
Identification of the position of the roo sequences incorporated into gene-TE chimeric transcripts in the roo consensus
To determine the position of the roo insertions, we downloaded the roo consensus sequence from FlyBase65 (version FB2015_02, available at https://flybase.org/static_pages/downloads/FB2015_02/transposons/transposon_sequence_set.embl.txt.gz). We extracted the roo fragments detected in the chimeric gene-TE transcripts using bedtools getfasta72 (v2.29.2), and used blastn69 (v2.2.31) with parameters -dust no -soft_masking false -word_size 7 -outfmt 6 -max_target_seqs 1 -evalue 0.05 -gapopen 5 -gapextend 2 to determine the matching position in the consensus sequence.
Identification of transcription factor binding sites in roo sequences
We retrieved from JASPAR73 (v2022) the models for 160 transcription factor binding site (TFBS) motifs of D. melanogaster. We used FIMO71 (v5.3.0) to scan for TFBS in the repetitive roo sequence from the consensus sequence (region: 1052-1166), as well as in the fragments incorporated in the gene-TE chimeras, with a significance threshold of 1 × 10⁻⁴.
Genome-wide BLAST analysis of roo low complexity sequences
We performed a BLAST search with blastn69 (v2.2.31) (with parameters: -dust no -soft_masking false -outfmt 6 -word_size 7 -evalue 0.05 -gapopen 5 -gapextend 2 -qcov_hsp_perc 85 -perc_identity 75). Next, we used bedtools intersect72 (v2.29.2) with the gene and transposable element annotations to determine where the matches occur. We analyzed the top 20 matches of each blastn search.
Identification of D. simulans roo consensus sequence
We obtained a superfamily-level transposable element library for D. simulans using REPET. We used blastn69 (v2.2.31) with a minimum coverage and percentage of identity of 80% (-qcov_hsp_perc 80 -perc_identity 80) to find the sequence corresponding to the roo family. Then, we again used blastn69 (with parameters -qcov_hsp_perc 80 -perc_identity 80 -dust no -soft_masking false -word_size 7 -max_target_seqs 1 -evalue 0.05 -gapopen 5 -gapextend 2) to check whether the roo sequence from D. simulans contained the repetitive region present in the D. melanogaster roo consensus sequence. The roo consensus sequence from D. simulans is available in the GitHub repository (https://github.com/GonzalezLab/chimerics-transcripts-dmelanogaster).
Retrotransposons and DNA transposons enrichment
We used the percentage of retrotransposons and DNA transposons in the genomes of the five strains provided in Rech et al. (2022)34 and performed a test of proportions to compare this percentage with the percentage of retrotransposons and DNA transposons detected in the chimeric gene-TE transcripts dataset.
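One way to carry out such a comparison is a two-sample z-test of proportions, sketched below in Python. The text does not state which implementation was used, so this normal approximation without continuity correction is an assumption, and the counts in the example are hypothetical placeholders rather than values from this study.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions.

    x1 of n1 and x2 of n2 are the counts of the category of interest
    (e.g. retrotransposons) in the two sets being compared.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 80 of 100 TEs in chimeras vs 60 of 100 TEs genome-wide.
z, p_value = two_proportion_z_test(80, 100, 60, 100)
```

A chi-square test of proportions with continuity correction would give slightly more conservative p-values on the same counts.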
Expression level estimation
To estimate the level of expression of the whole set of assembled transcripts, we used the script align_and_estimate_abundance.pl from the Trinity package38 (v2.11.0), using salmon74 as the estimation method. We next used the script abundance_estimates_to_matrix.pl from the Trinity package to obtain the level of expression of transcripts using the TMM normalization (Trimmed Mean of M values). For each transcript, the expression levels of the three replicates were averaged. For the analyses, we considered transcripts with a minimum expression level of one TMM. Genes were categorized in three groups: (i) genes that were never detected as producing chimeric isoforms, (ii) genes that were always detected as producing chimeric gene-TE transcripts and (iii) genes producing both chimeric and non-chimeric isoforms. For the latter type of genes, we calculated the fraction of the total gene expression that comes from the chimeric transcript.
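The per-gene calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each transcript is represented as a hypothetical (is_chimeric, replicate TMM values) pair, the one-TMM threshold is applied to the replicate mean, and the category labels are invented for the example.

```python
from statistics import mean


def expressed_transcripts(transcripts, min_tmm=1.0):
    """Average replicates and keep transcripts with a mean of at least min_tmm."""
    averaged = [(is_chimeric, mean(reps)) for is_chimeric, reps in transcripts]
    return [(c, v) for c, v in averaged if v >= min_tmm]


def gene_category(transcripts, min_tmm=1.0):
    """Label a gene as never chimeric, always chimeric, or both."""
    flags = {c for c, _ in expressed_transcripts(transcripts, min_tmm)}
    if not flags:
        return "not_expressed"
    if flags == {True}:
        return "always_chimeric"
    if flags == {False}:
        return "never_chimeric"
    return "both"


def chimeric_fraction(transcripts, min_tmm=1.0):
    """Fraction of the total gene expression coming from chimeric isoforms."""
    expressed = expressed_transcripts(transcripts, min_tmm)
    total = sum(v for _, v in expressed)
    if total == 0:
        return None
    return sum(v for c, v in expressed if c) / total


# Hypothetical gene with one chimeric and two non-chimeric isoforms.
gene = [
    (True, [2.0, 4.0, 3.0]),    # mean 3.0 TMM
    (False, [9.0, 9.0, 9.0]),   # mean 9.0 TMM
    (False, [0.1, 0.2, 0.3]),   # below the 1 TMM threshold, ignored
]
category = gene_category(gene)
fraction = chimeric_fraction(gene)
```

Here the chimeric isoform accounts for 3.0 of 12.0 TMM, so the gene falls in the third group with a chimeric fraction of 0.25.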
Coding capacity assessment
We assessed whether protein-coding chimeric gene-TE transcripts can produce a protein by using the Coding Potential Assessment Tool (CPAT) software51 with default parameters. CPAT has been optimized for the prediction of coding and non-coding isoforms in Drosophila. Thus, we used a coding probability cutoff of 0.39 (ref. 51).
PFAM scan of domain analysis and enrichment
To scan for PFAM domains52 in the TEs detected in an internal exon, we extracted the TE sequence from the chimeric transcripts using bedtools getfasta72 (v2.29.2), translated it to the longest ORF using getorf75 (EMBOSS v6.6.0.0), and scanned it using the script pfam_scan.pl52,76 (v1.6) to identify any of the known protein family domains of the Pfam database (version 34). We used the dcGO enrichment online tool53 to perform an enrichment analysis of the PFAM domains detected.
We scanned the consensus TE sequences for the domains present in TE fragments detected in the chimeric transcripts using pfam_scan.pl52,76 (v1.6). If the domain was not detected using the pfam_scan.pl default parameters, we lowered the hmmscan e-value sequence and domain cutoffs to 0.05.
ChIP-seq peak calling
ChIP-seq reads were processed using fastp77 (v0.20.1) to remove adaptors and low-quality sequences. Processed reads were mapped to the corresponding reference genome using the readAllocate function (parameter: chipThres = 500) of the Perm-seq R package78 (v0.3.0), with bowtie79 (v1.2.2) as the aligner and the CSEM program80 (v2.3) in order to try to define a single location for multi-mapping reads. In all cases, bowtie was run with the default parameters selected by Perm-seq.
Then, we used the ENCODE ChIP-Seq caper pipeline (v2, available at: https://github.com/ENCODE-DCC/chip-seq-pipeline2) in histone mode, using bowtie2 as the aligner, disabling pseudo replicate generation and all related analyses (argument chip.true_rep_only = TRUE) and pooling controls (argument chip.always_use_pooled_ctl = TRUE). MACS2 peak caller was used with default settings. We used the output narrowPeak files obtained for each replicate of each sample to call the histone peaks. To process the peak data and keep a reliable set of peaks for each sample, we first obtained the summit of every peak and extended it ±100bp. Next, we kept those peaks that overlapped in at least 2 out of 3 replicates (following 81) allowing a maximum gap of 100bp, and merged them in a single file using bedtools merge72 (v2.30.0). Thus, we obtained for every histone mark of each sample a peak file. We considered that a chimeric gene-TE transcript had a consistent epigenetic status when the same epigenetic status was detected in at least 80% of the samples in which it was detected.
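The peak post-processing and the consistency criterion can be sketched in Python. This is a simplified stand-in for the bedtools-based processing, not the authors' code: windows from all replicates are pooled, windows closer than the maximum gap are chained into one region, and a region is kept if it contains windows from at least two replicates. Function names and status labels are hypothetical.

```python
from collections import Counter


def consensus_peaks(replicate_summits, flank=100, max_gap=100, min_reps=2):
    """Merge summit windows and keep regions supported by enough replicates.

    replicate_summits: one list of (chrom, summit_position) per replicate.
    Returns a list of (chrom, start, end) regions.
    """
    # Extend each summit by +/- flank and remember its replicate of origin.
    windows = sorted(
        (chrom, max(0, summit - flank), summit + flank, rep)
        for rep, summits in enumerate(replicate_summits)
        for chrom, summit in summits
    )
    merged, current = [], None  # current = [chrom, start, end, replicate set]
    for chrom, start, end, rep in windows:
        if current and chrom == current[0] and start - current[2] <= max_gap:
            current[2] = max(current[2], end)
            current[3].add(rep)
        else:
            if current and len(current[3]) >= min_reps:
                merged.append(tuple(current[:3]))
            current = [chrom, start, end, {rep}]
    if current and len(current[3]) >= min_reps:
        merged.append(tuple(current[:3]))
    return merged


def consistent_status(statuses, min_frac=0.8):
    """Return the epigenetic status seen in at least min_frac of samples."""
    status, count = Counter(statuses).most_common(1)[0]
    return status if count / len(statuses) >= min_frac else None


reps = [[("2L", 1000), ("2L", 5000)], [("2L", 1050)], [("2L", 9000)]]
peaks = consensus_peaks(reps)
status = consistent_status(["active"] * 4 + ["silent"])
```

Because nearby windows are chained transitively, this sketch can join peaks that a pairwise overlap test would keep separate, so it should be read as an approximation of the rule described above.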
GO clustering analysis
The Gene Ontology (GO) clustering analysis in the biological process (BP) and molecular function (MF) categories was performed using the DAVID bioinformatics online tool50. Names of the annotation clusters were manually curated based on the cluster’s GO terms. Only clusters with a score >1.3 were considered50.
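The 1.3 threshold can be made concrete with a short Python sketch. It assumes the DAVID definition of the cluster enrichment score as the geometric mean of the member terms' p-values on a minus log10 scale, under which a score of 1.3 corresponds to a geometric mean p-value of about 0.05. The function name and the p-values are hypothetical.

```python
import math


def cluster_enrichment_score(p_values):
    """Minus log10 of the geometric mean of the p-values in a cluster."""
    return -sum(math.log10(p) for p in p_values) / len(p_values)


# Hypothetical clusters of GO term p-values.
borderline = cluster_enrichment_score([0.05, 0.05])
strong = cluster_enrichment_score([0.01, 0.0001])
passes_threshold = strong > 1.3
```

A cluster whose terms all have p = 0.05 scores just above 1.3, which is why this value is commonly used as the cutoff.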
Data availability
RNA-seq and ChIP-seq raw data are available in the NCBI Sequence Read Archive (SRA) database under BioProject PRJNA643665. The set of chimeric transcripts detected is available in GitHub (https://github.com/GonzalezLab/chimerics-transcripts-dmelanogaster). The DrosOmics genome browser36 (http://gonzalezlab.eu/drosomics) compiles all data generated in this work.
Code availability
Scripts to perform analyses are available at GitHub (https://github.com/GonzalezLab/chimerics-transcripts-dmelanogaster).
FUNDING
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (H2020-ERC-2014-CoG-647900), and from grant PID2020-115874GB-I00 funded by MCIN/AEI/10.13039/501100011033.
ACKNOWLEDGMENTS
We thank Carlos Vargas-Chavez and Simón Orozco for providing the roo consensus sequence of D. simulans. We thank Simón Orozco and Ewan Harney for comments on the manuscript.
Footnotes
Marta Coronado-Zamora, marta.coronado{at}ibe.upf-csic.es
Correction to author's names to match those on the pdf file.