ABSTRACT
Nearly half of the human genome is made up of transposable elements (TEs) and evidence supports a possible role for TEs in gene regulation. Here, we have integrated publicly available genomic, epigenetic and transcriptomic data to investigate this potential function in a genome-wide manner. Results show that although most TE classes are primarily associated with reduced gene expression, Alu elements are associated with up-regulated gene expression. This is consistent with our previously published work which showed that intronic Alu elements are capable of generating alternative splice variants in protein-coding genes, and further illustrates how Alu elements can alter protein function or gene expression level. Furthermore, non-coding regions were found to have a high density of TEs within regulatory sequences, most notably in repressors. Our exhaustive analysis of recent datasets has extended and updated our understanding of TEs in terms of their global impact on gene regulation, and indicates a significant association between repetitive elements and gene regulation.
INTRODUCTION
Repetitive elements are similar or identical DNA sequences present in multiple copies throughout the genome. The majority of the repetitive sequences in the human genome are derived from transposable elements (TEs) [1, 2] that can move within the genome, potentially giving rise to mutations or altering genome size and structure. Typical eukaryotic genomes contain millions of copies of TEs and other repetitive sequences. TEs fall into two major classes: those moving/replicating via a copy and paste mechanism and an RNA intermediate (retrotransposons) and those moving via direct cut and paste of their DNA sequences (DNA transposons). Retrotransposons can be subdivided into two groups: those with long terminal repeats (LTRs), and those without LTRs (non-LTRs). Human LTR elements are related to endogenous retroviruses (HERVs), which along with similar elements account for nearly 8% of the human genome [3]. Non-LTR retrotransposons include two sub-types: autonomous long interspersed elements (LINEs) and non-autonomous short interspersed elements (SINEs), which are dependent on autonomous elements for their replication; both LINEs and SINEs are widespread in eukaryotic genomes. LINE-1 (long interspersed element 1) and Alu elements are two TEs that belong to the non-LTR retrotransposons, which account for approximately one-quarter of the human genome [1].
A number of existing studies have shown that TEs can influence host genes by providing novel promoters, splice sites or post-transcriptional modification to re-wire different developmental regulatory and transcriptional networks [4–6]. TEs tend to regulate gene expression through several mechanisms [6–9]. For example, the expression levels of protein coding genes containing repetitive elements are significantly associated with the number of repetitive elements in those genes in rodents [7]. L1 family repeats show a stronger negative correlation with expression levels than the gene length [10], and the presence of L1 sequences within genes can lower transcriptional activity [11]. Moreover, TEs have been shown to influence gene expression through non-coding RNAs, resulting in the reduction or silencing of gene expression [12]. For example, the expression of long intergenic non-coding RNAs (lincRNAs) was strongly correlated with HERVH transcriptional regulatory signals [13]. Past studies have found that TEs have contributed to nearly half of the active regulatory elements of the human genome [14], by altering gene promoters and creating alternative promoters and enhancers to regulate gene activity [15–17]. According to previous research, 60% of TEs in both human and mouse were located in intronic regions and all TE families in human and mouse can exonize, supporting the view that TEs may create new genes and exons by promoting the formation of novel or alternative transcripts [18, 19]. The association between repetitive elements and RNAs has also been investigated. For example, Alu elements in lncRNAs can lead to STAU1 mediated mRNA decay by duplexing with complementary Alu elements in the 3’UTRs of mRNAs [20], and the insertion of TEs may also drive the evolution of lincRNAs and alter their biological functions [13].
In this paper, TEs in the human genome were analyzed using genome-wide datasets associated with gene regulation. These datasets enabled an assessment of the association of TEs with chromatin states, as marked by histone modification within six human cell lines, lincRNAs, Gene Ontology (GO) enrichment, as well as overall transcriptome profiles. Whilst our analysis is limited to a general comparison of repeat families, as opposed to specific repeat elements, we found clear associations between repeat families and gene regulation, both within regulatory regions and in the generation of splice variants.
RESULTS
The distribution of repetitive elements in the human genome
Initially, we compared the distribution of repetitive elements in the human genome. We found many repetitive elements overlapped with gene models from the human RefSeq gene datasets, and their distributions with respect to components of the gene model are shown in Fig 1. Most repetitive elements were found in non-coding intervals such as 5’UTR introns, CDS introns, 3’UTR introns and intergenic regions. Clade-specific repeats were found more often in introns and intergenic regions than in 3’UTR exons or 5’UTR exons (Fig 1).
Fig 1. Gene regions are shown on the x-axis and the y-axis shows the percentages of the genomic regions containing repetitive elements. Human-specific repeats were those annotated with “Homo sapiens” or “primates” as their origin (see Table S6 for the list of human-specific repeat classes). The remaining repetitive regions were categorized as shared repeats.
Role of transposable elements in gene regulation by chromatin states
Based on previous studies, TE-derived sequences can provide transcription factor binding sites, promoters and enhancers, and insulators/silencers [5, 21, 22]. To look for enrichment of TEs within regulatory elements, we looked at the proportions of nucleotides with a TE in each of the six defined regulatory elements [23] as they appear in different components of the gene model. This represents the probability of a given nucleotide within a regulatory element (RE) being from a transposable element (TE), i.e. p(TE|RE) (see Methods for details) across the set of genic regions (Fig 2A). Confidence intervals for the pairwise differences in p(TE|RE) are shown in Fig 2B, and these reveal that for all regulatory elements, TEs are more sparsely distributed across regulatory elements within CDS exons than across regulatory elements in all other genic regions. Likewise, regulatory elements in the 3’UTR were more sparsely populated with TEs in comparison to those in other regions, with the sole exception of Active Promoters in the 5’UTR. Conversely, Intergenic Polycomb Repressed Regions were enriched for TEs in comparison to these elements in other components of gene models. Finally, Intergenic Insulators were also found to be enriched for TEs in comparison to Insulators in all other components of gene models, except for those in 5’UTRs.
Fig 2. A) Probability estimates are those of an individual base within each region being part of a Regulatory Element, i.e. p(RE), or a Transposable Element within a Regulatory Element, i.e. p(TE|RE). Error bars indicate ±1 Std. Error as calculated on the logit-transformed values. B) 1 − α Confidence Intervals for the difference between logit-transformed probabilities p(TE|RE), adjusted for multiple comparisons at the level α = 0.05/6 within each RE (60). Intervals highlighted in red are those that do not contain zero and are indicative of a significant difference between the two values.
Different classes of transposable elements and their associations with chromatin state
In order to systematically characterize the role of different repeat classes within the defined regulatory elements, the distribution of regulatory elements within specific classes of TEs was investigated, using the estimates of p(RE|TE) (see Methods). Out of all six TE classes investigated (Alu, L1, L2, LTR, MIR and DNA), L1 elements were consistently found with the lowest probability of a nucleotide also belonging to a regulatory element (Fig 3A). It was also clear that TEs had the highest probability of being exapted as Weak Enhancers and Polycomb Repressed Regions compared to the other regulatory elements. Confidence intervals were used to perform pairwise comparisons on the probability of containing an RE for each TE type. No difference was found between ancestral and recent L1 elements for any RE (Fig 3B), and L1s were confirmed as containing a significantly lower proportion of their content as an RE in comparison to all other TEs. The notable exceptions to this were Polycomb Repressed Regions, where little difference was found in their rate of occurrence between any TE types, beyond comparative enrichment in MIR elements compared to L1s. MIR elements were more likely to contain a Strong Enhancer than all other elements, except L2 and DNA elements. DNA elements were also more likely to contain an Insulator than Alu elements, as well as the previously mentioned L1 elements.
Fig 3. A) Probability estimates are for an individual base within each type of element belonging to each of the regulatory elements, i.e. p(RE|TE). Error bars indicate ±1 Std. Error as calculated on the logit-transformed values. B) 1 − α Confidence Intervals for the difference between logit-transformed probabilities p(RE|TE), adjusted for multiple comparisons at the level α = 0.05/6 within each RE (60). Intervals highlighted in red are those that do not contain zero and are indicative of a significant difference between the two values.
Are Regulatory Elements containing TEs abundantly present in long intergenic non-coding RNAs?
TEs are a source of endogenous small RNAs in animals and plants, and endogenous small RNAs are considered to be functionally significant in gene regulation [24]. Furthermore, it is well known that many Alu elements have inserted into long non-coding RNAs and mRNAs, which can cause mRNA decay via short imperfect base-pairing [25]. We expanded this to see whether different classes of TEs had any significant associations with non-coding RNA, especially lincRNAs.
Unsurprisingly, we found that TEs consistently made up a lower proportion of nucleotides in CDS-exons across all regulatory elements, when compared to CDS-introns, lincRNA exons and lincRNA introns (Fig 4). An additional enrichment for TEs in Active Promoters within lincRNA introns was also observed in comparison to all other regions investigated in this stage of the analysis. Weak Promoters in both lincRNA introns and CDS introns also showed TE enrichment compared to both types of exonic regions. The observation that >30% of nucleotides from many of the regulatory elements were derived from TEs was also quite striking. In particular, the observation that lincRNA exonic regions contained the highest RE density for Polycomb Repressed Regions (Fig 4), with nearly a third of these nucleotides being derived from TEs, suggests that the presence of transposable elements in lincRNA exons may be strongly linked to gene regulation.
Fig 4. A) Probability estimates for an individual base within each type of non-coding region being part of a Regulatory Element p(RE) or a Transposable Element within each Regulatory Element p(TE|RE). Error bars indicate ±1 Std. Error as calculated on the logit-transformed values. B) 1 − α Confidence Intervals for the difference between logit-transformed probabilities p(TE|RE), adjusted for multiple comparisons at the level α = 0.05/6 within each RE (60). Intervals highlighted in red show significant pairwise differences (confidence intervals do not cross the 0 difference value).
Associations of TEs with gene model features and gene expression
Next, we summarized the overall distribution of transposable elements within various components of the gene model, by finding genes containing TEs across single or multiple components (Fig S1, Table 1), and genes containing one or more types of TEs (Fig S2, Table 2). We further examined the relationship between gene length and which components of a gene contain a TE (Fig S3), as well as the relationship between gene length and the presence of a specific type of TE (Fig S4), using a Wilcoxon Test (Tables S2 & S3) in both cases. We found that only genes with TEs in the 3’UTR or within multiple genic regions showed a bias towards longer length, whilst for TEs exclusively within the proximal promoter or 5’UTR there was a bias towards shorter genes (Fig S3; Table S2). When assessing the relationship between gene length and the presence of a specific TE class, the lengths of genes with Alu, L2 or MIR elements alone were very similar to those of genes with no TE, whilst L1 and LTR elements showed a bias towards shorter genes, and the presence of multiple elements showed a bias towards longer genes (Fig S4; Table S3).
Table 1. Total counts of elements within each genomic region, along with the number of genes with Transposable Elements in one region only.
Table 2. Total counts of each TE element, along with how many are found in isolation, i.e. in genes with no other elements.
Effects on the probability of a gene being detected as expressed due to the presence of a TE across the different components of the gene model
As chromatin states are not always indicative of changes in transcriptional activity, we investigated any effects on human gene expression due to the presence of specific TE classes within each of the four regulatory regions, i.e., Proximal Promoter, 5’UTR, CDS and 3’UTR. However, as TEs are far less frequent in CDS regulatory regions with the vast majority co-occurring with other TEs (Figure S1), the subsequent analysis instead focused on the other three regions. Six human tissue transcriptome datasets (adipose, brain, kidney, liver, skeletal muscle and testes tissue) were selected from the Illumina BodyMap2 dataset for this analysis, and global patterns of gene expression were investigated based on the presence or absence of each TE within each of these three genic regions.
The weighted bootstrap method was applied to both the probability of a gene being detected as expressed (Fig 5A), and to the overall expression levels for those genes detected as expressed (Fig 5B). This revealed that Alu elements are commonly associated with a higher probability of expression when located in either the 5’UTR or the 3’UTR across the majority of tissues. In contrast to the presence of an Alu, the presence of L1 elements in the Proximal Promoter showed a negative impact on the probability of a gene being detected as expressed in 3 of the 6 tissues, with the remaining tissues being directionally consistent and quite likely to be Type II errors (Supplementary Fig S6A).
Fig 5. A) Confidence Intervals for the difference in the probability of a gene being detected as expressed due to the presence of each TE in each genic region. B) Confidence Intervals for the difference in mean log2(TPM) counts. For both A) and B), Confidence Intervals were obtained using the weighted bootstrap and are 1 – α/m intervals, where α = 0.05 and m = 90 as the total number of intervals presented. Dots represent the median value from the bootstrap procedure, whilst the vertical line indicates zero. Intervals which do not contain zero are coloured red, and indicate a rejection of the null hypothesis, H0:Δθ = 0, where θ represents the parameter of interest.
Effects on the levels of gene expression due to the presence of a TE in each component of the gene model
Again using the weighted bootstrap approach to minimize any influence of co-occurring elements, the presence of an Alu in the 5’UTR was found to be associated with increased expression levels in five of the six tissues investigated (Fig 5B). Similarly, Alu elements in the Proximal Promoter were associated with increased expression in two of the tissues. Alu elements in 3’UTR were associated with elevated expression levels in the Kidney sample only. The presence of ion elements showed varying degrees of reduced gene expression across the tissues when located in the 3’UTR only. It was also noted that whilst strongly controlling the family-wise Type-I error rate (FWER), the adjusted confidence intervals will result in an increase in the Type-II error rate where true differences are not able to be detected. As such, the point at which the confidence intervals would include zero was found and taken as a proxy for the p-value. Confidence intervals based on these p values to an FDR of 0.05 are shown in Supplementary Figure S6 with the p values given in Supplementary Table S4. It is clear from this additional approach that the role of TEs such as L1 elements in Proximal Promoters and 3’UTRs, LTR elements in 5’UTRs and many of the elements in the 3’UTR may have been considerably understated in this more conservative approach.
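The step from bootstrap intervals to FDR-adjusted values can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the analysis: the p-value proxy is taken as twice the smaller tail proportion of bootstrap differences on either side of zero (the level at which a two-sided percentile interval first touches zero), and the Benjamini-Hochberg procedure is assumed for the FDR adjustment, which the text does not name. Both function names are hypothetical.

```python
def bootstrap_p_proxy(boot_values):
    """Proxy p-value: the level at which a two-sided percentile interval
    built from the bootstrap differences would first include zero.
    (Assumed definition; with finitely many iterations it can be 0.)"""
    n = len(boot_values)
    below = sum(v <= 0 for v in boot_values) / n
    above = sum(v >= 0 for v in boot_values) / n
    return min(1.0, 2.0 * min(below, above))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.
    (Assumed adjustment method; the text only states an FDR of 0.05.)"""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: four bootstrap differences, one of them below zero.
p_toy = bootstrap_p_proxy([0.2, 0.5, -0.1, 0.3])
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

An interval would then be flagged at an FDR of 0.05 when its adjusted value falls below 0.05.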
Analysis of genes with exapted or exonized TEs
TEs may influence gene expression in different ways. To evaluate the possible functional effects of repetitive elements in the human genome, the six primary repeat classes were mapped to the human genome (http://www.repeatmasker.org). Genic regions (annotated using Gene Ontology) that overlapped with TEs were analyzed to assess the association of TEs with different gene functions.
The three fundamental GO categories are: cellular component, molecular function and biological process. Enrichment information for each GO category is listed in Supplementary Table S5. We discovered that for the biological process category (Fig 6, Table S5a), the predominant types of annotation were related to regulatory processes involving metabolic processes. This was consistent with the annotations for the cellular component terms, which were predominantly for intracellular/cytoplasmic structures (Fig S7, Table S5b). The molecular function terms had functions mainly associated with binding (Fig S8, Table S5c). Using this same method, we also found that genes with protein coding exons containing Alus were enriched for the GO term “intracellular non-membrane-bounded organelle”. Interestingly, these exonization/exaptation events were found associated with splice variants when incorporating Alu sequences (Table S6 & Fig S9, S10).
Fig 6. Enrichment of GO terms of genes containing TEs in “Biological Process”. Genes containing different types of repetitive elements in the proximal promoter regions are labeled as “Promoter with Repeats”, and genes containing repetitive elements in UTR regions are labeled as “5/3UTR with Repeats”. “Combined Repeats” denotes the combined data from the three regions mentioned above. The darker the color, the greater the GO term enrichment as determined by FDR.
Moreover, according to our analysis of TEs and alternative splicing data, we found that 2.98% of alternatively spliced transcripts contained TEs within protein coding exons (Table 3). Alu and MIR were more likely to be involved in alternative splicing and exonization, which is consistent with previous studies showing that exonization of SINEs occurred in primates [26]. Based on our study, LINEs may also have contributed to these splice variant activities. This shows that exonization of TEs could potentially increase the coding and regulatory versatility of the transcriptome.
Table 3. The number of repeats in protein coding regions (CDS-exon) with alternative splicing. Repeats were counted only if they overlapped CDS-exon regions by at least 25 bp.
DISCUSSION
In this work, we have primarily analyzed the distribution of various classes of transposable elements, and their association with regulatory elements (active chromatin) and gene expression in the human genome. Based on the analysis of the TE distributions in genic regions and corresponding gene expression patterns, the presence of some TEs was found to be associated with changes in gene expression. Further, gene function as defined by GO term analysis differed depending on the TE insertion site within the gene. Finally, we looked at TEs present in ncRNAs, specifically lincRNAs, and found that repetitive elements were present at higher levels in lincRNAs than coding exons.
Considering the association between the location of TEs in genes, we found that genes had a greater proportion of sequence originating from TEs in the 5’ and 3’UTRs compared to coding exons (Fig 2). This is not surprising considering the potential adverse effect of TE insertion in a protein coding sequence, but it is also relevant with respect to the known regulatory functions within the UTRs [27, 28]. The repeat content for 5’UTR introns was comparable to other types of introns, but this may be significant in the context of transcriptional repression, where genes with shorter 5’ UTR introns are expressed at higher levels [29, 30].
Furthermore, the presence of TEs in genomic regions that can be epigenetically modified to regulate transcription through active chromatin [31], was consistent with our findings that TEs have a potential role as regulators of gene expression. From our results (Fig 2), we found that some functional regions of active chromatin contained higher percentages of TEs, especially Polycomb Repressed Regions and Weak Enhancers. This fits with the existing theory of epigenetic silencing of TEs [31], since TEs in Polycomb Repressed Regions would inevitably be silenced, and was also consistent with the high repeat content of 5’UTRs, which are also known to regulate gene expression [32]. Furthermore, it has been shown that TEs in 3’UTRs are associated with lower transcript abundance [33], and with the clear exception of Alu elements, we have presented further evidence of this. This suggests that exaptation of repetitive elements into regulatory regions is most often associated with repression of gene expression. The general theme of TEs having a role in transcriptional repression was further supported with lincRNAs, which are known to regulate gene expression through epigenetic mechanisms [34] and competition with transcription factors [25]. In our analysis, lincRNA exons were found to be clearly enriched for Polycomb Repressed Regions whilst the abundance of TEs within these regions was relatively consistent with both CDS and lincRNA introns (Fig 4). However, this overall enrichment for repressed regions is consistent with previous research that lincRNAs containing TEs can reduce gene expression in many tissues and cell lines [13]. As different repeat classes were also found to be present at different levels in active chromatin or in specific regulatory regions, such as Polycomb Repressed Regions (Fig 3), this suggests a function with respect to gene expression.
Figures 5B and S6B summarize our findings with regard to gene expression, and indicate that Alu elements in 3’UTR, 5’UTR and proximal promoter regions are commonly associated with increased gene expression. Taken together with the increased probability of expression due to the presence of an Alu in the 5’UTR and 3’UTR (Fig 5A), these results support previous reports showing TEs such as Alus can be exapted as transcription factor binding sites [35–37], but are in contrast with reports concerning the direction of expression for human genes. We also found genes containing L1 elements were associated with decreased gene expression (Fig 5), and that L1 elements were less prevalent in regulatory elements or active chromatin, when compared to other repeat classes (Fig 3). This makes intuitive sense as most L1 elements in the human genome are 5’ truncated and lack promoter content compared to Alu elements [38]. This is also consistent with a previous study showing that highly and broadly expressed housekeeping genes can be distinguished by their TE content, with these genes being enriched for Alus and depleted for L1s [39]. LTRs were found to be associated with repression of gene expression, which is in contrast to previous work that implicated LTRs as alternative promoters [40]. Anecdotally, it has been shown that an LTR in the first intron of the equine TRPM gene suppresses gene expression by acting as an alternative poly-A site [41], and the insertion of LTRs in introns has been associated with premature termination of transcription [42], supporting the results presented here. L2 and MIR are ancient TE families conserved among mammals, and are regarded as inactive or fossil TE elements [43]. However, these TEs showed a level of association with reduced gene expression when located in 3’UTRs (Fig 5), which is also consistent with a previous finding on their ability to impact the evolution of gene 3’ ends by containing cis-elements for modified polyadenylation [44].
In addition to potentially altering gene expression by insertion into regulatory elements, TEs may also be associated with specific functional characteristics of expressed protein coding genes. When we examined the functional annotation of repeat containing genes, we found that some functions were over-represented (Table S5). Perhaps the most interesting of these associations was that genes with Alu insertions were found to contribute to coding exons through alternative splice variants. One explanation of this observation is that Alu-induced alternative transcripts may result in nonsense mediated decay of alternative transcripts [45]. Two examples of alternatively spliced genes of this type with implications for human disease are DISC1 and NOS3 (Table S6 and Fig S9 & S10). DISC1 alternative transcripts are known to contribute to increased risk of schizophrenia [46, 47] and NOS3 transcript variants are associated with cardiovascular disease phenotypes [48, 49]. Based on previous research, nearly 4% of protein-coding sequences include transposable elements, and one-third of them are Alu insertions [50]. Therefore, Alu exonization in protein-coding genes may play an important role in modifying gene expression.
In conclusion, while there are many publications implicating TEs in the regulation of individual genes, our work clarifies some previous uncertainties and resolves some contradictions, confirming that this role of TEs is significant across the genome. In general, most TEs would appear to be strongly associated with repression of gene expression, either through the 5’UTR or perhaps as components of lincRNA exons. However, the presence of Alus in 3’UTR and proximal promoter regions may act to increase gene expression. These results are consistent with some previous published research [10] and provide a new understanding of how repeats are associated with epigenetic regulation of gene expression. Finally, while exapted TEs may contribute to the generation of transcripts that undergo nonsense mediated decay as part of gene regulation, we speculate that they may also provide an opportunity for alternative splicing and novel exaptation. TEs therefore are important agents of change with respect to the evolution of gene expression networks.
MATERIAL AND METHODS
Theoretical framework and methods
We constructed pipelines to analyze the distribution of repetitive elements in different parts of the human genome. Repetitive elements overlapping with protein coding regions, non-coding regions and regulatory elements were identified. GO term over-representation and expression analyses were carried out for repetitive elements overlapping with protein-coding regions. The pipelines and related materials are described below.
Tools used to develop pipelines for repetitive element analysis
The identification and classification of TEs from the human genome were conducted by developing a pipeline with Perl, R [51], and BEDTools [52]. Perl was used to extract information from different datasets. R was used to build graphs to illustrate the repeat distribution in different genic regions, the identification of repetitive elements with respect to functional elements, GO term over-representation analysis and expression analysis of TEs. BED format file intersection was used to extract the overlapping regions between different datasets, with a lower limit of 25 bp. The UCSC Genome Browser [53, 54] was used to download genome sequence data and genome annotations including RefSeq genes. RSEM [55] was adopted to assemble RNA-Seq reads into transcripts and estimate their abundance (measured as transcripts per million (TPM)). Plots were generated using ggplot2 in R [56].
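The 25 bp lower limit on overlaps can be illustrated with a minimal Python sketch. It is not the BEDTools-based pipeline itself: the function names (overlap_bp, overlapping_pairs), the tuple layout and the example coordinates are assumptions made here for illustration only.

```python
def overlap_bp(a, b):
    """Bases shared by two BED-style intervals (chrom, start, end),
    using 0-based half-open coordinates; 0 if they do not overlap."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def overlapping_pairs(features, repeats, min_bp=25):
    """Return (feature, repeat, overlap) for pairs sharing >= min_bp bases.
    Quadratic in the input size, so suitable for toy data only."""
    hits = []
    for f in features:
        for r in repeats:
            n = overlap_bp(f, r)
            if n >= min_bp:
                hits.append((f, r, n))
    return hits


# Hypothetical coordinates: the first exon shares only 20 bp with a repeat
# and is dropped, the second shares 30 bp and is kept.
exons = [("chr1", 100, 200), ("chr1", 500, 560)]
repeats = [("chr1", 180, 300), ("chr1", 530, 600), ("chr2", 100, 200)]
hits = overlapping_pairs(exons, repeats)
```

For genome-scale data a sorted-interval sweep, as implemented in BEDTools, would replace the nested loop.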
Datasets
Genomes and annotations
NCBI’s Human genome and its annotation datasets (RefSeq hg19) [57] were downloaded from the UCSC Genome Browser [23, 53]. A total of 37,697 human RefSeq transcripts were merged into 18,777 genes by taking the longest transcript(s) that represented each distinct gene locus. Repetitive elements were downloaded from the RepeatMasker (http://www.repeatmasker.org) track of the UCSC Genome Browser. All repetitive sequence intervals were also de-duplicated to deal with potential overlapping repeat annotations. Overall, there were 5,298,130 human repetitive elements which represented approximately 1.467Gb in the human genome.
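The de-duplication of overlapping repeat annotations amounts to merging intervals before counting bases, so that no base is counted twice. The sketch below shows one way this could be done in Python; the function names and example coordinates are hypothetical, and the merging rule (overlapping or directly adjacent intervals are joined) is an assumption about how the de-duplication was performed.

```python
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or adjacent (chrom, start, end) intervals
    (0-based, half-open) into a non-redundant, sorted list."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:          # overlaps or touches current run
                cur_end = max(cur_end, end)
            else:                           # gap: close the current run
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


def total_bp(intervals):
    """Total number of bases covered by already-merged intervals."""
    return sum(end - start for _, start, end in intervals)


# Hypothetical annotations: the first two chr1 intervals overlap by 10 bp.
raw = [("chr1", 10, 50), ("chr1", 40, 80), ("chr1", 100, 120), ("chr2", 5, 15)]
merged = merge_intervals(raw)
covered_bp = total_bp(merged)
```

Summing the raw intervals would give 110 bp here, whereas the merged set covers 100 bp.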
Regulatory element datasets from six human cell lines
The regulatory element datasets from six human cell lines were downloaded from the UCSC Genome Browser. Each cell line dataset contained the annotation of six regulatory elements: 1) Active Promoters, 2) Weak Promoters, 3) Strong Enhancers, 4) Weak Enhancers, 5) Insulators and 6) Polycomb Repressed Regions. These regulatory element annotations were derived from different chromatin states that have been marked by histone methylation, acetylation and histone variants H2AZ, PolIII, and CTCF [23].
Gene expression datasets from six human tissues
Human RNA-seq data from the Illumina bodyMap2 transcriptome (Paired End reads only) (http://www.ebi.ac.uk/ena/data/view/ERP000546) dataset was used to measure the association between TEs and the expression levels of genes containing TEs in six tissues.
The distribution of repetitive elements in the human genome
To assess how human TEs were distributed in genes, we compared different genic regions containing TEs. Based on Repbase [58, 59] annotations identified by RepeatMasker (http://www.repeatmasker.org), repeat elements in human were divided into two categories: human-specific repeats, and repeats shared with different species. Human-specific repeats were those annotated with “Homo sapiens” or “primates” as their origin (See Table S6 for the list of human-specific repeat classes), whilst those remaining were categorized as shared repeats. Intergenic regions as well as the exons and introns within 5’UTR, CDS and 3’UTR regions from RefSeq genes [60] were then compared with these different categories of repeats. Next, we generated the summarised distributions of repetitive elements overlapping these regions by calculating the proportions of bases belonging to repetitive elements within each of the combined sets of regions, i.e.:
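Written out, with symbols introduced here only for illustration (G_r for the set of bases in the combined regions of type r, and T_c for the set of bases annotated as repeats of category c), this proportion takes the form:

```latex
p_{r,c} \;=\; \frac{\lvert G_r \cap T_c \rvert}{\lvert G_r \rvert}
```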
The code repository for the above can be found at https://github.com/UofABioinformaticsHub/RepeatElements.
The occurrence of transposable elements within regulatory regions
We further explored the association between any TE and the regulatory elements defined above, by calculating the proportion of nucleotides within each of the five sets of genic regions (5’UTR, CDS-exon, CDS-intron, 3’UTR and Intergenic) that were part of a regulatory element for each of the six human cell lines. The proportion of nucleotides that were TEs within each regulatory element were also calculated for each genic region. All proportions were subsequently transformed using the logit function for model fitting across tissues (Table S1) using the model
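Based on the term definitions given alongside it, the model corresponds to a two-way layout with interaction, which can be written as below; the residual term ε_ijk is an assumption here, as it is not named in the text:

```latex
y_{ijk} \;=\; \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
```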
where y_ijk is the logit-transformed proportion representing p(TE|RE) for each genic region i, each regulatory element j and tissue k, such that μ is the overall mean, α_i is the effect due to each genic region, and β_j is the effect due to each regulatory element, with (αβ)_ij representing any changes not accounted for in the first two terms. Tests for normality and homoscedasticity were performed using the Shapiro-Wilk test and Levene’s test respectively. Where violations of homoscedasticity were found, robust standard errors were obtained using the sandwich estimator [61]. Confidence Intervals for pairwise comparisons were obtained as implemented in the R package multcomp [62] in order to control the Type I error at α=0.05 across the entire set of comparisons.
A specific TE analysis was then performed using six of the major human TE classes based on the Repbase classification system: Alu, L1, L2, MIR, LTR and DNA. L1 elements were further resolved into either ancestral (L1M, L1PB and L1PA subfamilies) or recent/clade-specific (L1HS subfamily), based on their Repbase annotations. The proportions of nucleotides within each TE type that were also regulatory elements were calculated giving tissue-specific estimates of p(RE|TE). Proportions were again transformed using the logit function, and the same analysis as above was performed.
Quality control and preprocessing of the gene expression data in different human tissues
RNA-Seq reads from six human tissues were first assessed using FastQC (www.bioinformatics.babraham.ac.uk) to check whether the raw RNA-Seq data contained any problems or biases before further analysis. Based on the FastQC results, reads with poor-quality bases were trimmed using Trimmomatic-0.32 [63] with MINLEN set to 26. Table 4 shows the numbers of reads in the raw RNA-Seq datasets and the statistics after the QC process. We then built transcript reference sequences from hg19 human RefSeq genes using rsem-prepare-reference [64]. These references were passed to rsem-calculate-expression [64] with default parameters for all six tissues to obtain TPM-based expression values.
Description of RNA-Seq datasets and QC results. Reads with poor-quality bases were trimmed using Trimmomatic, guided by the FastQC results (with MINLEN set to 26, HEADCROP set to 13 and LEADING set to 15).
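As a reminder of what the TPM values used below represent, the following sketch shows the final normalisation step only. RSEM estimates expected read counts and effective lengths with its own statistical model, so the inputs here are assumed to be given; the function name and the toy numbers are hypothetical.

```python
def tpm(counts, effective_lengths):
    """Transcripts per million from read counts and effective lengths."""
    # Reads per base for each transcript.
    rates = [c / length for c, length in zip(counts, effective_lengths)]
    total = sum(rates)
    # Rescale so that the values sum to one million.
    return [r / total * 1e6 for r in rates]


# Toy example: the third transcript has three times the reads of the first
# but is three times longer, so both receive the same TPM.
values = tpm([100, 200, 300], [1000, 1000, 3000])
print(values)
```

Because TPM values sum to a constant within each sample, they are comparable as proportions of the transcript pool across tissues.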
Proximal promoter regions were defined as the 1,000 bp upstream of the transcription start site of each gene, based on the longest transcript for that gene. Alu, MIR, L1, L2 and LTR repeat regions were then identified within the proximal promoters, 5’UTR, CDS and 3’UTR regions.
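The promoter definition above depends on strand, which the following sketch makes explicit. It assumes 0-based half-open transcript coordinates and measures transcript length as genomic span; both are assumptions for illustration, and the dictionary keys and function names are hypothetical.

```python
def longest_transcript(transcripts):
    """Pick the transcript with the largest genomic span (an assumption;
    the summed exon length would be another reasonable definition)."""
    return max(transcripts, key=lambda t: t["end"] - t["start"])


def proximal_promoter(transcript, upstream=1000):
    """Return the half-open interval covering `upstream` bases 5' of the TSS."""
    if transcript["strand"] == "+":
        # The TSS is the first base of the transcript; upstream means lower coordinates.
        tss = transcript["start"]
        return (max(0, tss - upstream), tss)
    # On the minus strand the TSS is the last base; upstream means higher coordinates.
    tss_end = transcript["end"]
    return (tss_end, tss_end + upstream)


# Toy gene with two made-up transcripts on the minus strand.
gene = [
    {"start": 5000, "end": 9000, "strand": "-"},
    {"start": 6000, "end": 8000, "strand": "-"},
]
print(proximal_promoter(longest_transcript(gene)))
```

The plus-strand interval is clipped at zero so that genes near the start of a chromosome still yield valid coordinates.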
The weighted bootstrap procedure for assessing the effects of a TE in each genic region
Many genes contain multiple transposable elements, with only a minority of genes containing a single TE (Figure S2). In order to assess any effects on transcription due to the presence of a single TE, a weighted bootstrap approach was devised. For a given TE within each genic region within each individual tissue, the frequencies of co-occurring TEs and combinations of TEs were noted. Uniform sampling probabilities were then used for the set of genes containing a specific TE in a specific region, whilst sampling weights were assigned to genes lacking the specific TE based on TE composition, such that the TE content of the sampled set of reference genes matched that of the test set of genes, based on the defined categories. Gene length was divided into 10 bins and these were included as an additional category when defining sampling weights. This ensured that two gene sets were obtained for each bootstrap iteration, which were matched in length and TE composition, with the sole difference being the presence of the specific TE within each specific genic region (Figure S5). The mean difference in expression level, as measured by log(TPM), and the difference in the proportions of genes detected as expressed were then used as the variables of interest in the bootstrap procedure. The bootstrap was performed on sets of 1,000 genes for 10,000 iterations using the proximal promoter as defined above, along with 5’UTRs and 3’UTRs. When comparing expression levels, genes with zero read counts were omitted prior to bootstrapping. To account for multiple testing, confidence intervals were obtained across the m = 90 tests at the level 1 − α/m, which is equivalent to the Bonferroni correction, giving confidence intervals which controlled the FWER at the level α = 0.05.
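The sampling scheme described above can be sketched as follows. This is a simplified illustration in Python, not the authors' implementation. It assumes each gene has already been assigned a single category label combining co-occurring TE content and length bin, and it uses percentile intervals, which is an assumption about how the confidence intervals were formed. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def reference_weights(test_cats, ref_cats):
    """Weight each reference gene so that the sampled reference set has,
    in expectation, the same category composition as the test set."""
    test_freq = Counter(test_cats)
    ref_count = Counter(ref_cats)
    n_test = len(test_cats)
    return [test_freq[c] / n_test / ref_count[c] for c in ref_cats]


def weighted_bootstrap(test, ref, n_genes=1000, n_iter=10000,
                       alpha=0.05, m=90, seed=1):
    """Bootstrap the mean difference (test minus reference) and return a
    percentile interval at the Bonferroni-adjusted level 1 - alpha/m."""
    rng = random.Random(seed)
    test_vals = [v for _, v in test]
    ref_vals = [v for _, v in ref]
    weights = reference_weights([c for c, _ in test], [c for c, _ in ref])
    diffs = []
    for _ in range(n_iter):
        t = rng.choices(test_vals, k=n_genes)                   # uniform sampling
        r = rng.choices(ref_vals, weights=weights, k=n_genes)   # matched sampling
        diffs.append(sum(t) / n_genes - sum(r) / n_genes)
    diffs.sort()
    tail = alpha / m / 2
    lower = diffs[int(tail * n_iter)]
    upper = diffs[min(n_iter - 1, int((1 - tail) * n_iter))]
    return lower, upper, diffs


# Toy data as (category, log expression) pairs. Genes with the TE are mostly
# "long" and genes without it are mostly "short", so an unmatched comparison
# would confound the TE effect with gene length.
test = [("long", 3.0)] * 80 + [("short", 1.0)] * 20
ref = [("long", 2.5)] * 20 + [("short", 0.5)] * 80
lower, upper, diffs = weighted_bootstrap(test, ref, n_genes=200, n_iter=2000)
print(lower, upper)
```

In this toy example the unmatched difference in means is 1.7, whereas the category-matched difference is 0.5, which is the quantity the weighted sampling is designed to estimate.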
Approximate two-sided p-values were also calculated by finding the point at which each confidence interval crossed zero, and additional significance was determined by estimating the FDR on these sets of p-values using the Benjamini-Hochberg method.
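A minimal sketch of these two steps is given below. It assumes percentile bootstrap intervals, in which case the level at which the interval first touches zero corresponds to twice the smaller tail fraction of bootstrap differences on either side of zero. The floor of 1/n on the p-value is a choice made here for illustration, and the function names are hypothetical.

```python
def bootstrap_p_value(diffs):
    """Approximate two-sided p-value from bootstrap differences."""
    n = len(diffs)
    below = sum(d <= 0 for d in diffs) / n
    above = sum(d >= 0 for d in diffs) / n
    p = 2 * min(below, above)
    # A finite bootstrap cannot resolve p-values below 1/n.
    return min(1.0, max(p, 1.0 / n))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

Tests whose adjusted value falls below the chosen FDR threshold would then be reported as significant in addition to those identified by the Bonferroni-adjusted intervals.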
Long intergenic non-coding RNAs and TEs
Annotations for 8,196 previously described putative human lincRNAs were downloaded [65] and the distribution of TEs within regulatory elements in lincRNA exons and introns was obtained using the same methods as above. The previously described regression models were then used to analyse this dataset.
Association of functional elements with human repetitive elements
To demonstrate the potential functional significance of repetitive elements, the Database for Annotation, Visualization and Integrated Discovery (DAVID) [66] was used to perform the GO classification. We first extracted Gene-IDs from overlapping regions between different gene categories (1,000 bp proximal promoter, 5’UTR, 3’UTR, and the combination of these 3 regions) and TEs. These gene lists were then submitted to the DAVID Functional Classification Tool. We chose the third level of GO terms to describe the over-represented functional terms for the three datasets and visualized the functional over-representation of overlapped genes using the heatmap.2 function from the R package gplots. The p-value was used in the GO analysis as the standard index of the degree of enrichment. The threshold for over-represented GO terms was set to an FDR (Benjamini-Hochberg method) of less than 0.05. Protein-coding genes with Alus were also visualised with the UCSC genome browser (http://genome.ucsc.edu/) to compare their mRNA with various gene datasets and annotations.
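For readers unfamiliar with over-representation testing, the sketch below computes a plain hypergeometric upper-tail probability for a single GO term. DAVID reports its own enrichment statistic, which may differ from this calculation, so the code illustrates the general idea only; the function name and the numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from a
    background of N genes, K of which carry the GO term."""
    numerator = sum(comb(K, x) * comb(N - K, n - x)
                    for x in range(k, min(n, K) + 1))
    return numerator / comb(N, n)


# Toy example: a background of 10 genes, 5 of them annotated with the term,
# and a submitted list of 5 genes that are all annotated.
p = hypergeom_upper_tail(k=5, n=5, K=5, N=10)
print(p)
```

Small values indicate that the term occurs in the submitted gene list more often than expected from the background, and such p-values are the input to the FDR threshold described above.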
Association of alternative splicing and protein coding regions containing TEs
In order to assess the relationship between transposable elements and exonization, an alternative splicing annotation dataset (SIB Alt-splicing) was downloaded from the UCSC Genome Browser (http://genome.ucsc.edu/). These data were generated from RefSeq genes, GenBank RNAs and ESTs that aligned to the human genome. A total of 46,973 alternatively spliced transcripts were intersected with gene models containing transposable elements.
AUTHOR CONTRIBUTION
DLA and CCW conceived, designed and managed the study. LZ collected the datasets, implemented the analysis pipeline, and analyzed the data. SMP analyzed the data. ZPQ, DFC, ZQH prepared datasets. LZ, DLA and CCW wrote and revised the manuscript. All authors reviewed and approved the final manuscript.
CONFLICT OF INTEREST
The author(s) declare that they have no competing interests.
ACKNOWLEDGEMENT
The authors wish to thank Dan Kortschak, Atma Ivancevic, Joy Raison, Reuben Buckley and Sim Lin Lim for valuable discussions and critical reading of drafts.