ABSTRACT
Variation in the presence or absence of transposable element (TE) insertions is a significant source of inter-individual genetic variation. Here, we identified 23,095 TE presence/absence variants between 216 wild Arabidopsis accessions. Most variants are rare; however, over two thirds of the common alleles identified were not in linkage disequilibrium with nearby SNPs, implicating TE variants as a source of missing heritability. Some TE variants were associated with altered expression of nearby genes, with potential functional consequences including decreased pathogen resistance. The majority of inter-accession DNA methylation differences were associated with nearby TE variants, indicating an important role in facilitating epigenomic variation. Examination of TE regulation following crosses between parents with differential TE content revealed that absence of individual TEs from the paternal genome is associated with DNA demethylation within maternal copies of these TEs in the embryo, suggesting parental TE variation may enable activation of previously silent TEs.
INTRODUCTION
Transposable elements (TEs) are mobile genetic elements present in nearly all studied organisms, and make up a large fraction of most eukaryotic genomes. The two classes of TEs are retrotransposons (class I elements), which transpose via an RNA intermediate requiring a reverse transcription reaction, and DNA transposons (class II elements), which transpose via either a cut-paste or, in the case of Helitrons, a rolling circle mechanism with no RNA intermediate (Wicker et al. 2007). TE activity poses significant mutagenic potential because TE insertion may disrupt essential regions of the genome, and so safeguards have evolved in order to suppress this activity. These safeguards include epigenetic transcriptional silencing mechanisms, chiefly involving the methylation of cytosine nucleotides (DNA methylation) to produce 5-methylcytosine (mC), a mark that can signal transcriptional silencing of the methylated locus. In Arabidopsis thaliana (Arabidopsis), DNA methylation occurs in three DNA sequence contexts: mCG, mCHG, and mCHH, where H is any base but G. Establishment of DNA methylation marks can be carried out by two distinct pathways – the RNA-directed DNA methylation pathway guided by 24 nucleotide (nt) small RNAs (smRNAs), and the DDM1/CMT2 pathway (Zemach et al. 2013; Matzke and Mosher 2014). A major function of DNA methylation in Arabidopsis is in the transcriptional silencing of TEs. Loss of DNA methylation due to mutations in genes essential for DNA methylation establishment and maintenance leads to expression of previously silent TEs, and sometimes transposition (Mirouze et al. 2009; Miura et al. 2001; Saze et al. 2003; Lippman et al. 2004; Jeddeloh et al. 1999; Zemach et al. 2013).
TEs are thought to play an important role in evolution, not only because of the disruptive potential of their transposition. The release of transcriptional and post-transcriptional silencing of TEs can lead to bursts of TE activity, quickly generating new genetic diversity on which selection may act (Vitte et al. 2014). TEs may carry regulatory information such as promoters and transcription factor binding sites, and their mobilization may lead to the creation or expansion of gene regulatory networks (Hénaff et al. 2014; Bolger et al. 2014; Ito et al. 2011; Makarevitch et al. 2015). Furthermore, the transposase enzymes required and encoded by TEs have frequently been domesticated and repurposed as endogenous proteins, such as the DAYSLEEPER gene in Arabidopsis, derived from a hAT transposase enzyme (Bundock and Hooykaas 2005). Clearly, the activity of TEs can have widespread and unpredictable effects on the host genome. However, the identification of TE presence/absence variants in genomes has remained difficult. It is challenging to identify the structural variants caused by TE mobilization using current short-read sequencing technologies as these reads are typically mapped to a reference genome, which has the effect of masking structural changes that may be present. However, in terms of the number of base pairs affected, a large fraction of genetic differences between Arabidopsis accessions appear to be due to variation in TE content (Cao et al. 2011). Therefore identification of TE variants is essential in order to develop a more comprehensive understanding of the genetic variation that exists between genomes, and of the consequences of TE movement upon genome and cellular function.
The tools developed previously for identification of novel TE insertion sites have several limitations. They either require a library of active TE sequences, cannot identify TE absence variants, or are not designed with population studies in mind (Thung et al. 2014; Robb et al. 2013; Hénaff et al. 2015). In order to accurately map the locations of TE presence/absence variants with respect to a reference genome, we have developed a novel algorithm, TEPID (Transposable Element Polymorphism IDentification), which is designed for population studies. We tested our algorithm using both simulated and real Arabidopsis sequencing data, finding that TEPID is able to accurately identify TE presence/absence variants with respect to the Col-0 reference genome. We applied our TE analysis method to existing genome resequencing data for 216 different wild Arabidopsis accessions, and identified widespread TE variation amongst these accessions (Schmitz et al. 2013). The majority of these TE variants arose recently in evolutionary times, represent novel genetic variants, and are associated with a variety of epigenomic and transcriptional variation.
RESULTS
Computational identification of TE presence/absence variation
We developed TEPID, an analysis pipeline capable of detecting TE presence/absence variants from paired end DNA sequencing data. TEPID integrates split and discordant read mapping information, read mapping quality, sequencing breakpoints, as well as local variations in sequencing coverage to identify novel TE presence/absence variants with respect to a reference TE annotation (Figure 1A; see methods). After TE variant discovery has been performed, TEPID then includes a second step designed for population studies. This examines each region of the genome where there was a TE insertion identified in any member of the group, and checks for evidence of this insertion in each member of the population. In this way, TEPID leverages TE variant information for a group of related samples to correct false negative calls within the group. This feature sets TEPID apart from previous similar methods for TE variant discovery using short read data (Hénaff et al. 2015). Testing of TEPID using simulated TE variants in the Arabidopsis genome showed that it was able to reliably detect simulated TE variants (Figure 1B). In order to assess further the accuracy of TE variant discovery using TEPID, we compared our predicted TE variants identified in the Landsberg erecta (Ler) accession with the de novo assembly reference genome created using long sequencing reads. Previously published 100 bp paired-end Ler genome resequencing reads (Schneeberger et al. 2011) were first analyzed using TEPID, enabling identification of 446 TE insertions and 758 TE absence variants with respect to the Col-0 reference TE annotation (Supplementary File 1). Reads providing evidence for these variants were then mapped to the Ler reference genome that was generated by de novo assembly using Pacific Biosciences P5-C3 chemistry with a 20 kb insert library (Chin et al. 2013), using the same alignment parameters as was used to map reads to the Col-0 reference genome. 
This resulted in 98.7% of reads being aligned concordantly to the Ler reference, whereas 100% aligned discordantly or as split reads to the Col-0 reference genome (Table 1). To find whether reads mapped to homologous regions in both the Col-0 and Ler reference genomes, we conducted a BLAST search (Camacho et al. 2009) using the DNA sequence between read pair mapping locations in the Ler genome against the Col-0 genome, and found the top BLAST result for 80% of reads providing evidence for TE insertions, and 89% of reads providing evidence for TE absence variants in Ler, to be located within 200 bp of the TE variant reported by TEPID. We conclude that reads providing evidence for TE variants map discordantly or as split reads when mapped to the Col-0 reference genome, but map concordantly to homologous regions of the Ler de novo assembled reference genome, indicating that structural variation is present at the sites identified by TEPID, and that this is resolved in the de novo assembled genome.
(A) Principle of TE variant discovery using split and discordant read mapping positions. Paired end reads are first mapped to the reference genome using Bowtie2 (Langmead and Salzberg 2012). Soft-clipped or unmapped reads are then extracted from the alignment and re-mapped using Yaha, a split read mapper (Faust and Hall 2012). All read alignments are then used by TEPID to discover TE variants relative to the reference genome, in the ‘tepid-discover’ step. When analyzing groups of related samples, these variants can be further refined using the ‘tepid-refine’ step, which examines in more detail the genomic regions where there was a TE variant identified in another sample, and calls the same variant for the sample in question using lower read count thresholds as compared to the ‘tepid-discover’ step, in order to reduce false negative variant calls within a group of related samples. (B) Testing of the TEPID pipeline using simulated TE variants in the Arabidopsis Col-0 genome (TAIR10), for a range of sequencing coverage levels. (C) PCR validation examples for TE presence/absence variant predictions. Accessions that were predicted to contain a TE insertion or TE absence are marked in bold. For the example TE absence variant, two primer sets were used; forward (F) and reverse (R) or internal (I). Accessions with a TE absence will not produce the FI band and produce a shorter FR product, with the change in size matching the size of the deleted TE. For the example TE insertion, one primer set was used, spanning the TE insertion site. A band shift of approximately 200 bp can be seen, corresponding to the size of the inserted TE. (D) Summary table of PCR validation results for TE variants.
Mapping of paired-end reads providing evidence for TE presence/absence variants in Ler to Col-0 and Ler reference genomes.
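The two-step calling logic described above (strict discovery per sample, followed by a population-level refinement with lower read count thresholds) can be illustrated with a minimal Python sketch. This is not the TEPID implementation: the function names, the data layout (a dictionary of supporting read counts per candidate site for each sample), and the thresholds DISCOVER_MIN and REFINE_MIN are hypothetical and were chosen only to show how a relaxed threshold at sites already called in another sample can recover false negatives.

```python
# Hypothetical thresholds for illustration; not the values used by TEPID.
DISCOVER_MIN = 5  # supporting reads needed for a de novo call
REFINE_MIN = 2    # supporting reads needed at a site called in another sample


def discover(read_support, min_reads=DISCOVER_MIN):
    """Call a variant in each sample where read support meets the strict threshold."""
    return {
        sample: {site for site, n in sites.items() if n >= min_reads}
        for sample, sites in read_support.items()
    }


def refine(read_support, calls, min_reads=REFINE_MIN):
    """Re-examine sites called in any other sample, using the lower threshold."""
    refined = {}
    for sample, sites in read_support.items():
        # Union of all sites called in the other samples of the group.
        called_elsewhere = set().union(
            *(c for other, c in calls.items() if other != sample)
        )
        rescued = {s for s in called_elsewhere if sites.get(s, 0) >= min_reads}
        refined[sample] = calls[sample] | rescued
    return refined


# Toy data: supporting read counts per candidate site (made-up values).
read_support = {
    "acc1": {"chr1:1000": 8, "chr2:500": 1},
    "acc2": {"chr1:1000": 3, "chr2:500": 0},
    "acc3": {"chr1:1000": 0, "chr2:500": 6},
}
initial_calls = discover(read_support)
refined_calls = refine(read_support, initial_calls)
```

In this toy example, acc2 has only 3 reads supporting the insertion at chr1:1000, below the discovery threshold, but the variant is rescued in the refinement step because it was confidently called in acc1.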
Abundant TE positional variation among natural Arabidopsis populations
We used TEPID to analyze previously published 100 bp paired-end genome resequencing data for 216 different wild Arabidopsis accessions (Schmitz et al. 2013), and identified 15,007 TE insertions and 8,088 TE absence variants, totalling 23,095 unique TE variants. In most wild accessions we identified 300-500 TE insertions (mean = 378; Figure 2 - figure supplement 1A) and 1,000-1,500 TE absence variants (mean = 1,279; Figure 2 - figure supplement 1B), the majority of which were shared by two or more accessions (Figure 2 - figure supplement 2). PCR validations were performed for a random subset of 10 insertions and 10 absence variants in 14 accessions, and confirmed the high accuracy of TE variant discovery using the TEPID package, with results similar to that observed using simulated data (Figure 1C, D). TE variants were distributed throughout chromosome 1 in a pattern that is similar to the distribution of all Col-0 TEs, and were enriched in the pericentromeric regions (Figure 2A, figure supplement 3). This distribution was also similar to that observed for regions of the genome previously identified as being differentially methylated in all DNA methylation contexts (mCG, mCHG, mCHH) between the wild accessions (population DMRs) (Schmitz et al. 2013). Furthermore, TE variants were depleted within genes and DNase I hypersensitivity sites (Sullivan et al. 2014), while they were enriched in gene flanking regions and within other annotated TEs or pseudogenes (Figure 2B). We did not observe any enrichment of TE insertions within the KNOT ENGAGED ELEMENT (KEE) regions (Figure 2 - figure supplement 4), indicating that these regions may not act as a “TE sink” as has been previously proposed (Grob et al. 2014).
(A) Distribution of identified TE variants on chromosome 1, with distributions of all Col-0 genes, Col-0 TEs, and population DMRs. (B) Frequency of TE variants at different genomic features. (C) Enrichment and depletion of TE variants categorized by TE superfamily compared to the expected frequency due to genomic occurrence. (D) r2 correlation matrices for individual representative young, mid, and old TE variants. (E) Rank order plots for individual representative young, mid, and old TE variants (matching those shown in D). Red line indicates the median r2 value for each rank across SNP-based values. Blue line indicates r2 values for TE-SNP comparisons. (F) Histogram of the number of TE r2 ranks (0-600) that are above the SNP-based median r2 value for testable TE variants. TE variants with <200 ranks over the SNP median were classified as “young” elements, as they are not yet linked to the surrounding SNPs. TE variants with 200-400 ranks over the SNP median were classified as “mid” aged. TE variants with >400 ranks over the SNP median were classified as “old” elements, as they were linked to the surrounding SNPs. (G) Chromosomal distribution of TE variants by age.
Among the identified TE variants, several TE superfamilies were over‐ or under-represented compared to the number expected by chance given the overall genomic frequency of different TE types (Figure 2C; Supplementary Table 1). In particular, TE variants in the RC/Helitron superfamily were less numerous than expected, with an 11.5% depletion of RC/Helitron elements in the set of TE variants. TEs belonging to the LTR/Gypsy superfamily were more variable than expected, with a 7.5% enrichment in the set of TE variants. This was unlikely to be due to a differing ability of our detection methods to identify TE variants of different lengths, as the TE variants identified had a similar distribution of lengths as all Arabidopsis TEs annotated in the Col-0 reference genome (Figure 2 - figure supplement 5). These enrichments suggest that the RC/Helitron TEs have been relatively dormant in recent evolutionary history, while the LTR/Gypsy TEs have been more active. At the family level, we observed similar patterns of TE variant enrichment or depletion (Figure 2 - figure supplement 6; Supplementary Table 2). As expected, this TE diversity tended to reflect the known genetic relationships between accessions. We further examined Arabidopsis (Col-0) DNA sequencing data from a transgenerational stress experiment to investigate the possible minimum number of generations required for TE variants to arise (Jiang et al. 2014). We identified a single potential TE insertion in a sample following 10 generations of single-seed descent under high salinity stress conditions, and no TE variants in the control single-seed descent set. However, without experimental validation it remains unclear if this represents a true variant. Therefore, we conclude that TE variants likely arise at a rate less than 1 insertion in 30 generations.
Although thousands of TE variants were identified, it is important to classify their relationship to other commonly identified sources of genetic variation such as single nucleotide polymorphisms (SNPs). This comparison can determine if TE variants are often in linkage disequilibrium with nearby SNPs, or if they are a previously unassessed source of genetic variation between accessions unlinked to the underlying SNP-based haplotypes. To investigate this relationship, SNPs previously identified between the wild accessions (Schmitz et al. 2013) were compared to the presence/absence of individual TE variants. For the 7,300 testable TE variants in the sample set with a minor allele frequency above 3%, the nearest 300 flanking SNPs upstream and downstream of the TE insertion site were analyzed for linkage disequilibrium (LD, r2; Figure 2D-F; see methods). TE variants were classified as being either ‘young’, ‘mid’, or ‘old’ variants by comparing ranked r2 values to flanking SNPs against the median ranked r2 value for all SNP-SNP comparisons, to account for regional variation in LD (Figure 2D, E). This analysis identified TE variants that were unlinked (young) or linked (old) to SNPs forming local haplotypes, or were in an intermediate linkage state (mid) with their surrounding haplotype. The majority (61%) of testable TE variants were largely unlinked to nearby SNPs, and are therefore predicted to be young variants present in a few divergent accessions (Figure 2F). In contrast, 29% of TE variants displayed high levels of linkage with nearby SNPs, and are likely old insertions. TE variants displayed a similar chromosomal distribution regardless of age classification (Figure 2G). Overall, this analysis revealed an abundance of previously unexplored genetic variation that exists amongst Arabidopsis accessions caused by the presence or absence of TEs, and illustrates the importance of identifying TE variants to capture missing heritability alongside other genetic diversity such as SNPs.
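The rank-based age classification can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the analysis code used in the study: genotypes are assumed to be 0/1 vectors across accessions, r2 is taken as the squared Pearson correlation, and the function names are hypothetical. The young/mid/old cut-offs of 200 and 400 ranks (out of 600 flanking SNPs) follow the Figure 2F legend.

```python
from statistics import median


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1 per accession)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # monomorphic site: treat as unlinked (an assumption)
    return cov * cov / (vx * vy)


def ranks_above_median(te, snps):
    """Count ranks at which the TE-SNP r2 exceeds the median SNP-SNP r2 at that rank."""
    te_ranked = sorted((r_squared(te, s) for s in snps), reverse=True)
    snp_ranked = [
        sorted((r_squared(s, t) for j, t in enumerate(snps) if j != i), reverse=True)
        for i, s in enumerate(snps)
    ]
    count = 0
    for rank, te_r2 in enumerate(te_ranked):
        # Each SNP is compared with one fewer partner than the TE, so the
        # last rank may have no SNP-based values and is skipped.
        values = [r[rank] for r in snp_ranked if rank < len(r)]
        if values and te_r2 > median(values):
            count += 1
    return count


def classify_age(n_above, young_max=200, old_min=400):
    """Classify a TE variant from its number of ranks above the SNP median."""
    if n_above < young_max:
        return "young"
    if n_above > old_min:
        return "old"
    return "mid"
```

Comparing against the median SNP-SNP value at each rank, rather than against a fixed r2 cut-off, is what allows the classification to account for regional variation in LD.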
TE variants affect gene expression
To determine whether the TE variants identified affected nearby gene expression, we compared the steady state transcript abundance within mature leaf tissue, between accessions with and without TE insertions, for genes with TE variants located in the 2 kb gene upstream region, 5’ UTR, exon, intron, 3’ UTR or 2 kb downstream region (Figure 3). While the steady state transcript abundance of most genes appeared to be unaffected by the presence of a TE, 45 genes displayed significant differences in transcript abundance linked with the presence of a TE variant, indicating a functional role for these variants in the local regulation of gene expression (q-value < 0.001 with greater than 2-fold change in transcript abundance; Figure 3, Table 2). We did not find any functional category enrichments in this set of differentially expressed genes. It should be noted that rare TE variants, with a minor allele frequency less than 3%, may also be associated with a difference in transcript abundance, but could not be statistically tested due to their rarity. Future studies using larger sample sizes may be able to further examine the frequency at which TE variants impact gene expression.
Volcano plots showing transcript abundance differences for genes associated with TE insertion variants at different positions, indicated in the plot titles. Genes with significantly different transcript abundance in accessions with a TE insertion compared to accessions without a TE insertion are colored blue (lower transcript abundance in accessions containing TE insertion) or red (higher transcript abundance in accessions containing TE insertion). Vertical lines indicate ±2 fold change in FPKM.
Significantly differentially expressed genes dependent on TE presence/absence.
As both increases and decreases in transcript abundance of nearby genes were observed for TE variants within each gene feature, it appears to be difficult to predict the impact a TE variant may have on nearby gene expression. However, gene-level transcript abundance measurements may fail to identify the potential positional effect of TE variants upon transcription. To more closely examine changes in transcript abundance associated with TE variants among the accessions, we inspected a subset of TE variant sites and identified TE variants that appear to have an impact on transcriptional patterns beyond changes in total transcript abundance. For example, the presence of a TE within an exon of AtRLP18 (AT2G15040) was associated with truncation of the transcripts at the TE insertion site in accessions possessing the TE variant, as well as silencing of a downstream gene encoding a leucine-rich repeat protein (AT2G15042) (Figure 4A-D). Both genes have significantly lower transcript abundance in accessions containing the TE insertion (p < 5.8 × 10−10, Mann-Whitney U test; Figure 4C). AtRLP18 is reported to be involved in bacterial resistance, with the disruption of this gene by T-DNA insertion mediated mutagenesis resulting in increased susceptibility to the bacterial plant pathogen Pseudomonas syringae (Wang et al. 2008). We examined pathogen resistance phenotype data (Aranzana et al. 2005) for accessions with and without a TE insertion in the AtRLP18 exon, and found that accessions containing the TE insertion were sensitive to infection by Pseudomonas syringae transformed with avrPpH3 genes at a much higher frequency than accessions without the insertion (Figure 4E). This suggests that the wild accessions identified here to contain a TE insertion within AtRLP18 may have an increased susceptibility to certain bacterial pathogens.
(A) Genome browser representation of RNA-seq data for genes AtRLP18 (AT2G15040) and a leucine-rich repeat family protein (AT2G15042) for all accessions containing a TE insertion within the exon of the gene AtRLP18, and for a random subset of accessions not containing the TE insertion within the exon of AtRLP18. (B) Magnified view of the TE insertion variant within the AtRLP18 exon. (C) Boxplots comparing transcript abundance of AtRLP18 and AT2G15042 in accessions with and without the TE insertion in the AtRLP18 exon. Asterisk indicates statistical significance (p ≤ 5.8 × 10−10; Mann-Whitney U test). Boxes represent the interquartile range (IQR) from quartile 1 to quartile 3. Boxplot upper whiskers represent the maximum value, or the upper value of the quartile 3 plus 1.5 times the IQR (whichever is smaller). Boxplot lower whiskers represent the minimum value, or the lower value of the quartile 1 minus 1.5 times the IQR (whichever is larger). (D) Heatmap showing AtRLP18 and AT2G15042 RNA-seq FPKM values for all accessions. (E) Percentage of accessions with resistance to Pseudomonas syringae transformed with different avr genes, for accessions containing or not containing a TE insertion in AtRLP18. (F) Genome browser representation of RNA-seq data for a PPR protein-encoding gene (AT2G01360) and QPT (AT2G01350), showing transcript abundance for these genes in accessions containing a TE insertion variant in the upstream region of these genes. (G) Heatmap representation of RNA-seq FPKM values for QPT and a gene encoding a PPR protein (AT2G01360), for all accessions. Note that scales are different for the two heatmaps, due to the higher transcript abundance of QPT compared to AT2G01360. Scale maximum for AT2G01350 is 3.1 × 10^5, and for AT2G01360 is 5.9 × 10^4. (H) Boxplots showing RNA-seq FPKM differences for QPT and AT2G01360 associated with presence/absence of a TE variant in the gene upstream region.
Asterisk indicates statistical significance (p ≤ 1.8 × 10−7, Mann-Whitney U test). Boxplots were constructed as for C.
We also observed some TE variants associated with increased expression of nearby genes. For example, presence of a TE within the upstream region of a gene encoding a pentatricopeptide repeat (PPR) protein (AT2G01360) was associated with higher steady state transcript abundance of this gene (Figure 4F-H). Interestingly, transcription appeared to begin at the TE insertion point, rather than the transcriptional start site of the gene (Figure 4F). Accessions containing the TE insertion had significantly higher AT2G01360 transcript abundance than the accessions without the TE insertion (p < 1.8 × 10−7, Mann-Whitney U test; Figure 4H). The apparent transcriptional activation, linked with presence of a TE belonging to the HELITRON1 family, indicates that this element may carry sequences or other regulatory information that has altered the expression of genes downstream of the TE insertion site. Importantly, this variant was classified as a young TE insertion, as it is not in linkage disequilibrium with surrounding SNPs, and therefore the associated changes in gene transcript abundance would not be identified using only SNP data. This TE variant was also upstream of QPT (AT2G01350), involved in NAD biosynthesis (Katoh et al. 2006), which did not show alterations in steady state transcript abundance associated with the presence of the TE variant, indicating a potential directionality of regulatory elements carried by the TE (Figure 4G, H). Overall, these examples demonstrate that TE variants can have unpredictable, yet important, effects on the expression of nearby genes, and these effects may be missed by studies focused on genetic variation at the level of SNPs.
TE variants drive DNA methylation differences between accessions
As TEs are frequently highly methylated in Arabidopsis (Lister et al. 2008; Cokus et al. 2008; Zhang et al. 2006; Zilberman et al. 2007), we next assessed the DNA methylation state surrounding TE variant sites to determine whether TE variants might be responsible for some of the differences in DNA methylation patterns previously observed between the wild accessions (Schmitz et al. 2013). We found that 61% of the 13,485 previously reported population DMRs were located within 1 kb of a TE variant, significantly more than expected by chance (p < 1 × 10−4, determined by resampling 10,000 times; Figure 5A). Old TE variants were more often located close to population DMRs, with 41.5% of TE variants within 1 kb of a population DMR classified as old, 11.5% higher than the total genomic frequency of old elements (Figure 5B). There was a corresponding depletion of young TE variants in this set, with 48.5% being classified as young, 11.5% lower than the total genomic frequency of young TE variants. This indicates that established TE insertions may be more important than recent TE insertions in defining DNA methylation landscapes. Alternatively, unmethylated TEs may be preferentially lost from the population due to selection, resulting in old TE variants being more highly methylated. DNA methylation levels at population DMRs located within 1 kb of a TE variant, henceforth termed TE-DMRs, were positively correlated with the presence of the TE variant (Figure 5C), while DNA methylation levels at population DMRs further than 1 kb from a TE variant did not have a significant association with the presence/absence of the nearest TE variant. TE-DMRs were significantly more highly methylated in accessions containing the TE, suggesting that TE variants may facilitate changes in DNA methylation patterns between accessions (Welch’s t-test, p < 2.2 × 10−16; Figure 5D).
Overall, this indicates that a large fraction of the population DMRs previously identified between these accessions are associated with the presence of local TE insertions.
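The resampling test for proximity between population DMRs and TE variants can be sketched as below. This is a minimal single-chromosome illustration with hypothetical function names, in which DMRs and TE variants are reduced to single coordinates and random positions are drawn uniformly; the study's own resampling procedure (for example, how region sizes were matched) may differ.

```python
import bisect
import random


def fraction_near(dmrs, variants, window=1000):
    """Fraction of DMR positions lying within `window` bp of any variant position."""
    variants = sorted(variants)
    near = 0
    for pos in dmrs:
        i = bisect.bisect_left(variants, pos)
        # Only the closest variant on each side needs to be checked.
        neighbours = variants[max(i - 1, 0):i + 1]
        if any(abs(pos - v) <= window for v in neighbours):
            near += 1
    return near / len(dmrs)


def resampling_hits(dmrs, variants, genome_length, n_resamples=10000, seed=0):
    """Count resamples in which random positions match or exceed the observed fraction."""
    rng = random.Random(seed)
    observed = fraction_near(dmrs, variants)
    hits = 0
    for _ in range(n_resamples):
        # Draw as many random positions as there are TE variants.
        randomized = [rng.randrange(genome_length) for _ in variants]
        if fraction_near(dmrs, randomized) >= observed:
            hits += 1
    return observed, hits
```

With 10,000 resamples, observing zero hits corresponds to an empirical p-value below 1 × 10−4, which is the smallest value such a test can report.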
(A) Distance from each population DMR to closest TE variant (red), compared with a set of randomly selected regions of the genome, of the same size as all TE variants (blue). (B) Density of distance measurements between population DMRs and the closest TE variant, grouped by age classification of TE variant. (C) Density of Pearson correlation values between DNA methylation levels at population DMRs and the presence/absence of the nearest TE variants. Pearson correlation values for population DMRs within 1 kb of a TE variant (TE-DMRs) are shown in red, while values for population DMRs further than 1 kb from a TE variant (non TE-DMRs) are shown in blue. (D) Boxplot showing DNA methylation levels at TE-DMRs for accessions containing the TE insertion and accessions not containing the TE insertion. Asterisk indicates statistical significance (p ≤ 2.2 × 10−16, Welch’s t-test). Boxplots were constructed as for Figure 4C. (E) DNA methylation levels in 200 bp windows ± 2 kb from TE insertion sites, for accessions with and without the TE insertion. Heatmap is separated into regions of the genome less than 3 Mb from a centromere (pericentromeric regions) or greater than 3 Mb from a centromere (chromosome arms), and sorted by total DNA methylation at each locus. (F) Average DNA methylation levels in mCG, mCHG and mCHH contexts in regions surrounding TE insertion variant sites for pericentromeric insertions and insertions in the euchromatic chromosome arms. (G) Density of Pearson correlation values between DNA methylation levels in 200 bp regions flanking TE variant sites and the presence/absence of the TE variant, separated by TE age groups. Only TEs that were able to be assigned an age group are shown (total 7,300; minor allele frequency ≥3%).
We next examined levels of DNA methylation in regions flanking all TE insertions regardless of the presence or absence of a population DMR call, and similarly found that accessions containing a TE insertion had a highly localized enrichment in DNA methylation in all sequence contexts (Figure 5E, F), again indicating that TE variants may play a role in shaping DNA methylation landscapes between Arabidopsis accessions. As the increase in DNA methylation around TE insertion sites appeared to be restricted to regions <200 bp from the insertion site, we correlated DNA methylation levels in 200 bp regions flanking TE variants with the presence/absence of TE variants. DNA methylation levels were positively correlated with the presence of a TE (Figure 5G). Furthermore, DNA methylation level was more strongly correlated with the presence of old TE variants. These results indicate that DNA methylation patterns are influenced by the differential TE content of individual genomes, and that the DNA methylation-dependent silencing of TE variants may lead to formation of DMRs between wild Arabidopsis accessions. The age of TE variants also appears to be related to the DNA methylation state surrounding the TE insertion site, with older variants being more highly methylated, suggesting that many generations may be required before a new TE insertion reaches a highly methylated state.
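As an illustration of the flanking-window analysis, the following Python sketch bins per-cytosine read counts into 200 bp windows spanning ± 2 kb around an insertion site, and computes a Pearson correlation between flanking methylation levels and TE presence across accessions. It rests on assumptions not stated in the text: methylation level is taken as methylated reads divided by total reads within a window, the input layout and function names are hypothetical, and the example values are made up.

```python
from math import sqrt


def window_methylation(cytosines, site, flank=2000, width=200):
    """Methylation level per window around `site`.

    `cytosines` is an iterable of (position, methylated_reads, total_reads).
    Windows without coverage are reported as None.
    """
    n_windows = 2 * flank // width
    methylated = [0] * n_windows
    total = [0] * n_windows
    for pos, meth_reads, all_reads in cytosines:
        offset = pos - site
        if -flank <= offset < flank:
            idx = (offset + flank) // width
            methylated[idx] += meth_reads
            total[idx] += all_reads
    return [m / t if t else None for m, t in zip(methylated, total)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


# Toy example: one insertion site, and four accessions (made-up values).
levels = window_methylation(
    [(9950, 8, 10), (9990, 2, 10), (10050, 5, 10), (12500, 9, 10)], site=10000
)
te_present = [1, 1, 0, 0]                # TE presence per accession
flank_meth = [0.8, 0.6, 0.1, 0.3]        # flanking methylation per accession
r = pearson(te_present, flank_meth)
```

A positive coefficient, as in this toy example, corresponds to higher flanking methylation in accessions that carry the TE.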
TE regulation in the germline
To explore when during development germline changes in TE content may occur, we sought to associate TE DNA methylation dynamics during germline development and early embryogenesis with the differential TE content we observe between the different Arabidopsis accessions. Complex DNA methylation changes occur in the Arabidopsis germline, particularly within TE sequences. During male germline development, the haploid microspore undergoes asymmetric division to produce the larger vegetative cell and a smaller cell that divides once more to produce two identical sperm cells (Figure 6A) (Kawashima and Berger 2014). Some TEs have previously been found to become transcriptionally active and to transpose in the vegetative cell nucleus (VN). This TE activation has been linked to an increase in 21 nt smRNAs derived from activated TE sequences, that are thought to then be transported to the sperm cells (Slotkin et al. 2009). In contrast, the sperm and post-meiotic microspore genomes lose DNA methylation in the mCHH context within many TEs due to decreased DRM2 abundance but are able to maintain TE silencing (Slotkin et al. 2009; Calarco et al. 2012). A largely distinct set of TEs are demethylated in the mCG context in the VN compared to the microspore and sperm cells, and are thought to be involved in epigenetic imprinting rather than TE silencing (Calarco et al. 2012). In the female germline, the ovule contains a diploid central cell and haploid egg cell. As in the VN, the non-generative central cell acts as a companion to the egg and expresses DEMETER, which encodes a DNA demethylase, resulting in global DNA demethylation (Ibarra et al. 2012). During fertilization, one sperm fertilizes the egg to form the diploid embryo, while the other sperm fertilizes the diploid central cell to form the triploid endosperm, which supplies nutrients to the developing embryo (Figure 6A) (Kawashima and Berger 2014). 
The endosperm remains globally demethylated, while DNA methylation is gradually restored in the embryo through RdDM (Jullien et al. 2012; Hsieh et al. 2009).
(A) Diagram of the developing pollen, ovule, and seed. During dual fertilization, one sperm fertilizes the egg forming the embryo, while the other fertilizes the central cell to form the endosperm. (B) Percentage of TEs differentially methylated in germline cell types that were absent in at least one non-Col-0 accession (shaded region) and expected percentage (histogram). 68% of TEs that lose mCHH in the sperm cell or microspore are absent from non-reference accessions, significantly more often than expected (p < 1 × 10−4, determined by resampling 10,000 times). Embryo DMRs are in comparison to aerial tissues, endosperm DMRs are in comparison to the embryo. (C) Frequency of transposition for TEs differentially methylated in the microspore (MS), sperm cell (SC), embryo (EM) and endosperm (EN). Black triangles represent the mean number of unique TE insertion sites caused by elements in each list, asterisk indicates significance (p ≤ 1 × 10−5, Welch’s t-test). Boxplots were constructed as for Figure 4C. (D) Correlation between whole seed 21-24 nt smRNA levels (normalized reads per million values in 300 bp windows; RPM) for TEs present in maternal or paternal genomes only, compared with smRNA levels in crosses where the TE is present or absent in both parental genomes. Direction of each cross is represented on axes as female × male. TE variants are always present in the Col-0 genome and absent in the non-Col-0 genome. R values are Pearson’s correlation coefficient. (E) Whole seed 21-24 nt smRNA abundance (RPM, as for D) in TEs absent in Cvi for crosses between Cvi and Col-0, or absent in Ler for crosses between Ler and Col-0. Direction of each cross is represented as female × male on boxplot labels. Replicates are plotted side-by-side, p-values are the result of a Mann-Whitney U test using averaged replicate data. Boxplots were constructed as for Figure 4C.
(F) Embryonic DNA methylation levels 6 days after pollination for Cvi-absent TEs that lose mCHH in sperm, following reciprocal crosses between Col-0 and Cvi. Boxplots were constructed as for Figure 4C.
To determine whether these TEs that undergo DNA methylation changes in the male germline or in the embryo and endosperm are especially active elements, we examined both the number of unique insertions caused by these differentially methylated elements (transposition frequency), as well as the presence/absence variability of these TEs among the wild accessions (Figure 6B, C) (Hsieh et al. 2009; Calarco et al. 2012). We found that 68% of TEs previously found to be demethylated in the mCHH context in Col-0 sperm and microspore were absent from non-Col-0 accessions (Figure 6B), a much larger fraction than expected by chance (p < 1 × 10−4, determined by resampling 10,000 times) (Calarco et al. 2012). In contrast, TEs differentially methylated in the mCG context in the sperm or microspore, thought to be involved in epigenetic imprinting, or in the mCHH context in the developing embryo (Hsieh et al. 2009), were not absent from non-Col-0 accessions significantly more often than expected (p > 0.09, determined by resampling 10,000 times; Figure 6B). While TEs mCG-hypomethylated in the endosperm were absent from non-Col-0 genomes significantly more often than expected (p < 1 × 10−4, determined by resampling 10,000 times), the scale of this difference was small, with only 30% of the TEs absent in non-Col-0 accessions. mCHH-demethylated TEs in the sperm and microspore also had a significantly higher transposition frequency, suggesting that they have been more active than most other TEs in the Arabidopsis genome (Figure 6C; Welch’s t-test, p < 1 × 10−5). 
While mCHH-hypermethylated TEs in the embryo were not found to be frequently absent from non-Col-0 accessions, these elements did show a significantly higher transposition frequency, as did elements mCG-hypomethylated in the endosperm, indicating that demethylation of active TEs in the germline companion cells (the VN and endosperm) may be important in driving DNA methylation changes in the generative cells (the sperm and embryo), perhaps acting to suppress transposition of active elements in the germline.
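The resampling test referred to above can be illustrated with a minimal sketch. It assumes that the null distribution is built by repeatedly drawing random TE sets of the same size as the differentially methylated set and recording the fraction absent from at least one accession; the function name, arguments, and fixed seed are hypothetical and are not taken from the analysis code used in this study.

```python
import random


def resampling_p_value(all_tes, absent_tes, subset_size, observed_fraction,
                       n_resamples=10_000, seed=0):
    """Empirical one-sided p-value for enrichment of absent TEs.

    all_tes: iterable of TE identifiers (the background set).
    absent_tes: set of TE identifiers absent from at least one accession.
    subset_size: number of TEs in the set of interest.
    observed_fraction: fraction of the set of interest found in absent_tes.
    """
    rng = random.Random(seed)          # fixed seed is an assumption, for reproducibility
    background = list(all_tes)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        draw = rng.sample(background, subset_size)
        fraction = sum(te in absent_tes for te in draw) / subset_size
        if fraction >= observed_fraction:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_resamples
```

With 10,000 resamples, a result of zero random draws reaching the observed fraction corresponds to a reported bound of p < 1 × 10−4.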
TEs that are demethylated in the mCHH context in the developing sperm may rely on smRNAs in the seed for the restoration of proper DNA methylation of paternal TE sequences in the embryo following fertilization, through RdDM (Jullien et al. 2012). These smRNAs are thought to be maternally-derived, and are likely produced in the endosperm (Mosher et al. 2009; Calarco et al. 2012). We sought to determine whether TEs present in only one parental genome in an intraspecific cross would show altered levels of smRNAs targeted to these TEs, as the absence of a TE in the maternal genome may lead to the loss of a maternally-derived smRNA signal in the seed. Using previously published whole seed 21-24 nt smRNA data for reciprocal crosses between Col-0 and Cvi or Ler (Pignatta et al. 2014; Gehring et al. 2014), we examined seed smRNA levels in TEs present in the Col-0 genome but absent from the Cvi genome for crosses between Col-0 and Cvi, or absent from the Ler genome for crosses between Col-0 and Ler (Figure 6D, direction of each cross is depicted as female × male on axis labels). Consistent with previous reports, we found that smRNA levels in TEs were dependent on maternal genotype but independent of paternal genotype, indicating that they are maternally-derived. Furthermore, we confirmed that TEs present in only the paternal genome in a cross had significantly lower 21-24 nt smRNA levels compared with crosses where the TE was present in the maternal genome (p < 1.4 × 10−10, Mann-Whitney U test; Figure 6E). As these smRNAs are thought to be required for TE silencing in the embryo (Jullien et al. 2012; Matzke and Mosher 2014), this may then lead to TE activation and may explain the higher transposition frequencies we observe for TEs demethylated in the mCHH context in the sperm genome.
To test whether the absence of a TE in the maternal genome was sufficient to prevent the re-methylation of the paternal copy of that TE following fertilization, we examined previously published embryonic DNA methylation data within Cvi-absent TEs following reciprocal crosses between Col-0 and Cvi (Pignatta et al. 2014) (Figure 6F). For all crosses between Col-0 and Cvi, Cvi-absent TEs showed significantly lower mCG levels, and this is likely due to the global reduction in mCG levels that exists in the Cvi genome rather than the absence of the TE in one parent (Pignatta et al. 2014). Surprisingly, we found for TEs that lose mCHH in the sperm (Calarco et al. 2012), when the TE was absent from the maternal genome (Cvi × Col-0), embryonic mCHH levels within these TEs were not significantly different from DNA methylation levels in the Col-0 × Col-0 cross, where the TE was present in both parents (p > 0.04, Mann-Whitney U test; Figure 6F). This was contrary to the anticipated pattern, where the loss of maternal smRNAs observed for this cross (Figure 6E) would prevent these sperm-demethylated TEs from being re-methylated in the embryo, and may indicate that maternally-derived smRNAs are not essential for establishing DNA methylation patterns in early embryonic development. However, when the parents were reversed and the TE variants were present in the maternal genome but absent from the paternal genome (Col-0 × Cvi), we found a significant reduction in mCHH methylation for these TEs, indicating that paternal signals may be required for DNA methylation of maternal TE sequences (p < 5 × 10−7, Mann-Whitney U test). To further investigate this result, we analyzed an independent dataset containing embryonic DNA methylation data for crosses between Col-0 and Ler, generated at a slightly later developmental stage (7-8 days after pollination) than the data for crosses between Col-0 and Cvi (6 days after pollination). 
Within TEs absent from Ler and demethylated in the Col-0 sperm (Ibarra et al. 2012) we found no significant difference in DNA methylation levels in any context dependent on the parental genotypes (Figure 6 - figure supplement 1). This could indicate that embryonic DNA methylation changes within maternal TEs, dependent upon paternal genotype, may be restricted to early developmental time points, less than 6 days after pollination, or may be dependent on the intraspecific cross performed.
Considering the high proportion of sperm mCHH-demethylated TEs that were found to be absent in non-Col-0 accessions (Figure 6B), crosses between plants with differential presence/absence of these TEs may be possible in the wild, and such crosses may lead to the demethylation of the maternal copies of these TEs in the embryo when a paternal copy is absent. These findings support a model for TE silencing escape facilitated by genetic differences between parents, as the loss of DNA methylation within TEs has been shown previously to be sufficient for transcriptional activation and transposition of demethylated TEs (Gendrel et al. 2002; Hirochika et al. 2000; Stroud et al. 2013). Such mating situations may also lead to the formation of stable epialleles depending on the length of time the TE sequences are able to escape remethylation, and this may play a role in formation of the patterns of differential DNA methylation previously observed between wild accessions (Schmitz et al. 2013).
DISCUSSION
Here, we discovered widespread differential TE content between wild Arabidopsis accessions. A subset (32%) of TE variants with a minor allele frequency above 3% were able to be tested for linkage with nearby SNPs. The majority of these TE variants were unlinked to surrounding SNPs, indicating that they represent genetic variants currently overlooked in genomic studies. We found a marked depletion of TE variants within gene bodies and DNase I hypersensitivity sites, indicating that the more deleterious TE insertions have likely been removed from this population through selection. Importantly, we were able to identify examples where TE variants appear to have an effect upon gene expression, both in the disruption of transcription and in the spreading or disruption of regulatory information leading to the transcriptional activation of genes, indicating that these TE variants can have important consequences upon expression of protein coding genes. Furthermore, we provide evidence that differential TE content between genomes of wild Arabidopsis accessions underlies a large fraction of the previously reported DNA methylation differences between accessions. Thus, the frequency of pure epialleles, independent of underlying genetic variation, may be even more rare than previously anticipated (Richards 2006). The level of DNA methylation changes associated with TE variants is related to TE age, with old variants being more strongly correlated with increased DNA methylation levels. This suggests that the methylation of new TE insertions is a gradual process that occurs incrementally over many generations, or that unmethylated TEs are preferentially lost from the population over time.
Identification of TE variants between Arabidopsis accessions has also enabled a closer examination of the changes in TE smRNA and DNA methylation levels following fertilization of intraspecific hybrids. smRNA levels in the seed appear to be dependent on maternal genome content, as the presence of a TE in only the paternal genome of a cross is associated with decreased levels of corresponding 21-24 nt smRNAs derived from those TEs in the seed. This loss of smRNAs, dependent on maternal genotype, is strikingly similar to findings from studies performed in Drosophila melanogaster over 3 decades ago, where maternal absence of paternal P elements in a cross was found to lead to activation and frequent transposition of these P elements, due to absence of maternally-derived smRNA signals needed for TE silencing, while TE activation was not observed when the P elements were present in the maternal genome (Bingham et al. 1982; Blumenstiel and Hartl 2005). This is thought to be the underlying cause of hybrid dysgenesis, a phenotype characterised by sterility and high rates of germline TE activity, and where the transpositions caused by active P elements are often lethal to the hybrid offspring. It has been hypothesized previously that a process similar to Drosophila hybrid dysgenesis may occur in Arabidopsis, as smRNAs in the Arabidopsis seed also appear to be maternally-derived, and there is some prior evidence that this may be true (Martienssen 2010). Interspecific crosses between Arabidopsis thaliana females and Arabidopsis arenosa males were observed to lead to the expression of previously silent paternal ATHILA TEs, thought to be due to the presence of ATHILA elements in the A. arenosa genome that are absent from the A. thaliana genome (Josefsson et al. 2006). However, the reciprocal cross is impossible to generate due to pollination failure, constituting a fundamental limitation of this system for studying hybrid dysgenesis. The use of wild A. thaliana accessions may prove more fruitful in future experiments aiming to elucidate the processes of germline TE regulation in plants.
Surprisingly, we found that this decrease in smRNA levels targeting paternal TE sequences was not linked to DNA demethylation of the corresponding paternal TEs in the embryo, but instead observed an inverse relationship, where TEs absent from the paternal genome in a cross were linked with embryonic demethylation of the maternal copy. This indicates that maternal smRNAs are not essential for restoring the paternal patterns of DNA methylation that are erased in the sperm, nor are they sufficient to maintain DNA methylation within maternal TEs in the absence of paternal TE copies. As 21 nt smRNA production is greatly increased within the pollen VN, and these smRNAs may be transported to the sperm cells (Slotkin et al. 2009), it is possible that these paternal smRNAs remain present in the sperm cytoplasm during fertilization. These smRNAs may play an important role in establishing early patterns of DNA methylation in the embryo, perhaps explaining the non-reliance of paternal TE sequences upon maternal smRNAs. This issue is somewhat complicated by the lack of embryo-specific smRNA data, as all existing data have been generated from whole seed. If maternal smRNAs are produced and remain in the endosperm rather than the embryo, this could further explain the apparent reliance upon paternal silencing signals for proper establishment of DNA methylation patterns following fertilization. Our data provides the first evidence that embryonic TE silencing in Arabidopsis may be dependent on paternal, rather than maternal, silencing factors.
DNA methylation changes triggered by genetic differences between parents clearly occur in Arabidopsis, although perhaps via a different mechanism than that responsible for causing hybrid dysgenesis in Drosophila. These DNA methylation changes that occur in the embryo may play an important role in the formation of DNA methylation differences between wild Arabidopsis accessions, depending on the length of time these changes are able to persist in hybrid progeny. If the loss of DNA methylation within maternal TEs that we observe is able to persist in hybrid plants, this may lead to the formation of stable epialleles as the demethylated TE copy is propagated through the population. Further experiments will be required to determine the stability of these embryonic changes in DNA methylation that occur in hybrid plants. Alternatively, if these DNA methylation changes are limited to a small developmental window, perhaps less than 6 days after pollination, there may only be a short period of time where TE silencing is lost and TE activation can occur, leading to new TE insertions in the early embryo. Overall, our results show that TE presence/absence variants between wild Arabidopsis accessions can be linked to many DNA methylation changes previously observed in the population, and can have important consequences upon nearby gene expression. Furthermore, the differential TE content between parents can lead to DNA methylation changes in the early embryo, and could lead to activation of these elements.
METHODS
TEPID development
Mapping
FASTQ files are mapped to the reference genome using the ‘tepid-map’ algorithm (Figure 1A). This first calls bowtie2 (Langmead and Salzberg 2012) with the following options: ‘--local’, ‘--dovetail’, ‘--fr’, ‘-R5’, ‘-N1’. Soft-clipped and unmapped reads are extracted using Samblaster (Faust and Hall 2014), and remapped using the split read mapper Yaha (Faust and Hall 2012), with the following options: ‘-L 11’, ‘-H 2000’, ‘-M 15’, ‘-osh’. Split reads are extracted from the Yaha alignment using Samblaster (Faust and Hall 2014). Alignments are then converted to bam format, sorted, and indexed using samtools (Li et al. 2009).
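As an illustration of the first mapping step, the bowtie2 options listed above can be assembled into a command line as follows. This is a sketch only: the function name and file paths are hypothetical, the `-x`, `-1`, `-2` and `-S` arguments are standard bowtie2 options added here for completeness, and the way ‘tepid-map’ invokes bowtie2 internally may differ.

```python
def bowtie2_command(index_prefix, fastq1, fastq2, sam_out):
    """Build a bowtie2 argument list using the options named in the text."""
    return [
        "bowtie2",
        "--local",       # local alignment, allowing soft-clipped read ends
        "--dovetail",    # treat dovetailed mates as concordant
        "--fr",          # expected mate orientation: forward/reverse
        "-R", "5",       # re-seeding attempts for repetitive seeds
        "-N", "1",       # mismatches allowed in a seed alignment
        "-x", index_prefix,
        "-1", fastq1,
        "-2", fastq2,
        "-S", sam_out,
    ]
```

The resulting list can be passed to `subprocess.run` when bowtie2 is installed; the paths are placeholders.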
TE variant discovery
The ‘tepid-discover’ algorithm examines mapped bam files generated by the ‘tepid-map’ step to identify TE presence/absence variants with respect to the reference genome. First, the mean sequencing coverage, the mean library insert size, and the standard deviation of the library insert size are estimated. Discordant read pairs are then extracted, defined as mate pairs that map more than 4 standard deviations from the mean insert size from one another, or on separate chromosomes.
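The discordant read pair definition can be sketched as below. The class and function names are hypothetical, the distance between mate start positions is used as a simplified stand-in for the insert size, and a deviation in either direction from the mean is treated as discordant; these are assumptions for illustration rather than a description of the TEPID source code.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def insert_size_stats(insert_sizes):
    """Estimate the mean and sample standard deviation of the insert size."""
    return mean(insert_sizes), stdev(insert_sizes)


def is_discordant(pair, mean_insert, sd_insert, n_sd=4.0):
    """True if mates are on separate chromosomes or map unexpectedly far apart."""
    if pair.chrom1 != pair.chrom2:
        return True
    distance = abs(pair.pos2 - pair.pos1)   # simplified proxy for insert size
    # Assumption: deviation from the mean in either direction counts.
    return abs(distance - mean_insert) > n_sd * sd_insert
```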
To identify TE insertions with respect to the reference genome, split read alignments are first filtered to remove reads where the distance between split mapping loci is less than 5 kb, to remove split reads due to small indels, or split reads with a mapping quality (MAPQ) less than 5. Split and discordant read mapping coordinates are then intersected using pybedtools (Dale et al. 2011; Quinlan and Hall 2010) with the Col-0 reference TE annotation, requiring 80% overlap between TE and read mapping coordinates. To determine putative TE insertion sites, regions are then identified that contain independent discordant read pairs aligned in an orientation facing one another at the insertion site, with their mate pairs intersecting with the same TE (Figure 1A). The total number of split and discordant reads intersecting the insertion site and the TE is then calculated, and a TE insertion predicted where the combined number of reads is greater than a threshold determined by the average sequencing depth over the whole genome (1/10 coverage if coverage is greater than 10, otherwise a minimum of 2 reads). Alternatively, in the absence of discordant reads mapped in orientations facing one another, the required total number of split and discordant reads at the insertion site linked to the inserted TE is set higher, requiring twice as many reads.
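The read count thresholds for insertion calls can be written out as a short sketch. The function names are hypothetical, and the comparison is written as "at least the threshold" so that the stated minimum of 2 reads is sufficient at low coverage; the exact comparison used by ‘tepid-discover’ is an assumption here.

```python
def min_supporting_reads(mean_coverage, facing_discordant_pairs=True):
    """Reads required to call an insertion at a given mean sequencing depth."""
    # 1/10 of coverage when coverage exceeds 10, otherwise a minimum of 2 reads.
    base = mean_coverage / 10 if mean_coverage > 10 else 2
    # Without discordant pairs facing one another, twice as many reads are required.
    return base if facing_discordant_pairs else 2 * base


def call_insertion(n_split, n_discordant, mean_coverage, facing_discordant_pairs):
    """Combine split and discordant read counts and compare to the threshold."""
    total = n_split + n_discordant
    return total >= min_supporting_reads(mean_coverage, facing_discordant_pairs)
```

For example, at 40x coverage four supporting reads would be required when facing discordant pairs are present, and eight otherwise.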
To identify TE absence variants with respect to the reference genome, split and discordant reads separated >20 kb from one another are first removed, as 99.9% of Arabidopsis TEs are shorter than 20 kb, and this removes split reads due to larger structural variants not related to TE diversity (Figure 2 - figure supplement 5). Col-0 reference annotation TEs that are located within the genomic region spanned by the split and discordant reads are then identified. TE absence variants are predicted where at least 80% of the TE sequence is spanned by a split or discordant read, and the sequencing depth within the spanned region is <10% the sequencing depth of the 2 kb flanking sequence, and there are a minimum number of split and discordant reads present, determined by the sequencing depth (1/10 coverage; Figure 1A). A threshold of 80% TE sequence spanned by split or discordant reads is used, as opposed to 100%, to account for misannotation of TE sequence boundaries in the Col-0 reference TE annotation, as well as TE fragments left behind by DNA TEs during cut-paste transposition (TE footprints) that may affect the mapping of reads around annotated TE borders (Plasterk 1991). This was found to improve TE absence detection using simulated data. Furthermore, the coverage within the spanned region may be more than 10% that of the flanking sequence, but in such cases twice as many split and discordant reads are required. If multiple TEs are spanned by the split and discordant reads, and the above requirements are met, multiple TEs in the same region can be identified as absent with respect to the reference genome. Absence variants in non-Col-0 accessions are subsequently recategorized as TE insertions present in the Col-0 genome but absent from a given wild accession.
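The criteria for a TE absence call can be summarized in the following sketch, which assumes half-open genomic coordinates and treats the read threshold as a minimum count. All names are hypothetical and the function illustrates the decision rule described above, not the TEPID implementation itself.

```python
def call_absence(te_start, te_end, span_start, span_end,
                 depth_in_span, depth_flank, n_reads, mean_coverage):
    """Decide whether a reference TE is absent, given spanning read evidence."""
    te_length = te_end - te_start
    if te_length <= 0:
        return False
    overlap = max(0, min(te_end, span_end) - max(te_start, span_start))
    # At least 80% of the TE must be spanned by a split or discordant read.
    if overlap / te_length < 0.8:
        return False
    min_reads = mean_coverage / 10
    # Depth within the spanned region below 10% of the 2 kb flanking depth.
    depleted = depth_in_span < 0.1 * depth_flank
    # If coverage is not depleted, twice as many supporting reads are required.
    required = min_reads if depleted else 2 * min_reads
    return n_reads >= required
```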
TE variant refinement
Once TE insertions are identified using the ‘tepid-map’ and ‘tepid-discover’ algorithms, these variants can be refined if multiple related samples are analysed. The ‘tepid-refine’ algorithm is designed to interrogate regions of the genome in which a TE insertion was discovered in other samples but not the sample in question, and check for evidence of that TE insertion in the sample using lower read count thresholds compared to the ‘tepid-discover’ step. In this way, the refine step leverages TE variant information for a group of related samples to reduce false negative calls within the group. This distinguishes TEPID from other similar methods for TE variant discovery utilizing short sequencing reads. A file containing the coordinates of each insertion and a list of the names of the samples containing the TE insertion must be provided to the ‘tepid-refine’ algorithm; this file can be generated using the ‘merge_insertions.py’ script included in the TEPID package. Each sample is examined in regions where there was a TE insertion identified in another sample in the group. If there is a sequencing breakpoint within this region (no continuous read coverage spanning the region), split reads mapped to this region will be extracted from the alignment file and their coordinates intersected with the TE reference annotation. If there are split reads present at the variant site that are linked to the same TE as was identified as an insertion at that location, this TE insertion is recorded in a new file as being present in the sample in question. If there is no sequencing coverage in the queried region for a sample, an “NA” call is made indicating that it is unknown whether the particular sample contains the TE insertion or not.
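The per-sample decision made during refinement can be outlined as a small sketch. The function name, its arguments, and the "absent" label for sites without supporting evidence are hypothetical; only the "NA" call and the conditions for recording an insertion as present follow the description above.

```python
def refine_call(has_coverage, has_breakpoint, split_read_tes, expected_te):
    """Classify one sample at a site where another sample carries an insertion.

    has_coverage: whether any reads cover the queried region.
    has_breakpoint: whether continuous read coverage across the region is missing.
    split_read_tes: set of TE names linked to split reads at the site.
    expected_te: TE name identified as inserted at this site in another sample.
    """
    if not has_coverage:
        return "NA"          # unknown: no sequencing coverage in the region
    if has_breakpoint and expected_te in split_read_tes:
        return "present"     # split reads link the site to the same TE
    return "absent"          # assumption: no evidence is treated as no insertion
```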
While the above description relates specifically to use of TEPID for identification of TE variants in Arabidopsis in this study, this method can be also applied to other species, with the only prerequisite being the annotation of TEs in a reference genome and the availability of paired-end DNA sequencing data.
TE variant simulation
To test the sensitivity and specificity of TEPID, 100 TE insertions (50 copy-paste transpositions, 50 cut-paste transpositions) and 100 TE absence variants were simulated in the Arabidopsis genome using the RSVSim R package, version 1.7.2 (Bartenhagen and Dugas 2013), and synthetic reads generated from the modified genome at various levels of sequencing coverage using wgsim (Li et al. 2009) (https://github.com/lh3/wgsim). These reads were then used to calculate the true positive, false positive, and false negative TE variant discovery rates for TEPID at various sequencing depths, by running ‘tepid-map’ and ‘tepid-discover’ using the simulated reads with the default parameters (Figure 1B).
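Counting true positive, false positive, and false negative calls against a simulated truth set could be done along the following lines. The matching rule (same chromosome, position within a fixed tolerance, each simulated variant matched at most once) and the 100 bp default are assumptions made for this sketch; the text does not state how calls were matched to simulated variants.

```python
def score_calls(simulated, called, tolerance=100):
    """Return (true positives, false positives, false negatives).

    simulated, called: lists of (chromosome, position) tuples.
    tolerance: maximum distance in bp for a call to match a simulated site
               (hypothetical value).
    """
    matched = set()
    true_pos = 0
    false_pos = 0
    for chrom, pos in called:
        hit = None
        for i, (sim_chrom, sim_pos) in enumerate(simulated):
            if i not in matched and sim_chrom == chrom and abs(sim_pos - pos) <= tolerance:
                hit = i
                break
        if hit is None:
            false_pos += 1
        else:
            matched.add(hit)
            true_pos += 1
    false_neg = len(simulated) - len(matched)
    return true_pos, false_pos, false_neg
```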
Ler TE analysis
Previously published 100 bp paired end sequencing data for Ler (http://1001genomes.org/data/MPI/MPISchneeberger2011/releases/current/Ler-1/Reads/; Schneeberger et al. 2011) was downloaded and analyzed with the TEPID package to identify TE variants. Reads providing evidence for TE variants were then mapped to the de novo assembled Ler genome (Chin et al. 2013). To determine whether reads mapped to homologous regions of the Ler and Col-0 reference genome, the de novo assembled Ler genome sequence between mate pair mapping locations in Ler was extracted, with repeats masked using RepeatMasker with RepBase-derived libraries and the default parameters (version 4.0.5, http://www.repeatmasker.org). A blastn search was then conducted against the Col-0 genome using the following parameters: ‘-max_target_seqs 1’, ‘-evalue 1e-6’ (Camacho et al. 2009). Coordinates of the top blast hit for each read location were then compared with the TE variant sites identified using those reads.
Arabidopsis TE variant discovery
We ran TEPID, including the insertion refinement step, on previously published sequencing data for 216 different Arabidopsis populations (NCBI SRA SRA012474; Schmitz et al. 2013), mapping to the TAIR10 reference genome and using the TAIR9 TE annotation. The ‘--mask’ option was also used to mask the mitochondrial and plastid genomes. We also ran TEPID using previously published transgenerational data for salt stress and control conditions (NCBI SRA SRP045804; Jiang et al. 2014), using the ‘--mask’ option to mask mitochondrial and plastid genomes, and the ‘--strict’ option for highly related samples.
TE variant / SNP comparison
SNP information for 216 Arabidopsis accessions was obtained from the 1001 genomes data center (http://1001genomes.org/data/Salk/releases/2013_24_01/; Schmitz et al. 2013). This was formatted into reference (Col-0 state), alternate, or NA calls for each SNP. Accessions with both TE variant information and SNP data were selected for analysis. Hierarchical clustering of accessions by SNPs as well as TE variants was used to identify essentially clonal accessions, as these would skew minor allele frequency calculations. A single representative from each cluster of similar accessions was kept, leading to a total of 187 accessions for comparison. For each TE variant with minor allele frequency greater than 3%, the nearest 300 upstream and 300 downstream SNPs with a minor allele frequency greater than 3% were selected. Pairwise genotype correlations (r² values) for all complete cases were obtained for SNP-SNP and SNP-TE variant states. r² values were then ordered by decreasing rank and a median SNP-SNP rank value was calculated. For each of the 600 ranked surrounding positions, the number of times the TE rank was greater than the SNP-SNP median rank was calculated as a relative ‘age’ metric of TE to SNP. TE variants with less than 200 ranks over the SNP-SNP median were classified as ‘young’ insertions. Mid-age TE variants had ranks between 200 and 400, while TE variants with greater than 400 ranks above their respective SNP-SNP median value were classified as ‘old’ variants.
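Two pieces of this procedure lend themselves to a short sketch: the pairwise genotype correlation over complete cases, and the binning of TE variants into age classes. The genotype encoding (0 for reference, 1 for alternate, `None` for NA), the function names, and the treatment of the boundary values 200 and 400 as mid-age are assumptions for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two genotype vectors over complete cases."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None          # correlation undefined for an invariant site
    return (sxy * sxy) / (sxx * syy)


def classify_age(ranks_above_median):
    """Bin a TE variant by the number of ranks above the SNP-SNP median."""
    if ranks_above_median < 200:
        return "young"
    if ranks_above_median <= 400:
        return "mid"
    return "old"
```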
PCR validations
Selection of accessions to be genotyped
To assess the accuracy of TE variant calls in accessions with a range of sequencing depths of coverage, we grouped accessions into quartiles based on sequencing depth of coverage and randomly selected a total of 14 accessions for PCR validations from these quartiles. DNA was extracted for these accessions using the Edwards extraction protocol (Edwards et al. 1991), and purified prior to PCR using AMPure beads.
Selection of TE variants for validation and primer design
Ten TE insertion sites and 10 TE absence sites were randomly selected for validation by PCR amplification. Only insertions and absence variants that were variable in at least two of the fourteen accessions selected to be genotyped were considered. For insertion sites, primers were designed to span the predicted TE insertion site. For TE absence sites, two primer sets were designed; one primer set to span the TE, and another primer set with one primer annealing within the TE sequence predicted to be absent, and the other primer annealing in the flanking sequence (Figure 1C). Primer sequences were designed that did not anneal to regions of the genome containing previously identified SNPs in any of the 216 accessions (Schmitz et al. 2013) or small insertions and deletions, identified using lumpy-sv with the default settings (Layer et al. 2014)(https://github.com/arq5x/lumpy-sv), had an annealing temperature close to 52°C calculated based on nearest neighbor thermodynamics (MeltingTemp submodule in the SeqUtils python module; Cock et al. 2009), a GC content between 40% and 60%, and contained the same base repeated not more than four times in a row. Primers were aligned to the TAIR10 reference genome using bowtie2 (Langmead and Salzberg 2012) with the ‘-a’ flag set to report all alignments, and those with more than 5 mapping locations in the genome were then removed.
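Two of the sequence-based primer criteria, GC content and homopolymer run length, can be checked with a short sketch. The function names are hypothetical, and the annealing temperature, SNP/indel overlap, and genome-wide alignment filters described above are deliberately omitted because they depend on external tools and data.

```python
from itertools import groupby


def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return sum(base in "GC" for base in primer) / len(primer)


def longest_run(primer):
    """Length of the longest stretch of one repeated base."""
    return max(len(list(group)) for _, group in groupby(primer.upper()))


def passes_sequence_filters(primer):
    """GC content between 40% and 60%, and no base repeated more than four times in a row."""
    return 0.4 <= gc_content(primer) <= 0.6 and longest_run(primer) <= 4
```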
PCR
PCR was performed with 10 ng of extracted, purified Arabidopsis DNA using Taq polymerase. PCR products were analysed by agarose gel electrophoresis. Col-0 was used as a positive control, and water was added to reactions as a negative control.
mRNA analysis
Processed mRNA data for 144 wild Arabidopsis accessions were downloaded from NCBI GEO GSE43858 (Schmitz et al. 2013). To find differential gene expression dependent on TE presence/absence variation, we first filtered TE variants to include only those where the TE variant was shared by at least 5 accessions with RNA data available, corresponding to a minor allele frequency above 3%. We then grouped accessions based on TE presence/absence variants, and performed a Mann-Whitney U test to determine differences in RNA transcript abundance levels between the groups. We used q-value estimation to correct for multiple testing, using the R qvalue package v2.2.2 with the following parameters: lambda = seq(0, 0.6, 0.05), smooth.df = 4 (Storey and Tibshirani 2003). Genes were defined as differentially expressed where there was a greater than 2-fold difference in expression between the groups, with a q-value less than 0.001. Gene ontology enrichment analysis was performed using DAVID (https://david.ncifcrf.gov/).
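The filtering thresholds in this analysis can be made concrete with a brief sketch. Note that 5 of 144 accessions corresponds to approximately 3.5%, which is above the 3% minor allele frequency cut-off. The function names are hypothetical, and computing fold change as the ratio of per-group summary expression values is an assumption; the statistical test and q-value estimation themselves are not reproduced here.

```python
def passes_carrier_filter(n_carriers, min_carriers=5):
    """Keep TE variants shared by at least 5 accessions with RNA data."""
    return n_carriers >= min_carriers


def is_differentially_expressed(expr_present, expr_absent, q_value,
                                min_fold=2.0, max_q=0.001):
    """Apply the fold-change (> 2-fold) and q-value (< 0.001) thresholds."""
    low, high = sorted((expr_present, expr_absent))
    if high <= 0:
        return False                     # gene not expressed in either group
    # Assumption: a gene silent in one group counts as an unbounded fold change.
    fold = float("inf") if low <= 0 else high / low
    return fold > min_fold and q_value < max_q
```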
DNA methylation data analysis
Processed DNA methylation data for wild Arabidopsis accessions were downloaded from NCBI GEO GSE43857 (Schmitz et al. 2013). Weighted embryo DNA methylation data in 300 bp windows for Col-0 crosses with Cvi 6 days after pollination were downloaded from the Dryad Digital Repository; http://dx.doi.org/10.5061/dryad.gv536.2 (Gehring et al. 2014). Processed embryo DNA methylation data in 50 bp windows for crosses between Col-0 and Ler 7-8 days after pollination were downloaded from NCBI GEO GSE38935 (Ibarra et al. 2012).
Small RNA data analysis
Normalized 21-24 nt smRNA read counts (reads per million reads mapped; RPM) in 300 bp windows for whole seed 6 days after pollination, for reciprocal crosses between Col-0, Cvi, and Ler were downloaded from the Dryad Digital Repository; http://dx.doi.org/10.5061/dryad.gv536.2 (Gehring et al. 2014).
DATA ACCESS
TEPID source code can be accessed at https://github.com/ListerLab/TEPID. Ler TE variants are available in Supplementary File 1. TE variants identified among the 216 wild Arabidopsis accessions resequenced by Schmitz et al. (2013) are available in Supplementary File 2.
AUTHOR CONTRIBUTIONS
R.L. and T.S. designed the research project. R.L. and J.B. supervised research. T.S. developed and tested TEPID. J.C. performed PCR validations of TE variants. T.S. and S.R.E. performed bioinformatic analysis. R.L., T.S., J.B. and S.R.E. prepared the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare no competing financial interests.
(A) Histograms showing number of TE insertion variants identified for each accession, separated by all TE insertions, and the different TE insertion age classes.
(B) TE absence variants as for A.
Histogram showing number of accessions sharing each TE variant identified.
Chromosomal distributions of genes, Col-0 TEs, TE variants and population DMRs.
(A) Number of TE insertion variants within each 300 kb KNOT ENGAGED ELEMENT (KEE, vertical lines) and the number of TE insertion variants found in 10,000 randomly selected 300 kb windows (histogram).
(B) Table showing number of TE insertion variants within each KEE region, and the associated p-value determined by resampling 10,000 times.
(A) Histogram showing lengths of all annotated TEs in the Col-0 reference genome.
(B) Histogram showing lengths of all TE variants.
(C) Density distribution of log10 TE length for all Col-0 TEs (red) and TE variants (blue).
The percentage of TE variants by TE families were compared to the percentage expected due to the genomic frequency of elements of each TE family. Families with more TE presence/absence variants than expected are plotted in red, indicating percentage enrichment of those elements, while those with a lower number of TE variants than expected are plotted in blue, indicating percentage depletion for that TE family.
DNA methylation levels in sperm-demethylated TEs absent from Ler but present in Col-0. Levels are log2 fold change mC/C, for each DNA methylation context, between Ler x Col-0 and Col-0 x Ler DNA methylation levels. Positive values indicate higher methylation in the Ler x Col-0 (female x male) cross, whereas negative values represent higher methylation in the Col-0 x Ler cross. Boxplots were constructed as for Figure 4C.
TE superfamily enrichments for TE variants
TE family enrichments for TE variants.
ACKNOWLEDGMENTS
This work was supported by the Australian Research Council (ARC) Centre of Excellence program in Plant Energy Biology CE140100008 (J.B., R.L.). R.L. was supported by an ARC Future Fellowship (FT120100862) and Sylvia and Charles Viertel Senior Medical Research Fellowship, and work in the laboratory of R.L. was funded by the Australian Research Council. T.S. was supported by the Jean Rogerson Postgraduate Scholarship. S.R.E. was supported by an Australian Research Council Discovery Early Career Research Award (DE150101206). We thank Robert J. Schmitz, Mathew G. Lewsey, Ronan C. O’Malley, and Ian Small for their critical reading of the manuscript. We also thank Kevin Murray for his helpful comments regarding the development of TEPID.