Abstract
Transposons are mobile elements that are commonly silenced to protect eukaryotic genome integrity. In plants, transposable elements (TEs) can be activated during stress conditions and subsequently insert into gene-rich regions. TE-derived inverted repeats (IRs) are commonly found near plant genes, where they affect host gene expression with potentially positive effects on adaptation. However, the molecular mechanisms by which these IRs control gene expression are unclear in most cases. Here, we identify in the Arabidopsis thaliana genome hundreds of IRs located near genes that are transcribed by RNA Polymerase II, resulting in the production of 24-nt small RNAs that trigger methylation of the IRs. The expression of these IRs is associated with drastic changes in the local 3D chromatin organization, which alter the expression pattern of the hosting genes. Notably, the presence and structure of many IRs differ between A. thaliana accessions. Capture-C sequencing experiments revealed that such variation changes short-range chromatin interactions, which translates into changes in gene expression patterns. CRISPR/Cas9-mediated disruption of two such IRs leads to a switch in genome topology and gene expression, with phenotypic consequences. Our data demonstrate that the insertion of an IR near a gene provides an anchor point for chromatin interactions that can profoundly impact the activity of neighboring loci. This turns IRs into powerful evolutionary agents that can contribute to rapid adaptation.
Introduction
Transposable elements (TEs) are widely distributed among eukaryotic genomes. In a process known as transposition, TEs move within the genome to different locations, usually copying themselves as they ‘jump’ (Dubin et al., 2018). Plant genomes are particularly rich in TEs and repetitive elements, which, for example, account for 85% of the maize genome (Schnable et al., 2009). In plants, TEs are commonly silenced through DNA methylation in a process known as RNA-directed DNA methylation (RdDM), which maintains genome integrity (Matzke et al., 2015). To trigger RdDM, short RNA Polymerase IV (RNAPIV)-dependent TE transcripts are converted into double-stranded RNA (dsRNA) by the RNA-dependent RNA polymerase 2 (RDR2) and then into 24-nt small interfering RNAs (siRNAs) by DICER-Like 3 (DCL3) (Matzke et al., 2015; Zhou and Law, 2015). ARGONAUTE 4 (AGO4)-loaded siRNAs then direct de novo methylation of the TE loci by recognizing nascent RNAPV transcripts there. Such methylation ultimately leads to nucleosome condensation and permanent silencing of the TE. Still, massive bursts of TE amplification have occurred in plant genomes in addition to the rarer, but continued, movement of individual elements (Lu et al., 2012; Maumus and Quesneville, 2014; SanMiguel et al., 1998). Stress can trigger the activation of TEs and fuel transposition (Baduel et al., 2021; Tittel-Elmer et al., 2010). TEs are thus significant contributors to genetic variation in plant genomes (Baduel et al., 2021; Collier and Largaespada, 2007; Deragon et al., 2008) and have been postulated as drivers of genome evolution and expansion, as well as developmental plasticity and adaptation (Dubin et al., 2018; Lisch, 2013).
There are two main classes of TEs: DNA transposons and retrotransposons. The most abundant DNA transposons are miniature inverted-repeat TEs (MITEs), while the most abundant retrotransposons are long terminal repeat retrotransposons (LTRs) (Dubin et al., 2018). MITEs exhibit characteristic terminal inverted repeats (TIRs) and small direct repeats (target site duplications, TSDs), but lack transposase sequences, making them non-autonomous elements (Fattash et al., 2013; Yang et al., 2009). Many TE-derived inverted repeat (IR) insertions may not be classified as MITEs because they lack the above-mentioned components of MITEs, either due to deletions after insertion or because they were generated through a different process. MITEs are commonly situated near coding genes: for example, almost 60% of rice genes can be associated with a MITE (Lu et al., 2012), with the MITEs often changing the expression of neighboring genes (Lu et al., 2012; Underwood et al., 2022; Wu et al., 2022; Xu et al., 2020; Zhang et al., 2016). Based on this, MITEs have been proposed to play important roles in genome evolution and gene expression (Lu et al., 2012). One key feature of MITEs is that their transcripts can fold into hairpin-shaped dsRNAs due to the extensive sequence complementarity between IR arms. These dsRNA secondary structures are recognized and processed by DCL3 to produce 24-nt siRNAs that trigger DNA methylation without the need for RNAPIV/RDR2 activity (Ariel and Manavella, 2021; Crescente et al., 2022; Cuerda-Gil and Slotkin, 2016; Gagliardi et al., 2019; Sasaki et al., 2014). Thus, transcripts of these MITEs can be initiated from promoters of adjacent genes, triggering their RNA Polymerase II (RNAPII)-dependent DNA methylation.
At the sunflower HaWRKY6 locus, the RNAPII-mediated transcription of a MITE triggers the methylation of its coding region and causes the formation of alternative regulatory short-range chromatin loops in the locus that specifically change its expression (Gagliardi et al., 2019).
Three-dimensional chromatin organization has recently emerged as a critical feature determining genome functionality, fine-tuning gene expression and developmental responses in plants (Domb et al., 2022; Zhang and Wang, 2021). Short-range chromatin loops reflect interaction between relatively close regions of DNA, within a few kb, generally within a single locus or between adjacent loci (Gagliardi and Manavella, 2020). Different from canonical regulation by mC methylation in the linear DNA sequence, commonly associated with gene repression, local three-dimensional chromatin organization can induce a plethora of regulatory mechanisms, including transcriptional activation/repression, transcription directionality, alternative splicing, the usage of cryptic termination sites, impaired or enhanced RNAPII elongation, DNA replication and repair (Gagliardi and Manavella, 2020; Grzechnik et al., 2014; Sotelo-Silveira et al., 2018).
In this study, we show that TE-derived IR elements located near genes in the Arabidopsis thaliana genome can cause rearrangements of the chromatin topology, promoting the formation of short-range chromatin loops. These chromatin interactions, which depend on the production of IR-derived siRNAs and de novo DNA methylation, often translate into changes in gene expression. The presence of an IR and its associated chromatin loop near a gene does not cause a uniform regulatory effect, and can either enhance or repress expression depending on the locus and which regions within the locus are part of the loop. Almost one third of the identified gene-associated IRs are not conserved among a set of 216 A. thaliana natural accessions. Our data show that polymorphisms in IRs near genes can be coupled with a change in the chromatin topology of the region. These accession-related changes in chromatin landscapes controlled by IRs can be linked to alteration in traits commonly associated with adaptation, such as flowering. In proof-of-concept experiments, we used CRISPR/Cas9 genome editing to mimic the situation in natural accessions lacking specific IRs, and demonstrated that the IRs help to shape chromatin topology, which in turn can control molecular and organismal phenotypes. We found that IRs downstream of PHYC and CRY1 cause the formation of repressive chromatin loops associated with well-defined developmental phenotypes of some natural accessions. Overall, our data demonstrate that TE-derived IRs can produce changes in chromatin topology, gene expression, and ultimately phenotypic changes through their capacity to trigger DNA methylation autonomously. 
Given the propensity of TEs to activate during stress responses and the tendency of IRs to locate near coding genes, our finding provides a scenario for these TEs to drive local adaptation and domestication by supporting rapid and sometimes drastic changes in 3D chromatin organization and gene activity after single-set mutational events.
Results
TE-derived IRs located near genes impact the local chromatin topology in Arabidopsis
A TE-derived IR element located ∼600 bp upstream of the HaWRKY6 locus in sunflower serves as an anchor point for the formation of two short-range chromatin loops, and thereby promotes changes in local chromatin topology (Gagliardi et al., 2019). The ultimate outcome is the methylation of the locus, due to 24-nt siRNAs produced after RNAPII-dependent transcription of these IRs. In leaves, this leads to the formation of a repressive loop, while it promotes a second, larger, loop that enhances transcription in cotyledons (Gagliardi et al., 2019).
Because TE-derived IRs are frequently located near genes (Guo et al., 2017), we wondered whether the HaWRKY6 case was just one example of a more general phenomenon in plants. To evaluate whether the insertion of an IR near a gene changes local chromatin topology and gene expression, we first aimed to identify all IRs neighboring protein-coding genes in the A. thaliana Col-0 reference genome. Using einverted from the EMBOSS program suite (Rice et al., 2000), we found a total of 885 IRs in the A. thaliana genome, 634 of them near annotated protein-coding genes (222 of which were located within 500 bp upstream or downstream of a protein-coding gene, a further 163 between 500 bp and 1,000 bp from a gene, and 249 between 1,000 and 3,000 bp from a gene) (Fig. 1A). IRs were found within 500 bp of 260 unique genes, within 500 to 1,000 bp of 215 unique genes, and within 1,000 to 3,000 bp of 615 unique genes (Fig. 1A). These IRs have a broad genome-wide distribution, with many located in the gene-rich chromosome arms, and others in gene-poor/TE-rich pericentromeric regions (Fig. 1B). Analyzing the overlap with annotated transposable elements (TEs) revealed that 68% of the IRs within 3,000 bp of a gene were clearly of TE origin, with most (∼44%) from the MuDR superfamily and 18% from the Helitron superfamily (Fig. 1C). These percentages do not vary with the distance from the IR to the hosting gene, although LTR/Gypsy TEs are moderately enriched among IRs not associated with genes (Fig. 1C). Using the plaNETseq dataset of RNAPII-associated nascent transcripts (Kindgren et al., 2020), we found that more than half of the identified gene-associated IRs are transcribed by RNAPII (Fig. 1D). Supporting the idea that these IRs are transcribed from promoters of nearby genes, the fraction of IRs transcribed by RNAPII increases the closer the IRs are to a gene (Fig. 1D). Conversely, IRs located far from annotated genes were less likely to give rise to RNAPII transcripts. Instead, their transcription likely depends on the canonical RNAPIV/RDR2 RdDM pathway (Fig. 1D).
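The distance windows used above (within 500 bp, 500 to 1,000 bp, and 1,000 to 3,000 bp of a gene) can be illustrated with a minimal sketch. The coordinates, function names, and the choice to measure distance between interval edges are assumptions made for illustration only; the text does not specify how distances were computed, and this is not the pipeline used in the study.

```python
# Upper bounds (bp) of the distance windows described in the text.
# Labels are illustrative.
WINDOWS = [(500, "within 500 bp"),
           (1000, "500-1,000 bp"),
           (3000, "1,000-3,000 bp")]


def gap(ir, gene):
    """Edge-to-edge distance in bp between two (start, end) intervals.

    Returns 0 when the intervals overlap (an assumption of this sketch).
    """
    ir_start, ir_end = ir
    gene_start, gene_end = gene
    return max(0, gene_start - ir_end, ir_start - gene_end)


def classify(ir, genes):
    """Label an IR by the distance window of its nearest gene.

    Returns None when no gene lies within 3,000 bp.
    """
    if not genes:
        return None
    nearest = min(gap(ir, g) for g in genes)
    for upper, label in WINDOWS:
        if nearest <= upper:
            return label
    return None


# Hypothetical coordinates on a single chromosome.
genes = [(1000, 3000), (10000, 12000)]
irs = [(3200, 3400), (4000, 4200), (6000, 6300), (20000, 20200)]
labels = [classify(ir, genes) for ir in irs]

# The three window counts reported in the text add up to the 634
# gene-associated IRs.
assert 222 + 163 + 249 == 634
```

In this toy example the four IRs fall into the three windows in order, and the last one is more than 3,000 bp from any gene.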
With small RNA (sRNA)- and bisulfite (BS)-sequencing, and paralleling what we observed in the case of the HaWRKY6 adjacent IR, we found that most of the identified IRs near genes produce RNAPII transcripts (Fig. 1D), giving rise to 24-nt siRNAs (486/634 IRs within 3,000 bp of genes), and are associated with CHG and CHH methylation (Fig. 1E). We often observed an additional peak of methylation, which could represent a second anchor point for chromatin loop formation, either on the opposite border or inside of many genes located near IRs (Fig. S1), once more providing a scenario similar to that of the HaWRKY6 locus, where two close-by methylated regions served as anchor points for regulatory short-range chromatin loops.
To investigate whether the identified IRs, especially those producing siRNAs and DNA methylation, impact local chromatin organization, we extracted RNA and DNA from Col-0 wild-type plants, triple dcl2, dcl3, dcl4 (dcl234) mutants (which are impaired in 24-nt siRNA production) (Lu et al., 2006), and triple drm1, drm2, cmt3 (ddc) mutants (impaired in CHH methylation) (Kurihara et al., 2008), and performed RNA-seq, sRNA-seq, BS-seq, and Capture-C. As the sequencing depth required to detect short-range chromatin interactions through standard Hi-C would be enormous, we selected 290 loci containing IRs and 40 control loci and performed a Capture-C experiment to focus only on these regions and increase the chances of detecting local chromatin loops. A test mapping of Capture-C reads confirmed the enrichment of the captured regions compared to the input Hi-C samples (Fig. S2A). Collectively, the designed probes cover ∼1% of the genome. Capture-C increased the proportion of reads mapping to the targeted regions 40-fold on average (Fig. S2B).
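As a rough illustration of how such an enrichment can be expressed, the sketch below computes the fold change in the on-target read fraction between a captured library and its input. The read counts are invented for the example and the function names are hypothetical; the numbers are chosen only so that the input fraction matches the ∼1% probe coverage and the result matches a 40-fold enrichment.

```python
def on_target_fraction(on_target_reads, total_reads):
    """Fraction of mapped reads that fall within the captured regions."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return on_target_reads / total_reads


def fold_enrichment(capture_fraction, input_fraction):
    """Ratio of on-target fractions: captured library over input library."""
    if input_fraction <= 0:
        raise ValueError("input_fraction must be positive")
    return capture_fraction / input_fraction


# Invented counts: an input library in which 1% of reads are on target,
# and a captured library in which 40% of reads are on target.
input_frac = on_target_fraction(10_000, 1_000_000)
capture_frac = on_target_fraction(400_000, 1_000_000)
enrichment = fold_enrichment(capture_frac, input_frac)
```

Under these assumed counts, a 40-fold enrichment over a ∼1% baseline would correspond to roughly 40% of reads mapping to the targeted regions; this is a back-of-the-envelope reading, not a figure reported in the study.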
sRNA-seq revealed that 486 out of the 634 IRs within 3,000 bp of genes produce 24-nt siRNAs (Fig. 2A), with siRNA levels severely reduced in both dcl234 and ddc mutants, as expected for this class of small RNAs (Fig. 2A). Both CHG (∼50%) and CHH (∼70%) methylated regions strongly overlapped with IRs within 500 bp of genes (Fig. 1E), and methylation in both contexts is reduced in both dcl234 and ddc for many IRs, as expected from the reduction in siRNAs in these genotypes (Fig. 2B). The proportion of IRs with reduced methylation in the mutants is slightly higher for those within 500 bp of genes compared to the other analyzed distance windows (Fig. S2C). Changes in siRNAs and methylation were correlated, consistent with a drop in siRNAs leading to reduced methylation in RdDM-impaired mutants (Fig. 2C). Altogether, these data suggest that a large proportion of IRs located near genes and transcribed by RNAPII may be able to trigger DNA methylation in cis through the non-canonical RdDM pathway.
As in the case of the HaWRKY6 locus in sunflower, our data suggest that IRs located near genes can act as regulatory elements changing the epigenetic landscape of the region. To assess the impact of IR methylation on the surrounding chromatin organization, we used the software CHESS to compare the structural similarity (SSIM) of the IR-host regions between wild type and mutants (Galan et al., 2020). We also statistically determined the specific anchor points of chromatin loops in each sample using capC-Map (Buckle et al., 2019) in combination with peakC (Geeven et al., 2018). Analyzing individual loops, we found clear alterations in the chromatin topology in several randomly picked loci (Fig. 2D and S3). To compare the differential methylation-related changes in loops formed near IRs, we calculated the SSIM for each captured region plus 10 kb on each end in wild type and siRNA- or methylation-deficient mutants. We chose a global similarity measure, the SSIM, to study short-range chromatin changes, rather than comparing individual loops, in order to increase the power to detect reliable differences, as random interactions become more frequent at shorter distances and raise the methodological background noise. Moreover, other interactions, caused by dimerization of DNA-bound transcription factors or nucleosome packing, can also impact such analyses. Chromatin organization of the IR-containing loci, which should have SSIM values ∼1 if similar between the genotypes, was often changed in ddc and dcl234 mutants (Fig. 2E), indicating that (IR-triggered) methylation has a substantial effect on the local 3D topology. Such differences are less pronounced when comparing ddc with dcl234, as could be expected from the methylation deficiency observed in both, which therefore lack the anchor points for loop formation (Fig. 2E).
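To make the SSIM comparison more concrete, the following sketch computes a single-window (global) structural similarity index between two small contact matrices. It is a simplified illustration under stated assumptions: CHESS has its own implementation and preprocessing, which are not reproduced here, and the matrices, the function name, and the default constants (the conventional k1 = 0.01 and k2 = 0.03 from the general SSIM definition) are choices made for this example.

```python
def ssim_global(a, b, data_range=1.0, k1=0.01, k2=0.03):
    """Global SSIM between two equally sized matrices (lists of rows).

    SSIM = ((2*mx*my + c1) * (2*cov + c2)) /
           ((mx^2 + my^2 + c1) * (vx + vy + c2))
    Identical inputs give 1; values decrease as the matrices diverge.
    """
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    if len(x) != len(y) or not x:
        raise ValueError("matrices must be non-empty and of equal size")
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((v - mx) * (v - mx) for v in x) / n
    vy = sum((v - my) * (v - my) for v in y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / n
    c1 = (k1 * data_range) * (k1 * data_range)
    c2 = (k2 * data_range) * (k2 * data_range)
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator


# Toy normalized contact matrices (invented values). The "wild type"
# has a strong off-diagonal contact, mimicking a loop between bins 0 and 2;
# the "mutant" lacks it.
wild_type = [[1.0, 0.2, 0.8],
             [0.2, 1.0, 0.2],
             [0.8, 0.2, 1.0]]
mutant = [[1.0, 0.2, 0.1],
          [0.2, 1.0, 0.2],
          [0.1, 0.2, 1.0]]

same = ssim_global(wild_type, wild_type)
changed = ssim_global(wild_type, mutant)
```

Here `same` is 1 and `changed` is lower, which is the pattern the text describes for regions whose topology differs between genotypes.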
Changes in chromatin topology can affect gene expression (Domb et al., 2022). Given the alterations in chromatin topology that we found, we wondered whether they impact gene expression. Formation of chromatin loops may repress or activate gene transcription, or even trigger production of alternative transcripts (Gagliardi and Manavella, 2020).
We found 4,305 differentially expressed genes (DEGs) in the ddc mutant in comparison with Col-0, and 3,636 DEGs in dcl234, many of them overlapping between the two mutants (Fig. S4A). When we compared the SSIMs in regions with DEGs with SSIMs in regions without DEGs we found topological differences (lower SSIM) to be significantly increased in regions that included DEGs (Fig. 2F). We then split those differences into regions linked to differentially and not differentially methylated IRs, which revealed greater topological differences in ddc mutants for regions both including DEGs and differentially methylated IRs (Fig. S4B). Methylation seemed to affect the SSIM correlation with DEGs less clearly in dcl234 (Fig. S4C). Loci with altered topology in ddc and dcl234 included both up- and downregulated genes (Fig. 2F). Thus, changes in chromatin topology caused by an impaired RdDM machinery could be linked to opposite changes in gene expression depending on the loci. This observation, also detected in rice for genes adjacent to MITEs (Lu et al., 2012), fits with short-range chromatin loops affecting RNAPII activity in different ways, depending on which part of the gene is included in the loop (Gagliardi and Manavella, 2020).
In summary, our data indicated that the insertion of an IR near a gene can trigger changes in the local chromatin topology that ultimately affect gene expression in a locus- and loop-dependent way.
The absence of IRs near genes between Arabidopsis accessions causes natural variation in the local chromatin topology
TEs are commonly silenced in plants as a means of protecting genome integrity. However, under extreme stress conditions TE transcription can be reactivated, giving TEs the potential to jump in the genome in a process that has been proposed to help adaptation to new environments (Ito et al., 2011; Tittel-Elmer et al., 2010). We wondered whether the IRs we were studying could have adaptative potential to change the structure of the RNAs produced by a locus or their expression, through changes in the 3D chromatin organization.
To test this hypothesis, we first used published TE polymorphism datasets (Stuart et al., 2016) to detect variations in the TE content of 216 A. thaliana genomes (Schmitz et al., 2013). Narrowing this down further to the 634 IRs located within 3,000 bp upstream or downstream of annotated genes in the Col-0 reference genome, we found 193 of these 634 IRs to be variable between the 216 accessions studied (Fig. 3A). IRs without a clear TE origin are underrepresented in the collection of polymorphic IRs (Fig. 3A). These 193 IRs are located within 3,000 bp of 368 annotated genes. This implies that each IR can influence two or more adjacent genes in many cases, an observation that is not surprising given the compact nature of A. thaliana genomes. To investigate the effect that the variation in the IR content may have on local chromatin topology and gene expression, we repeated the Capture-C, RNA-seq, BS-seq, and sRNA-seq analyses in the Ba-1 and Hod accessions. We selected these two accessions based on the number of variable IRs present in each of them (46 and 31 IRs present in Col-0 are missing in Hod and Ba-1, respectively, with 18 missing in both accessions). With the Capture-C experiment, we captured 290 regions comprising the genomic sequences around variable IRs, which included adjacent genes. The analysis of individual loci revealed that the presence of an IR at a locus correlated with the accumulation of 24-nt siRNAs and with distinctive short-range chromatin loops (Fig. 3B and S3).
We chose five potential candidate genes (PHV, PHYC, P5CS1, PHR1, CRY1) to evaluate the effect of polymorphic IRs on loop formation and gene expression. We corroborated by Sanger sequencing that each of the loci has IRs in Col-0, but not in the indicated accessions (Fig. 4B, 4G, and S5). We then used Chromatin Conformation Capture (3C) followed by qPCR to confirm and quantify the formation of IR-dependent chromatin loops at these loci, and RT-qPCR to measure correlation with gene expression in each locus (Fig. 3C, 4C, and 4I). We detected alternative chromatin loops at PHV, P5CS1, and PHR1 (Fig. 3C). In the case of PHV, loop 1 appeared to form independently of the associated IR, but loops 2 and 3 were missing in Ba-1, the accession without the IR, and in dcl234 mutants (Fig. 3C). For P5CS1, loop 2 appeared independently of the presence of the IR, while the formation of an intragenic loop 1 correlated with the presence of the IR and a functional RdDM machinery (Fig. 3C). For both PHV and P5CS1, the absence of a loop-triggering IR appeared to be associated with enhanced gene expression (Fig. 3C). In the case of PHR1, we found that the formation of loop 1 depends on the presence of the IR, which is missing in the Hod accession, but loop 2 was only formed when the IR is missing, probably reflecting a hierarchy of chromatin interactions controlled by the IR (Fig. 3C). Contrary to the results with PHV and P5CS1, the absence of the IR near PHR1 caused repression of the gene.
When we analyzed the genome-wide effects of insertional IR polymorphisms on genome topology using CHESS, we also found a drastic change in chromatin folding associated with the presence/absence of an IR (Fig. 3D). For this analysis we calculated the difference between the SSIMs obtained from the comparison of Ba-1 and Hod genome topology against Col-0, as recommended by the authors of CHESS, to have a common reference to score differences between these accessions (Galan et al., 2020). As can be observed in Fig. 3D, the differences in SSIMs are greater when Ba-1 shares the IR with Col-0, and thus has a higher SSIM, while the IR is absent in Hod, resulting in a lower SSIM and an increased positive difference.
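The sign convention of this comparison can be sketched as follows. The SSIM values and region names below are invented placeholders, and the function name is hypothetical; the only point of the example is that the difference is taken as SSIM(Ba-1 vs Col-0) minus SSIM(Hod vs Col-0), so a region where Ba-1 resembles Col-0 and Hod does not yields a positive value.

```python
def ssim_difference(ssim_vs_reference):
    """Per-region difference SSIM(Ba-1 vs Col-0) - SSIM(Hod vs Col-0).

    `ssim_vs_reference` maps a region name to a pair
    (ssim_ba1_vs_col0, ssim_hod_vs_col0).
    """
    return {region: ba1 - hod
            for region, (ba1, hod) in ssim_vs_reference.items()}


# Invented example values, not data from the study.
example = {
    # IR shared by Col-0 and Ba-1 but absent in Hod: Ba-1 resembles Col-0.
    "region_IR_missing_in_Hod": (0.9, 0.5),
    # IR present (or absent) in both accessions: similar SSIMs.
    "region_no_polymorphism": (0.85, 0.8),
}
differences = ssim_difference(example)
```

With these placeholder values the first region gives a clearly positive difference and the second a difference close to zero.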
These data suggest that the insertion of an IR near coding genes can have an impact on local chromatin topology that affects gene expression. Aiming to explore whether this could potentially have adaptive consequences, we asked whether there is a correlation between the presence/absence of an IR near a gene and well-defined phenotypic traits recorded in the Arapheno database (Seren et al., 2017). This analysis revealed an association between IR insertional polymorphisms near some genes with a given phenotype related to flowering, a typically adaptative trait (Fig. 3E). This observation suggests that insertion of an IR near a gene can not only impact the local chromatin topology, but that it may also have adaptive implications by changing the phenotypes controlled by the adjacent gene.
The insertion of an IR near genes causes strong phenotypic effects
Our data suggest that the insertion of an IR near a gene can be a significant event during adaptative evolution. However, this association may not be caused by the IR itself, but by adjacent polymorphisms that may be in linkage disequilibrium with the IR. In addition, even if the association is with the IR itself, it may not involve changes in the genome topology. To study this possibility we selected two loci, PHYC and CRY1, which display natural variation in the presence/absence of a close-by IR that is associated with developmental phenotypes (Fig. 3E). Both PHYC and CRY1 associated IRs are located downstream of the transcription termination site (TTS) (Fig. 4A and 4F), thus making it less likely that they act directly on the promoter of the genes as general regulatory elements.
PHYC encodes a photoreceptor capable of sensing red light (R) and far-red light (FR) and is implicated in several developmental transitions, such as flowering, seed germination and hypocotyl elongation (Chen et al., 2014; Kippes et al., 2020; Li et al., 2020; Nishida et al., 2013). In Col-0 an IR is located ∼500 bp downstream of the TTS of PHYC, while it is missing in several natural accessions, including, for example, Ey15-2 (Fig. 4A and 4B). To test whether this IR impacts PHYC expression and related phenotypes, we used a CRISPR/Cas9 strategy to delete a fragment of the IR, which disrupts dsRNA formation without completely removing the IR sequence (Fig. 4A). Three homozygous lines were obtained with a deletion of the IR fragment (Fig. 4B). Supporting a role for this IR in modulating genome topology and gene expression, we detected a chromatin loop encompassing the entire PHYC gene in Col-0 plants that was absent in the CRISPR mutant lines and in the Ey15-2 accession, which lacks the IR (Fig. 4C). The absence of the IR, both in Ey15-2 and in the CRISPR lines, correlated with higher expression of PHYC, indicating that the IR promotes the formation of a repressive loop in Col-0 (Fig. 4C). Both the CRISPR lines and Ey15-2 had altered developmental responses related to known PHYC functions, including delayed flowering and shortened hypocotyls under a continuous red-light treatment (Fig. 4D and 4E). Altogether, our data indicate that the presence of the IR next to PHYC has a substantial impact on gene regulation, by enabling the formation of a short-range chromatin loop that represses gene expression, thereby contributing to the differential response of natural accessions to light signals.
CRY1 is a blue light receptor and participates predominantly in the regulation of blue-light inhibition of hypocotyl elongation and anthocyanin production (Ahmad et al., 1998; Ahmad et al., 1995; Liu et al., 2022). Two IRs are located near the gene, ∼2,000 bp upstream of the transcription start site (TSS) and ∼550 bp downstream of the TTS (Fig. 4F). The second IR is variable among A. thaliana accessions, missing, for example, in Hod (Fig. 4G). Using CRISPR/Cas9 engineering, heterozygous lines could be obtained missing a fragment of the IR located downstream of the TTS (with homozygous lines apparently not being viable) (Fig. 4H). Similar to PHYC, we detected a chromatin loop that brings together the borders of the CRY1 gene only in those plants containing the IR and that represses the expression of the gene (Fig. 4I). The absence of the IR, both in the CRISPR lines and in the natural accession Hod, correlated with higher expression levels of CRY1 and shorter hypocotyls under blue light (Fig. 4I and 4J), a phenotype previously described in CRY1 overexpressing lines (He et al., 2019; Liu et al., 2022).
Discussion
How organisms can adapt to a rapidly changing environment is one of the most interesting questions in evolutionary biology (Barrett and Schluter, 2008; Hermisson and Pennings, 2005). SNPs have been the main focus of many genomic studies aiming to assess the evolutionary potential of mutations, but TEs can be particularly powerful actors in rapid adaptation, as single transposition events can have potentially wide-ranging consequences on gene expression and derived phenotypes (Baduel et al., 2021). On one side, the broad distribution of TEs across the genome facilitates the generation of chromosomal rearrangements through ectopic recombination. Even more significantly, the mobilization of TEs can disrupt, modify or even change the expression of genes in various ways that could generate a favorable adaptative trait (Dubin et al., 2018; Friedli and Trono, 2015). New alleles caused by TE insertions have been proposed to guarantee a consistent supply of potentially adaptable variants in response to the environment (Baduel et al., 2021). Still, many TEs inserted within gene-rich areas are quickly purged, according to population genomic surveys of TE polymorphisms (Quadrana et al., 2016). This is in agreement with transposition tending to produce alleles with negative effects.
Contrary to autonomous TEs, MITEs were found to be distributed on chromosome arms in plants, highly associated with genes, and frequently transcribed with adjacent genes (Kuang et al., 2009; Lu et al., 2012; Oki et al., 2008). In agreement with this, our study identified numerous TE-derived IRs near coding genes in A. thaliana. Different from rice, where more than half of the genes are associated with MITEs (Lu et al., 2012), fewer genes are associated with IRs in A. thaliana. This observation is not surprising as Arabidopsis is an outlier regarding TE content among plants, with only 15% of its genome represented by TEs. In comparison, they account for 85% of maize and up to ∼40% of rice genomes (Arabidopsis Genome, 2000; Li et al., 2017; Schnable et al., 2009).
The association between MITEs and protein coding genes suggested that these TEs may play essential roles in genome evolution. Current evidence suggests that siRNA-triggered TE methylation tends to cause the repression of neighboring genes (Hollister and Gaut, 2009; Hollister et al., 2011). However, in the case of MITEs, both positive and negative effects on the expression of host genes have been reported (Gagliardi et al., 2019; Underwood et al., 2022; Wu et al., 2022; Xu et al., 2020; Zhang et al., 2016). The weak correlation between methylation of MITEs and the expression of adjacent genes can now be better explained in the light of our findings showing that IRs located near genes affect the local chromatin organization. While methylation of DNA on its own is expected to have primarily repressive effects, short-range chromatin loops could produce many different outcomes, including transcriptional repression/activation and production of alternative mRNAs (Gagliardi and Manavella, 2020).
MITEs have the potential to transpose into various locations in the genome, resulting in presence/absence (insertional) polymorphisms between genotypes (Lu et al., 2012; Lyons et al., 2008). Such polymorphisms can be caused by the insertion or excision of a MITE from a locus. However, it is unknown which of these scenarios contributes more to the genetic variation within a species. Here, we show that ∼30% of the IRs located within 3,000 bp of protein-coding genes show insertional polymorphism between 216 A. thaliana natural accessions. Our results also indicate that such natural variation of the gene-associated IR content can cause changes in chromatin organization that could be considered “3D polymorphisms”. Our experiments, using not only association between IR polymorphisms and gene expression changes but also CRISPR/Cas9 editing to demonstrate a causal relationship in several cases, show that the insertion of an IR near a gene can have a profound impact on chromatin organization, gene expression, and associated phenotypes. This phenomenon can potentially boost a plant’s capacity to adapt to a rapidly changing environment. Because IRs can co-opt the promoters of adjacent genes to produce RNAPII-derived siRNAs through a stem-loop dsRNA intermediate, they can act as autonomous regulatory elements, drastically changing the chromatin landscape of a locus upon insertion. These characteristics turn IRs into powerful elements during adaptative evolution.
Modelling has suggested faster generation of large-effect alleles due to larger transposition rates in specific populations in response to global warming (Baduel et al., 2021). Consistent with this scenario, and in a world of a rapidly changing climate, the discovery of IRs as elements shaping the 3D chromatin organization and driving genome adaptation is of great interest. The manipulation of IRs, and in consequence, genome topology, can potentially become a powerful biotechnological tool to improve crop adaptation without the need to incorporate exogenous DNA or alter coding sequences in plants.
Material and methods
Plant material and growth conditions
Arabidopsis thaliana accessions Col-0, Hod, Ba-1, and Ey15-2, the dcl234 (Lu et al., 2006) and ddc (Kurihara et al., 2008) mutants, and CRISPR mutant lines were grown at 23 °C in long days (16/8 hours light/dark). Blue and red-light experiments were carried out as follows: seeds were sown on petri dishes with humidified filter paper and stratified in the dark for 5 days, then transferred to white light (80 μmol m⁻² s⁻¹) at 23 °C for 3 hours and subsequently transferred to red (20 μmol m⁻² s⁻¹) or blue light (5 μmol m⁻² s⁻¹) in LED chambers. Hypocotyl measurements were taken at four days using ImageJ (http://rsb.info.nih.gov/ij).
To obtain plants with genomic fragments deletions a CRISPR/Cas9 vector toolbox (Wu et al., 2018) was used. Specific sgRNAs, described in Supplemental Table S1, were designed to obtain the IRs deletions. Col-0 plants were transformed using the floral dip method, T1 plants were selected based on the presence of red fluorescence in seeds under a fluorescent dissecting microscope (Leica, Solms, Germany), non-fluorescent T2 seeds missing the transgenes were grown and genotyped by PCR to identify effective deletions.
For genomic DNA extraction 100 mg of fresh plant material was ground in 700 µl of extraction buffer (200 mM Tris-HCl pH8, 25 0mM EDTA, 0.5% SDS) and precipitated with isopropanol. PCR was performed using primers detailed in Table S1. RNA was extracted using TRIzol™ Reagent (Invitrogen) following manufacturer’s recommendations.
The presence of the IR in different A. thaliana accessions was determined by amplification of the IR region with flanking primers using Q5 polymerase, followed by Sanger sequencing. For the deletion of the IRs neighboring PHYC and CRY1, pairs of sgRNAs targeting each locus were cloned in a CRISPR/Cas9 super module (SM) vector as described (Wu et al., 2018). Briefly, sgRNAs targeting flanking regions of the IR were designed using the CRISPR-P Web Tool (Lei et al., 2014). Each sgRNA was introduced into the shuffle vectors by overlap PCR with Q5 High-Fidelity polymerase, followed by digestion of the original vector with DpnI (Thermo Scientific). A destination vector harboring the UBQ10 promoter, pcoCas9, proLacZ:LacZ between both sgRNAs targeting each IR, and At2S3:mCherry for fluorescence selection in seeds was generated with the GreenGate assembly system. Destination vectors were transformed into Col-0 plants, and red fluorescence-positive seeds were isolated as hemizygous seeds. Transgene-free T2 offspring without seed fluorescence were chosen, and plants were tested by PCR and Sanger sequencing to identify IR deletion lines. All primers used are listed in Table S1.
IR detection
Inverted repeats were identified in the Col-0 Arabidopsis thaliana genome using einverted, from the EMBOSS program suite (Rice et al., 2000). The parameters used were: a maximum repeat of 1000, a gap penalty of 8, a minimum score threshold of 150, a match score of 3, and a mismatch score of -4.
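As an illustration only, the einverted parameters listed above could be assembled into a command line as in the following Python sketch. The input and output file names are hypothetical placeholders, and the mapping of each reported value to an einverted qualifier (-maxrepeat, -gap, -threshold, -match, -mismatch) is our reading of the parameter list rather than the exact invocation used in this study.

```python
import shlex


def einverted_command(genome_fasta, out_align, out_fasta):
    """Build an einverted argument list using the parameter values reported in the text.

    File names are supplied by the caller; the ones used below are placeholders.
    """
    return [
        "einverted",
        "-sequence", genome_fasta,   # genome to scan (hypothetical path)
        "-gap", "8",                 # gap penalty
        "-threshold", "150",         # minimum score threshold
        "-match", "3",               # match score
        "-mismatch", "-4",           # mismatch score
        "-maxrepeat", "1000",        # maximum repeat
        "-outfile", out_align,       # alignment report (hypothetical path)
        "-outseq", out_fasta,        # repeat sequences (hypothetical path)
    ]


cmd = einverted_command("Col0_genome.fa", "irs.inv", "irs.fa")
# Print a shell-ready command; it could also be passed to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if a file path contains spaces.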
RNA-seq
RNA-seq library preparation was performed as described (Cambiagno et al., 2021). An in-house scaled-down version of Illumina’s TruSeq reaction was used. mRNA was purified with the NEBNext Poly(A) Magnetic Isolation Module (New England Biolabs, Ipswich, MA) and heat fragmented with Elute-Prime-Fragment buffer (5x first-strand buffer, 50 ng/ml random primers). For first- and second-strand synthesis, SuperScript II Reverse Transcriptase (Thermo Fisher) and DNA polymerase I (NEB) were used, respectively. T4 DNA polymerase, Klenow DNA polymerase, T4 polynucleotide kinase (NEB), and Klenow Fragment (3′→5′ exo-) (NEB) were used for end repair and A-tailing. Ligation of universal adapters compatible with Nextera barcodes i7 and i5 was performed with T4 DNA Ligase (NEB), and Q5 Polymerase (NEB) was used for PCR enrichment using Nextera i7 and i5 barcodes. SPRI beads were used for DNA purification in each step and for size selection of the library preps. Paired-end reads (2 × 150 bp) were generated on the Illumina HiSeq 3000 platform.
The analysis started by quality trimming and filtering the raw reads with Trimmomatic (version 0.36; Bolger et al., 2014). Reads were then aligned to the Arabidopsis thaliana genome (TAIR10) using STAR (version 2.5.2b; Dobin et al., 2013), guided by the gene and exon annotation from Araport V11 201606 (Pasha et al., 2020). Samtools (version 1.; Li et al., 2009) was then used to keep only primary alignments with a minimum MAPQ of 3. Read quality before and after trimming was analyzed with FastQC (version 0.11.5; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and, together with mapping efficiency, summarized with MultiQC (version 1.7; Ewels et al., 2016). Read counts on each gene were then calculated with featureCounts (version 1.6.2; Liao et al., 2014). This pipeline was run with the aid of the Snakemake workflow engine (Köster and Rahmann, 2012). Gene counts were used to identify differentially expressed genes with DESeq2 (Love et al., 2014; R Core Team, 2022), filtering out genes with counts below 10 in all samples.
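The low-count filter applied before differential expression testing can be sketched as follows. This is a minimal Python illustration, not the R code used in the study; the function name and the toy gene identifiers and counts are hypothetical.

```python
def filter_low_count_genes(counts, min_count=10):
    """Keep genes that reach `min_count` reads in at least one sample.

    `counts` maps a gene identifier to its list of per-sample read counts.
    A gene is dropped only when every sample is below the threshold.
    """
    return {
        gene: per_sample
        for gene, per_sample in counts.items()
        if any(c >= min_count for c in per_sample)
    }


# Toy count table (hypothetical genes and values).
toy_counts = {
    "geneA": [0, 3, 9],      # below 10 in all samples: removed
    "geneB": [2, 10, 4],     # reaches 10 in one sample: kept
    "geneC": [150, 200, 180],
}
kept = filter_low_count_genes(toy_counts)
print(sorted(kept))
```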
sRNA-seq
For sRNA-seq library preparation, 1 µg of total RNA was used as input for the TruSeq sRNA Library Preparation kit (Illumina) as described in the TruSeq RNA Sample Preparation v2 Guide (Illumina). BluePippin System (Sage Science) was used for sRNA library size selections. Sequencing was performed on the Illumina HiSeq 3000 platform.
The small RNA reads generated were first trimmed to remove 3’ adapters using cutadapt (version 1.9.1), and their quality was checked using FastQC (version 0.11.4; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and MultiQC (Ewels et al., 2016). They were then mapped with bowtie (version 1.1.2; Langmead et al., 2009) to A. thaliana rRNA, tRNA, snoRNA and snRNA from RFAM (version 14.1; Kalvari et al., 2018). Unmapped reads were then mapped, also with bowtie, to the A. thaliana genome. Statistical analyses were performed in the R statistical programming environment (R Core Team, 2022) and graphics were produced with the ggplot2 package.
IRs with 10 or more 24-nucleotide reads across their entire region were considered to have potential sRNA production. Changes in sRNA levels were calculated by first computing, for each library, the reads per million (RPM) reads mapped to the genome, then averaging this value across replicates for each IR, and finally calculating the log2 fold change between the RPM in the mutant and in Col-0.
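The sRNA calculation described above can be illustrated with a short Python sketch. The function names and the toy read counts are hypothetical, and the handling of IRs with zero reads in either genotype (for example, a pseudocount) is not specified in the text and is therefore left out.

```python
import math


def has_srna_potential(reads_24nt, min_reads=10):
    """An IR qualifies when it has 10 or more 24-nt reads across its region."""
    return reads_24nt >= min_reads


def mean_rpm(ir_reads, mapped_reads):
    """Average reads per million over replicates.

    `ir_reads[i]` is the number of reads on the IR in replicate i and
    `mapped_reads[i]` the number of genome-mapped reads in that library.
    """
    rpms = [r * 1_000_000 / n for r, n in zip(ir_reads, mapped_reads)]
    return sum(rpms) / len(rpms)


def log2_fold_change(mutant_rpm, col0_rpm):
    """log2 of the mutant/Col-0 ratio; both values must be greater than zero."""
    return math.log2(mutant_rpm / col0_rpm)


# Toy example with two replicates per genotype (hypothetical numbers).
col0 = mean_rpm([40, 60], [10_000_000, 10_000_000])    # RPM 4 and 6, mean 5
mutant = mean_rpm([10, 15], [10_000_000, 10_000_000])  # RPM 1 and 1.5, mean 1.25
print(log2_fold_change(mutant, col0))  # -2.0
```

A negative value indicates fewer sRNA reads on the IR in the mutant than in Col-0.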
Bisulfite treatment of DNA and library preparation
For BS-seq, DNA extraction was performed with the DNeasy Plant Mini Kit (QIAGEN). DNA was sheared to 350 bp by Covaris ultrasonication. Libraries were generated with the Illumina TruSeq DNA Nano Kit. After adaptor ligation, libraries were bisulfite converted with the EpiTect Plus DNA Bisulfite Conversion Kit (QIAGEN). Library enrichment was done using KAPA HiFi Uracil+ DNA polymerase (Kapa Biosystems, USA). Paired-end reads (2 × 150 bp) were generated on the HiSeq 3000 platform (Illumina).
The analysis of these reads started by quality trimming and filtering them with Trimmomatic (version 0.36; Bolger et al., 2014). We then used the Bismark program (Krueger and Andrews, 2011) to map the reads to the A. thaliana Col-0 genome (internally done with Bowtie2; Langmead and Salzberg, 2012), deduplicate the alignments, and extract the methylation results in the three contexts: CG, CHG and CHH. This output was then analyzed in R (R Core Team, 2022) with the methylKit package (Akalin et al., 2012). Only cytosines covered by at least 4 reads were considered. Each sample was segmented with methSeg, and methylation levels were calculated for segments including at least 4 Cs. For Col-0, segments were collapsed across replicates using the mergeGRangesData function from the BRGenomics package (https://rdrr.io/bioc/BRGenomics/), and IRs with repeats overlapping segments with 10% or more CHG or CHH methylation were considered methylated.
Differential methylation in the ddc and dcl234 mutants was also calculated with the methylKit package. First, replicates were combined with the unite function, and then differential methylation was calculated with the calculateDiffMeth function, correcting for overdispersion with the MN method and using a q-value threshold of 0.1 and a differential threshold of 15%. IRs with repeats overlapping any of these differential segments were then considered differentially methylated.
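The rule used to call an IR methylated can be sketched in Python as below. This is only an illustration of the overlap-plus-threshold logic, not the R/methylKit code used in the study; the data layout (dictionaries with start, end, CHG and CHH keys), the closed-interval convention and the toy coordinates are assumptions.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True when two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def ir_is_methylated(repeat_arms, segments, min_percent=10.0):
    """Call an IR methylated when any repeat arm overlaps a segment whose
    CHG or CHH methylation is at or above `min_percent`.

    `repeat_arms` is a list of (start, end) tuples for the two arms of the IR.
    `segments` is a list of dicts with keys "start", "end", "CHG", "CHH"
    (methylation given as percentages).
    """
    for seg in segments:
        if max(seg["CHG"], seg["CHH"]) < min_percent:
            continue  # segment is below the threshold in both contexts
        for arm_start, arm_end in repeat_arms:
            if intervals_overlap(arm_start, arm_end, seg["start"], seg["end"]):
                return True
    return False


# Toy example (hypothetical coordinates and methylation levels).
arms = [(1000, 1200), (1500, 1700)]
segments = [
    {"start": 900, "end": 1100, "CHG": 2.0, "CHH": 1.0},    # overlaps, too low
    {"start": 1650, "end": 1900, "CHG": 4.0, "CHH": 12.5},  # overlaps, CHH >= 10%
]
print(ir_is_methylated(arms, segments))
```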
Capture-C assay
For Capture-C, Hi-C was performed as described (Liu, 2017). Briefly, we collected 1.5 g of plant tissue and fixed it with 1% formaldehyde. Nuclei were isolated and finally washed with NEB buffer #3. Nuclei were permeabilized by resuspending the pellet in 150 μl of 0.5% SDS and incubating at 62 °C for 5 min. After that, 435 μl of water and 75 μl of 10% Triton X-100 were added, and the samples were incubated at 37 °C for 15 min. NEB buffer #3 was added to 1X, together with 50 U of DpnII, to digest the chromatin overnight at 37 °C. Cohesive ends were filled in by incubating the digested chromatin with 10 U Klenow, dTTP, dATP, dGTP, and biotin-14-dCTP at 37 °C for 2 h. Blunt-end ligation of chromatin was performed by adding blunt-end ligation buffer to 1X and 20 U of T4 DNA ligase at room temperature for 4 h. Nuclei were lysed with SDS buffer (50 mM Tris-HCl, 1% SDS, 10 mM EDTA, pH 8.0) and incubated with 10 μg proteinase K at 55 °C for 30 min. To reverse the crosslinking, NaCl was added to reach 0.2 M and the samples were incubated at 65 °C overnight. Hi-C DNA was purified by the phenol-chloroform-IAA method and treated with RNase A. Hi-C DNA was sheared to 500 bp with a Covaris E220 sonicator. DNA was purified and size selected (longer than 300 bp) using Ampure beads. Unligated biotin was removed in a reaction with 0.1 mM dTTP, 0.1 mM dATP and 5 U T4 DNA polymerase incubated at 20 °C for 30 min. DNA was purified with Ampure beads, and end repair and adaptor ligation were performed with the NEBNext® Ultra™ II DNA Library Prep Kit following the manufacturer’s instructions. Biotin affinity purification was then performed using Dynabeads MyOne Streptavidin C1 beads (Invitrogen). Library amplification was done with Ultra II Q5 Master Mix with universal and selected index primers.
For the Capture step, hybridization capture was performed with the MyBaits system (Arbor Biosciences) following the manufacturer’s instructions. Baits of 80 nucleotides were designed on each end of the digestion fragments corresponding to the captured regions. These regions included all genes within 3000 bp of the IR and the spacer region up to the IR, excluding it. When a region was surrounded by 2 IRs, it was considered a single captured region.
Finally, Capture-C DNA was paired-end sequenced (2 × 150 bp reads) on an Illumina HiSeq 3000. The resulting reads were processed with capC-MAP (Buckle et al., 2019), which performs the in silico genome digestion, read alignment, and pile-up of interactions, and can generate normalized, binned and smoothed interaction profiles for each target. For this, the Col-0 genome was used with an exclusion zone of 500 bp, a bin size of 500 bp and a step of 250 bp. The results were then processed with the R package peakC (Geeven et al., 2018) to determine statistically significant interactions. These loops were visualized with the R package Gviz (Hahne and Ivanek, 2016).
For the SSIM calculation, raw pileups were first normalized with FAN-C (Kruse et al., 2020), using the VC-SQRT method on 1000 bp bins. The SSIM value was obtained on a region comprising the captured region extended by 10 kb on both sides, using a relative window size of 0.1.
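To make the roles of the exclusion zone, bin size and step concrete, the following Python sketch bins a toy interaction pile-up with overlapping windows. It is a generic illustration under our own assumptions (window placement starting at the region start, a single target coordinate, interactions at or within the exclusion distance dropped) and is not the capC-MAP implementation; normalization and smoothing are omitted.

```python
def binned_profile(pileup, region_start, region_end, target_pos,
                   bin_size=500, step=250, exclusion=500):
    """Sum interaction counts in overlapping windows along a region.

    `pileup` is a list of (position, count) pairs for one target.
    Positions at or within `exclusion` bp of `target_pos` are discarded.
    Windows are half-open, [start, start + bin_size), and advance by `step`.
    """
    kept = [(pos, n) for pos, n in pileup if abs(pos - target_pos) > exclusion]
    profile = []
    start = region_start
    while start + bin_size <= region_end:
        end = start + bin_size
        total = sum(n for pos, n in kept if start <= pos < end)
        profile.append((start, end, total))
        start += step
    return profile


# Toy pile-up (hypothetical positions and counts) for a target at 5000 bp.
toy_pileup = [(100, 2), (300, 3), (600, 5), (5000, 7)]
profile = binned_profile(toy_pileup, 0, 1000, target_pos=5000)
for window in profile:
    print(window)
```

Because the step is half the bin size, each position contributes to two consecutive windows, which smooths the profile relative to non-overlapping bins.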
Data processing, plotting and statistical analysis
Data obtained in the different analyses of the sequencing experiments were further processed and statistically analyzed in R (R Core Team, 2022) using a variety of packages. Genomic information was handled using the GenomicRanges (Lawrence et al., 2013), Biostrings (https://bioconductor.org/packages/Biostrings), GenomicInteractions (Harmston et al., 2015) and rtracklayer (Lawrence et al., 2009) packages.
Plots summarizing information were mostly produced with the ggplot2 and ggpubr packages. Plots of genomic regions were produced with the Gviz package (Hahne and Ivanek, 2016). Circular plots were generated with circlize (Gu et al., 2014).
Data availability
All sequencing data were deposited at the European Nucleotide Archive (https://www.ebi.ac.uk/ena/) public repository with accession PRJEB53956.
3C assay and RT-PCR
The 3C assay was performed as described (Gagliardi et al., 2019). For detection of loops at PHV and PHYC, overnight digestion with EcoRI (NEB) was performed; for P5CS1, CRY1, and PHR1, XbaI (NEB) was used. For DNA ligation, 100 U of highly concentrated T4 DNA ligase (Thermo) were used at 22 °C for 5 h in a 4 mL volume. Reverse crosslinking and proteinase K treatment (Invitrogen) were performed, and the phenol/chloroform method was used for DNA purification. For interaction frequency measurement, qPCR was performed using ACTIN2 as the housekeeping gene. All primers used are listed in Table S1.
For quantitative RT-PCR, 1 μg of total RNA was used for reverse transcription reactions using the RevertAid RT Reverse Transcription Kit (Thermo Fisher Scientific). qPCR was performed using SYBR green (Thermo Scientific Maxima SYBR Green qPCR Master Mix (2x)). Three biological replicates were used to calculate the standard error of the mean (SEM), which was obtained by propagation of error of the 2^-ΔΔCt values and is shown in figures as two times the SEM. Statistical significance was tested using a two-tailed, unpaired Student’s t-test. All primers used are listed in Table S1.
Author contribution
A.L.A. performed the majority of the analyses. D.A.C. prepared the libraries for sequencing and Capture-C experiments. P.L.L. and H.A.B. helped with the design of the Capture-C probes. P.L.L. performed the Capture experiment. R.M. and D.A.C. validated the formation of chromatin loops and created the CRISPR/Cas9 mutant lines. R.M. characterized the CRISPR/Cas9 mutant lines and performed validation experiments. A.L.A., D.A.C., P.A.M., and D.W. conceived this study; P.A.M. and D.W. supervised the work and secured project funding; A.L.A., R.M., D.W., and P.A.M. wrote the manuscript.
Footnotes
The authors declare no competing interest.
Acknowledgments
This work was supported by grants from ANPCyT (Agencia Nacional de Promoción Científica y Tecnológica, Argentina) and Universidad Nacional del Litoral (UNL) to P.A.M. and the Max Planck Society to D.W. P.A.M., A.L.A. and D.A.C. are members of CONICET; R.M. is a fellow of the same institution. We thank the Deutscher Akademischer Austauschdienst (DAAD) and the Company of Biologists for short-term fellowships to D.A.C. and R.M., respectively.