Abstract
Transposable elements (TE) are repetitive DNA sequences that can create variability in genome structure and regulation. The genome of Rhizophagus irregularis, a widely studied arbuscular mycorrhizal fungus (AMF), comprises approximately 50% repetitive sequences, which include TE. Despite their abundance, two-thirds of TE remain unclassified, and their regulation among AMF life-stages remains unknown. Here, we aimed to improve our understanding of TE diversity and regulation in this model species by curating repeat datasets obtained from chromosome-level assemblies and by investigating their expression across multiple conditions. Our analyses uncovered new TE superfamilies and families in this model symbiont and revealed significant differences in how these sequences evolve both within and between R. irregularis strains. With this curated TE annotation, we found that the number of upregulated TE families in colonized roots is four times higher than in extraradical mycelium, and that their overall expression differs depending on the plant host. This work provides a fine-scale view of TE diversity and evolution in model plant symbionts and highlights their transcriptional dynamism and specificity during host-microbe interactions. We also provide Hidden Markov Model profiles of TE domains for future manual curation of uncharacterized sequences (https://github.com/jordana-olive/TE-manual-curation/tree/main).
Introduction
Arbuscular mycorrhizal fungi (AMF) are an ancient group of plant symbionts capable of colonizing thousands of different species. AMF provide vitamins and minerals from the soil to plants and receive carbohydrates and lipids in return 1. This relationship may have been established when plants first conquered the land 2, and there is evidence showing that AMF improve plant growth and ecosystem productivity 3. More recently, it was also reported that these symbionts can retain atmospheric carbon, indicating that they play a significant role in the global climate balance 4,5. Nonetheless, despite their relevance for plant fitness and long-term evolution, AMF show low morphological variability with no apparent plant specificity 6.
In addition to being ecologically and agriculturally important, AMF harbour peculiar cellular features. Their spores and hyphae always carry thousands of nuclei, and to date, no stage where one or two nuclei co-exist in one cell has ever been observed 7. Furthermore, despite their longevity, no sexual reproduction has been formally observed in these organisms. However, evidence has now shown that AMF strains can carry either one or two nuclear genotypes in their cells, a genetic characteristic found in sexually multinucleated strains of ascomycetes and basidiomycetes, suggesting that sexual reproduction does exist in these prominent symbionts 8,9.
The low morphological diversity of AMF masks remarkable differences in gene content and structure in this group. For example, species vary significantly in genome size and gene counts 10, and the variation in genome size is significantly correlated with the abundance of transposable elements (TE) 11. Recent studies based on chromosome-level assemblies of Rhizophagus irregularis also showed that this model species carries two different (A/B) compartments, which are reminiscent of the “two-speed genome” structure previously reported in plant pathogens 9,12. In AMF, the A-compartment carries most core genes and shows significantly higher gene expression, while the B-compartment has a higher density of repetitive DNA, as well as of secreted proteins and candidate effectors involved in the molecular dialogues between the partners of the mycorrhizal symbiosis 12.
TE are dispersed repetitive DNA sequences classified into retrotransposons (Class I), which use RNA molecules as intermediates for “copy-and-paste” transposition, and transposons (Class II), which spread by “cutting-and-pasting” the DNA 13. Each class is divided into orders and superfamilies based on pathways of transposition and phylogenetic relationships 14,15, and thus classifying TE is important to describe the evolution of the genome and infer its impact on the biology of any organism 16,17. For example, by modifying chromatin status and attracting transcription factors, these elements can also regulate gene expression 18.
In the model AMF R. irregularis, over 50% of the genome is composed of TE 9,12,19. Their higher abundance in the B-compartment is linked to higher rates of rearrangements 9, and it was proposed that elevated TE expression in germinating spores may lead to new expansions of TE in this AMF species 20. Despite recent findings, key questions regarding the diversity and evolution of TE, as well as their role in mycorrhizal interactions, remain unanswered. For example, approximately two thirds of TE remain unclassified, making it difficult to infer their function in AMF genome biology and evolution 20,21. Similarly, because analyses of TE expression have so far centred on germinating spores, it is unknown how these elements are controlled during host colonisation, and whether some show host-specific regulation. The present study addresses these questions by providing an improved, curated classification of TE families in all R. irregularis strains with chromosome-level assemblies and by investigating their expression among multiple hosts.
Material and methods
Curation and classification of TE families
We used chromosome-level assemblies of five homokaryotic (4401, A1, B3, C2 and DAOM197198) 12 and four heterokaryotic strains (A4, A5, G1 and SL1) 9 (Supplementary material S1) as a source to build repeat libraries. The manual curation for non-model species followed recommended guidelines 16,22. First, the repeat libraries were generated using RepeatModeler 2.0.3 22 in -LTRstruct mode, which detects Long Terminal Repeat sequences through LTRharvest and LTR_retriever. The libraries from all strains were merged to create a single reference for the manual curation.
The merged library was submitted to TEclass, which separates the sequences at the order level: nonLTR, LTR or DNA 23. This step helped us to distinguish orders with similar protein domains, such as DIRS (nonLTR) and Crypton (DNA). Tirvish, from GenomeTools (http://genometools.org/), was used to detect Terminal Inverted Repeats (TIR) in DNA transposons (order DNA/TIR) with the parameter -mintirlength 8. Hidden Markov Models (hmm) of domains specific to TE superfamilies or orders were generated from a combination of conserved regions described in Supplementary material S2. The sequences for each domain were first aligned using mafft 24, converted to Stockholm format using esl-reformat and finally submitted to hmmbuild (version 3.1b2) to generate the hmm profiles 25.
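The profile-building steps described above (alignment, Stockholm conversion, hmmbuild) can be sketched as a small Python helper. This is a minimal sketch under stated assumptions, not the script used in this study: the function names and output file names are hypothetical, and `mafft --auto` and `--informat afa` are assumed settings, since the text does not report the options used for mafft or esl-reformat.

```python
import subprocess
from pathlib import Path


def hmm_build_commands(domain_fasta, workdir="."):
    """Return (argv, stdout_file) pairs for one domain FASTA file.

    Output file names are derived from the input name (hypothetical).
    """
    stem = Path(domain_fasta).stem
    aligned = str(Path(workdir) / f"{stem}.aln.fasta")
    stockholm = str(Path(workdir) / f"{stem}.sto")
    profile = str(Path(workdir) / f"{stem}.hmm")
    return [
        # 1) multiple alignment; mafft writes the alignment to stdout
        (["mafft", "--auto", domain_fasta], aligned),
        # 2) aligned FASTA -> Stockholm; esl-reformat writes to stdout
        (["esl-reformat", "--informat", "afa", "stockholm", aligned], stockholm),
        # 3) build the profile; hmmbuild takes <hmmfile_out> <msafile>
        (["hmmbuild", profile, stockholm], None),
    ]


def run(commands):
    """Run each command, redirecting stdout to a file when one is given."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)
```

Calling `run(hmm_build_commands("Transib.fasta"))` would require mafft and HMMER/Easel on the PATH; separating command construction from execution makes the pipeline easy to inspect before running it.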
Lastly, we provide hmm profiles for detecting elements with reverse transcriptase (LINE, DIRS, PLE, LTR, Bel, Copia, Gypsy), specific transposase superfamilies (Academ, CMC, Ginger, KDZ, Kolobok, MULE, Merlin, Novsib, P, PIF-Harbinger, PiggyBac, Plavaka, Sola-1, Sola-2, Sola-3, Tc-Mariner, Transib, Zator and hAT) and tyrosine recombinase (DIRS and Crypton) (https://github.com/jordana-olive/TE-manual-curation/tree/main/TE-domains). The open reading frames (ORF) from the reference library were generated using getorf with 200 amino acids as the minimum size (https://www.bioinformatics.nl/cgi-bin/emboss/help/getorf). hmmsearch (version 3.1b2) was used to find the sequences carrying the TE domains from the step above. We removed sequences with domains from expanded gene families (Sel1, BTB, Kelch, Protein Kinase, TPR); the models are available at https://github.com/jordana-olive/TE-manual-curation/tree/main/expanded-genes.
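The domain-based filtering described above can be illustrated with a short Python sketch that reads hmmsearch tabular output (`--tblout`) and keeps consensus sequences with a TE domain hit but no hit to an expanded gene family domain. The function names are hypothetical, and the sketch assumes that ORF identifiers follow the getorf convention of appending `_<number>` to the source sequence name; it is an illustration of the logic, not the code used in this study.

```python
def tblout_targets(lines):
    """Collect target names from hmmsearch --tblout text.

    Comment lines start with '#'; the first whitespace-separated
    field of every other line is the target sequence name.
    """
    targets = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        targets.add(line.split()[0])
    return targets


def orf_to_consensus(orf_id):
    """Strip the trailing '_<n>' ORF index (assumed getorf naming)."""
    return orf_id.rsplit("_", 1)[0]


def keep_te_candidates(te_tblout, expanded_gene_tblout):
    """Consensus sequences with a TE domain and no expanded-gene domain."""
    with_te = {orf_to_consensus(t) for t in tblout_targets(te_tblout)}
    with_gene = {orf_to_consensus(t) for t in tblout_targets(expanded_gene_tblout)}
    return with_te - with_gene
```

In practice the two arguments would be open file handles on the `--tblout` files from the TE-domain search and the expanded-gene search, respectively.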
A transposon (DNA/TIRS) was kept in the final library when it was detected as DNA by TEclass, carried a transposase domain and a TIR sequence, and was between 1 kb and 17 kb long. Sequences with TIRs that were between 50 bp and 1 kb long and lacked transposase domains were classified as MITEs. Because certain elements are similar and closely related, after the hmmsearch step we constructed a phylogenetic tree with the putative LINE, DIRS, PLE, LTR, Ginger, Crypton and Maverick sequences. The protein sequences of these orders were aligned using mafft 24, then submitted to RAxML 26 to generate the phylogenetic tree, using the PROTGAMMA model with 1000 bootstrap replicates. The best-resolved tree was visualized using ggtree in R 27. The final classification of these sequences was based on the relationships in this tree (Supplementary Figure S1).
In summary, a known sequence (with a classification from RepeatModeler) was kept if its order was confirmed by TEclass and it carried the corresponding transposition domain. A new sequence was classified according to its TEclass order, the presence of a transposition domain and its relationship to a known sequence in the phylogenetic tree. The final library and models are available at https://github.com/jordana-olive/TE-manual-curation/tree/main and can be applied to any other reference for custom TE characterization.
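The retention rules for TIR-bearing sequences stated above can be written as a small decision function. This is a sketch of the stated criteria only: the function name and argument names are hypothetical, the size limits are treated as inclusive (an assumption), and the lower MITE bound is read as 50 bp.

```python
def classify_tir_element(teclass_order, has_transposase_domain, has_tir, length_bp):
    """Return 'DNA/TIRS', 'MITE' or None (not retained by these rules)."""
    if not has_tir:
        return None
    # Autonomous DNA transposon: TEclass says DNA, transposase domain, 1-17 kb
    if (teclass_order == "DNA" and has_transposase_domain
            and 1000 <= length_bp <= 17000):
        return "DNA/TIRS"
    # MITE: TIRs present, no transposase domain, 50 bp to 1 kb
    if not has_transposase_domain and 50 <= length_bp <= 1000:
        return "MITE"
    return None
```

For example, a 2.5 kb sequence called DNA by TEclass with a transposase domain and TIRs would be retained as DNA/TIRS, whereas a 300 bp TIR-bearing sequence without a transposase domain would be labelled a MITE.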
Repeat landscapes of genomes and compartments
We ran RepeatMasker (version 4.1.2-p1) 28 on all strains with the parameters -a -s -lib, using the curated library. The repeat landscapes were generated with modified versions of the createlandscape.pl and calculedivergence.pl scripts provided with RepeatMasker. Landscapes were also generated for the available A- and B-compartment annotations 9,12. We applied a paired t-test in R to the TE percentages at each Kimura divergence point in the A- and B-compartments. A difference between landscapes was considered significant when p < 0.05.
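The paired comparison of A- and B-compartment landscapes can be illustrated with the following sketch, which computes the paired t statistic from TE percentages matched by Kimura divergence bin. The analysis in this study was run in R; this Python version is only an illustration, the function name is hypothetical, and the p-value (which requires the t distribution with the returned degrees of freedom) is deliberately left out to keep the example dependency-free.

```python
import math
import statistics


def paired_t_statistic(a_percent, b_percent):
    """Paired t statistic and degrees of freedom for matched bins.

    a_percent[i] and b_percent[i] are TE percentages in the same
    Kimura divergence bin for the A- and B-compartment.
    """
    if len(a_percent) != len(b_percent):
        raise ValueError("both landscapes need the same divergence bins")
    diffs = [a - b for a, b in zip(a_percent, b_percent)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two bins are required")
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

A negative statistic here means that, bin by bin, the B-compartment carries the larger TE percentage.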
Transposable element and gene expression analysis
We evaluated the expression of genes and TE using available RNAseq data from different tissues (germinating spores 20, intraradical mycelium, arbuscules 29 and extraradical mycelium 30), and from mycorrhized roots of different plant hosts colonized by DAOM197198 (Allium schoenoprasum, Medicago truncatula, Nicotiana benthamiana 29 and Brachypodium distachyon 31). The data accessions are listed in Supplementary material S3. The reads were filtered using Trimmomatic 32 and aligned to the DAOM197198 reference genome using Bowtie2 33. Read counts were obtained with TEtranscripts 34, guided by the TE annotation performed in this study and the gene annotation from 12. Using DESeq2 35, differential expression was calculated for each host condition and tissue, with germinating spores as the control. A transcript was considered differentially expressed when its padj (adjusted p-value) was equal to or lower than 0.05.
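The significance criterion above can be applied to a DESeq2 results table as in the following sketch. It assumes rows exported with the standard DESeq2 column names `log2FoldChange` and `padj`, plus an identifier column called `id` (a hypothetical name), and it treats "upregulated" as padj <= 0.05 with a positive log2 fold change relative to the spore control, which is an assumption about how upregulation was counted.

```python
def upregulated(rows, alpha=0.05):
    """Return ids of features with padj <= alpha and log2FoldChange > 0.

    rows: iterable of dicts, e.g. from csv.DictReader on a DESeq2 export.
    Rows with missing (NA) padj or fold change are skipped, as DESeq2
    reports NA for features removed by independent filtering.
    """
    missing = ("NA", "", None)
    ids = []
    for row in rows:
        padj, lfc = row.get("padj"), row.get("log2FoldChange")
        if padj in missing or lfc in missing:
            continue
        if float(padj) <= alpha and float(lfc) > 0:
            ids.append(row["id"])
    return ids
```

Counting `len(upregulated(rows))` per condition would give per-tissue or per-host totals comparable to those reported in the Results.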
Transposable element nearby gene correlation
For local copy expression, we used available RNAseq data generated through Oxford Nanopore Technology (ONT) sequencing 19. The long ONT-RNAseq reads were filtered and trimmed using pychopper (https://github.com/epi2me-labs/pychopper); 7.98% of reads did not pass the quality parameters and were discarded. Nucleotide correction was performed based on self-clustering using isONTcorrect 36. The filtered and corrected reads were aligned to the DAOM197198 genome using hisat2 and annotated using stringtie 37, guided by the TE annotation performed in this study and the gene annotation from 12. Stringtie was also used to generate the transcript counts used in the expression analysis. To detect TE upstream of genes, we extracted the genomic regions up to 1000 bp upstream of each transcript start position and intersected them with the TE annotations using bedtools 38. The correlation function in R, with the Pearson method, was used to determine the association between genes and their upstream TE.
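The upstream-window and correlation steps can be sketched in plain Python as follows. This is an illustration only: the study used bedtools and R, the function names are hypothetical, coordinates are assumed to be 0-based half-open (BED-style), and the window is assumed to be strand-aware, a detail the text does not state.

```python
import math


def upstream_window(start, end, strand, size=1000):
    """Window of `size` bp upstream of the transcript start.

    For '+' transcripts the window precedes `start` (clipped at 0);
    for '-' transcripts it follows `end`. Clipping to the chromosome
    end is not handled in this sketch.
    """
    if strand == "+":
        return max(0, start - size), start
    return end, end + size


def overlaps(a, b):
    """True when two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

Each gene whose window overlaps a TE copy contributes one (gene expression, TE expression) pair, and `pearson` is then computed over the pairs within each compartment.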
Results
A curated database reveals new TE families in Rhizophagus irregularis
Using RepeatModeler and RepeatMasker, recent analyses of R. irregularis chromosome-level datasets indicated that strains of this species carry an average of 50% of TE 9,12. However, only one third of their repeat content could be classified, and thus on average 30% of all available genomes are composed of unclassified TE sequences 9,12,19,21. To address this, we used chromosome-level assemblies from five homokaryotic and four heterokaryotic strains to generate curated repeat libraries. When all genomes are considered, out of a total of 9,257 TE sequences identified by RepeatModeler, only 2,369 (approx. 25%) can be considered well-defined, bona fide, non-redundant consensus sequences harboring transposition domains. The notable reduction in TE numbers is due to non-curated datasets containing highly degenerated TE without domains (relics), and repeats that cannot be classified with current knowledge of TE evolution.
In our final curated library, 1,458 sequences belong to families previously classified by RepeatModeler (K, Figure 1a), while 636 sequences represent newly identified families (N, Figure 1a) of the following orders: SINE, LINE, LTR, DNA/TIRS, DNA/Crypton, Maverick and RC/Helitron (Figure 1a).
Using the curated library to annotate TE in the DAOM197198 genome, sequences belonging to these orders increased in number by 4-fold on average, and similar results were obtained for all investigated R. irregularis strains (Figure 1b). We also assessed the influence of the manually curated library on the repeat landscape of the DAOM197198 genome (see Figures 1c-e). Following TE annotation with the curated library, the percentage of the DAOM197198 genome represented by classified TE increased from 12% (Figure 1c-d) to 36% (Figure 1e).
The newly curated R. irregularis TE library now includes novel families of non-LTR retrotransposons (SINE, LINE/CR1-Zenon, LINE/L1-Tx1, LINE/L1-R1, LINE/R2-Hero, LINE/RTE-BovB, LINE/I), which were detected based on phylogenetic analyses (Supplementary Figure S1). A novel superfamily of DNA transposons (Transib) and novel families belonging to known AMF TE superfamilies were also detected, including Crypton, CMC, hAT-19, MULE, Sola-2, Tc1-Mariner and Plavaka (Figure 2).
The Plavaka family, which is part of the CACTA/CMC/EnSpm superfamily, is particularly prevalent in DAOM197198, A1, B3, C2, A5 and G1. This family has already been identified in fungi 39,40, but is not deposited in publicly available repeat databases, which is why RepeatModeler could not detect it.
Repeat landscapes differ among and within R. irregularis strains
In the R. irregularis isolate DAOM197198, TE are differently mobilized and retained across the genome 20, but we found that these patterns can differ among R. irregularis strains with chromosome-level assemblies. Specifically, in most strains TE expansions are recent, as highlighted by the high number of elements with Kimura substitutions ranging from 0 to 5, supporting the finding that new TE expansion bursts exceed older TE insertions in DAOM197198 20 (Figure 3). However, other strains can have different TE distributions. For example, the strains A4 and SL1 carry a larger proportion of TE with older Kimura substitution levels, indicating strain-specific patterns of TE emergence, evolution, and retention. Remarkably, the genome of C2 has a much larger proportion of younger, and thus likely more active, TE, and it is plausible that this feature is linked to the larger genome size of this strain compared to its relatives: the C2 genome is 162 Mb, compared to an average of 147 Mb for other R. irregularis strains (Figure 3).
The distribution of TE based on Kimura substitution levels also differs significantly between the A- and B-compartments in all strains (Figure 4). Specifically, the proportion of TE with Kimura substitutions between 10 and 20 (i.e., older TE) is always higher in the B-compartment (p<0.05), indicating that these elements are maintained over time at higher levels in this portion of the genome. In contrast, old and degenerated TE appear to be more rapidly eliminated in the A-compartment (Figure 4).
TE and gene expression is positively correlated within the B-compartment
The localization of TE near genes can impact their expression 18,41. We tested whether the presence of TE upstream of genes is linked with gene expression in the two compartments by identifying TE located within 1000 nucleotides upstream of genes and using available long-read RNA data 19. We found only a weak correlation between the expression of TE and coding genes located in the A-compartment (r=0.21, p<0.0001) (Figure 4f); however, a strong, positive and significant correlation exists for genes and TE in the B-compartment (r=0.84, p<0.0001) (Figure 4g). This finding suggests that TE and genes in the B-compartment are co-expressed.
Transposable elements are significantly more expressed in colonized roots compared to extraradical mycelium
To obtain additional insights into the biology of TE, we investigated their regulation during host colonization using R. irregularis DAOM197198 RNA-seq data from multiple tissues, including microdissected cells of arbuscules (ARB) and intraradical mycelium (IRM), and extraradical mycelium (ERM) (Figure 5a) in symbiosis with Medicago truncatula roots (Figures 5b-d).
The number of upregulated TE is more than four times higher in IRM and ARB than in extraradical mycelium in the same host (Figure 5b-d). TE significantly upregulated in colonized tissue (ARB and IRM) include the DNA/TIRS, LTR, LINE and SINE orders; among them, LTR/Gypsy, LINE, CMC-EnSpm, Plavaka, hAT-Tag1, hAT-Ac and Helitron are the superfamilies with the largest number of expressed families (Supplementary Material S4).
One mechanism to control TE mobilization is through RNAi and AGO proteins 42. The R. irregularis DAOM197198 genome contains 25 AGO domain proteins 43, and we found that ERM samples express significantly more (p < 0.05) AGO genes compared to the other conditions (Figure 5b and f). In stark contrast, laser-dissected cells of arbuscules and intraradical mycelium have significantly reduced AGO expression, which is the opposite of the significantly higher expression of TE in these conditions (p < 0.05) (Figure 5b and f).
TE regulation changes with different hosts and correlates with host genome size
Analyses of RNA-seq data from roots of multiple hosts colonized by DAOM197198 (Allium schoenoprasum, Medicago truncatula, Nicotiana benthamiana and Brachypodium distachyon) also reveal that TE expression differs significantly among hosts, with some families being expressed in only one of the four hosts investigated (Figure 6 and Supplementary Material S4). For example, a total of 404 families are upregulated during colonization of A. schoenoprasum roots (Figure 6a), including families from both Class II elements (e.g., Sola-1, hATm, Academ-1, DIRS) and Class I elements (e.g., CR1-Zenon and DIRS) that are only upregulated with this host. In N. benthamiana and M. truncatula roots, the symbiont upregulates a smaller number of TE families compared to A. schoenoprasum, 251 and 197 respectively, while B. distachyon was the condition with the lowest TE upregulation overall (116 families) (Figure 6d).
The differences in TE expression among hosts are not linked to variability in the expression of AMF AGO genes; i.e., hosts with the highest TE expression do not always have low AGO expression, and vice versa. However, a significant and positive correlation (r = 0.91, p = 0.001) exists between the repeat content of the host genome and the number of overexpressed families in the DAOM197198 symbiont. For example, A. schoenoprasum is the host with the most TE, and its colonized roots have the highest AMF TE upregulation. In contrast, B. distachyon (272 Mb) has the lowest TE content and has significantly lower TE upregulation in the symbiont.
Discussion
An improved view of TE family diversity and evolution in a model plant symbiont
Through manual curation of R. irregularis repeat libraries, we improved the proportion of annotated families in this model plant symbiont and uncovered novel sequences representing the largest proportion of its repetitive sequences. Our work also revealed how R. irregularis strains differ in how they maintain or remove TE over time within the two genome compartments. For example, our findings indicate that the largest genome size (strain C2) derives from a combination of a higher rate of TE emergence and the retention of these elements. We also uncovered notable differences in how TE evolve within each strain, with some (A1, B3) having much higher proportions of very young TE compared to others (SL1, A4) that carry similar levels of Kimura substitution rates, indicative of early repeat degeneration and fewer recent expansions.
TE retention rates are different between compartments
Investigating the degree to which TE accumulate mutations over time revealed a significant distinction between the A- and B-compartments. Specifically, all A-compartment landscapes show a continuing invasion of novel/young TE insertions, and the high methylation present in this compartment 12 and/or purifying selection might be needed for their control and rapid removal, as evidenced by the lower TE density in this compartment. In contrast, these insertions accumulate in the B-compartment, leading to a notable inflation of these elements over time, as shown by the stable TE density along the axis that defines the Kimura substitution rates. The accumulation of TE in the B-compartment might be linked with their domestication 44,45 and/or the emergence of new functions, a view supported by the significant positive correlations we observed between the expression of TE and genes, and by similar findings from plant pathogens 46,47.
TE regulation and evolution further underpin similarities between AMF and known fungal pathogens
Obvious similarities between the genomes of AMF and those of plant pathogens have been known for some time 48,49. These include enrichments in transposable elements, and genomes subdividing into highly diverging regions dense in effector genes and TE, and more conserved ones composed of core genes and low repeat density.
In the plant pathogen Verticillium dahliae, TE are often located in proximity to highly expressed pathogenicity-related genes within fast-evolving adaptive genomic regions 50. These regions are reminiscent of R. irregularis B-compartments 12,49, which are also enriched in TE and secreted proteins that promote symbiosis with different hosts 51. As such, the significant correlation we observed between TE and genes specific to the B-compartment may mirror identical processes in AMF and plant pathogens.
The increased upregulation of genes in close proximity to TE during the colonization stages in the B-compartment could also indicate that TE control the de-repression of those regions of the genome 52. Indeed, it has been proposed that TE-effector regulation is well-timed; i.e., both are expressed during infection and repressed in the absence of the host 47. In filamentous fungi, the variability of TE can also allow escape from recognition by the plant immune system 46,47, and it is thus possible that in AMF the expression of transposable elements could aid plant-symbiont communication during colonization.
What drives TE upregulation and host-specificity during colonization?
TE are actively expressed in germinating spores 20, but our study shows that their expression in colonized roots is much higher still. Within this context, we found that the AGO genes, which are known TE regulators 53, are significantly more expressed in extraradical mycelium compared to colonized roots 43, suggesting that RNAi is one of the key factors implicated in the regulation of TE across stages of the mycorrhizal symbiosis (i.e., downregulation in extraradical mycelium and upregulation in planta).
Notably, AMF AGO expression did not vary significantly among hosts, and thus other factors could be responsible for the host-specific variation in the TE expression we observed. One intriguing possibility is that TE expression in the hosts directly or indirectly influences TE expression in the AMF, as seen in multiple plant-microbe interactions in cross-talk regulations 54–57. In support of this view, we found a positive correlation between TE abundance in the host and TE expression in AMF. With mounting evidence of molecular cross-communication between the mycorrhizal partners, including RNA from the fungi interacting with mRNA from the hosts 43,58,59, it is likely that molecular dialogues between hosts also result in the increased TE expression we observed in the fungal symbiont.
Competing interests
The authors declare no competing interests.
Author contributions
JINO and NC designed the study. JINO performed the experiments and analysis of the data. JINO and NC wrote the paper. NC mentored and supervised all the processes.
ORCID
Jordana Inácio Nascimento Oliveira https://orcid.org/0000-0003-2511-1746
Nicolas Corradi https://orcid.org/0000-0002-7932-7932
Data availability
The genomes and RNAseq used in this study are described in Supplementary material S1 and S3. The ONT RNAseq can be accessed at SRR21968700.
Supporting information
Supplementary file S1 – Chromosome level assemblies of R. irregularis strains used in this study.
Supplementary file S2 – Domains used to construct the hmm models.
Supplementary file S3 – RNAseq data and alignments stats used in this study.
Supplementary file S4 – Number of upregulated TE in each condition.
Supplementary figure S1 – Phylogenetic tree of TE from different orders with similar domains. A high-resolution version of the figure can be downloaded from https://github.com/jordana-olive/TE-manual-curation/blob/main/Supplementary-figure-S1.png
Acknowledgements
Our research is funded by the Discovery program of the Natural Sciences and Engineering Research Council (RGPIN2020-05643) and a Discovery Accelerator Supplements Program (RGPAS-2020-00033). NC is a University of Ottawa Research Chair in Microbial Genomics. JINO is funded by the Mitacs Accelerate Program (IT16902) and a Discovery Accelerator Supplements Program (RGPAS-2020-00033).